The Best Your Ultimate AI Security Toolkit
Curated AI security tools & LLM safety resources for cybersecurity professionals
Curated AI security tools & LLM safety resources for cybersecurity professionals

Sample notebooks and prompts for evaluating large language models (LLMs) and generative AI.

The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Open-source framework for evaluating and testing AI and LLM systems for performance, bias, and security issues.

Phoenix is an open-source AI observability platform for experimentation, evaluation, and troubleshooting.

A unified toolkit for automatic evaluations of large language models (LLMs).

An open-source project for comparing two LLMs head-to-head with a given prompt, focusing on backend integration.

A study evaluating geopolitical and cultural biases in large language models through dual-layered assessments.

A comprehensive survey on benchmarks for Multimodal Large Language Models (MLLMs).

Open-source evaluation toolkit for large multi-modality models, supporting 220+ models and 80+ benchmarks.

Bare metal to production ready in mins; your own fly server on your VPS.