
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
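  
  If this entry refers to OpenAI's `evals` repository, a new benchmark is typically added by subclassing the framework's `Eval` class and registering it in the YAML registry. The sketch below follows the style of the project's basic exact-match eval; exact class and helper names may differ across versions, so treat it as an approximation rather than the framework's definitive API.

  ```python
  # Rough sketch of a custom exact-match eval in the style of openai/evals.
  # Names such as `test_samples.jsonl` are placeholders; verify helper names
  # against the version of the repository you are using.
  import evals
  import evals.metrics


  class Match(evals.Eval):
      def __init__(self, test_jsonl: str, **kwargs):
          super().__init__(**kwargs)
          self.test_jsonl = test_jsonl

      def eval_sample(self, sample, rng):
          # Each sample is expected to hold an input prompt and an ideal answer.
          prompt = sample["input"]
          result = self.completion_fn(prompt=prompt, max_tokens=32)
          sampled = result.get_completions()[0]
          # Record whether the model output matches the expected answer.
          evals.record_and_check_match(prompt, sampled, expected=sample["ideal"])

      def run(self, recorder):
          samples = evals.get_jsonl(self.test_jsonl)
          self.eval_all_samples(recorder, samples)
          return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
  ```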

Open-source framework for evaluating and testing AI and LLM systems for performance, bias, and security issues.

Phoenix is an open-source AI observability platform for experimentation, evaluation, and troubleshooting.
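  
  As a quick orientation, Phoenix can be started locally from Python and then receive traces from an instrumented LLM application. A minimal sketch, assuming the `arize-phoenix` package is installed (`pip install arize-phoenix`):

  ```python
  # Launch the local Phoenix UI; instrumented apps can then export traces to it.
  import phoenix as px

  session = px.launch_app()  # serves the UI locally (http://localhost:6006 by default)
  print(session.url)

  # Traces from an LLM application would be sent to this endpoint via
  # OpenTelemetry / OpenInference instrumentation (not shown here).
  ```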

A unified toolkit for automatic evaluations of large language models (LLMs).

An open-source project for comparing two LLMs head-to-head with a given prompt, focusing on backend integration.
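  
  Since the project itself is not named here, the following is only a generic sketch of what such a head-to-head backend does: send the same prompt to two models and return both completions for comparison. It assumes the official `openai` Python client; the model names are placeholders, not the project's actual configuration.

  ```python
  # Generic head-to-head comparison sketch, assuming the `openai` Python client
  # and an OPENAI_API_KEY in the environment. Model names are illustrative.
  from openai import OpenAI

  client = OpenAI()


  def compare(prompt: str, model_a: str = "gpt-4o-mini", model_b: str = "gpt-3.5-turbo") -> dict:
      """Send the same prompt to two models and return their answers side by side."""
      answers = {}
      for model in (model_a, model_b):
          response = client.chat.completions.create(
              model=model,
              messages=[{"role": "user", "content": prompt}],
          )
          answers[model] = response.choices[0].message.content
      return answers


  if __name__ == "__main__":
      for model, answer in compare("Explain what an LLM benchmark is in one sentence.").items():
          print(f"--- {model} ---\n{answer}\n")
  ```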

A study evaluating geopolitical and cultural biases in large language models through dual-layered assessments.

A Chinese legal dialogue language model designed to provide professional and reliable answers to legal questions.

Automatable GenAI Scripting for programmatically assembling prompts for LLMs using JavaScript.

Prompty simplifies the creation, management, debugging, and evaluation of LLM prompts for AI applications.

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.