Open-source LLM engineering platform for observability, metrics, evals, and prompt management.
An open-source model for advanced visual and text reasoning, pioneering multimodal reasoning with chain-of-thought (CoT).
A curated list of tools, datasets, demos, and papers for evaluating large language models (LLMs).
Sample notebooks and prompts for evaluating large language models (LLMs) and generative AI.
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Open-source framework for evaluating and testing AI and LLM systems for performance, bias, and security issues.
A unified toolkit for automatic evaluations of large language models (LLMs).
A study evaluating geopolitical and cultural biases in large language models through dual-layered assessments.
A comprehensive survey on benchmarks for Multimodal Large Language Models (MLLMs).