Explore by tags

Langfuse
Open-source LLM engineering platform for observability, metrics, evals, and prompt management.
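
As a quick illustration, the sketch below logs a single LLM call as a trace with one generation, assuming the v2-style Langfuse Python SDK and API keys in the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables; the trace name, user id, and model output are purely illustrative.

```python
# Minimal sketch: record one LLM call as a Langfuse trace + generation.
# Assumes the v2-style Python SDK and credentials provided via the
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables.
from langfuse import Langfuse

langfuse = Langfuse()  # reads keys and host from the environment

trace = langfuse.trace(name="demo-trace", user_id="user-123")
trace.generation(
    name="chat-completion",
    model="gpt-4o-mini",
    input=[{"role": "user", "content": "Summarize Langfuse in one line."}],
    output="Langfuse is an open-source LLM observability platform.",
)

langfuse.flush()  # send buffered events before the process exits
```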

Skywork-R1V
An open-source model that pioneers multimodal chain-of-thought (CoT) reasoning for advanced visual and text understanding.

Awesome-LLM-Eval
A curated list of tools, datasets, demos, and papers for evaluating large language models (LLMs).

LLM-Evaluation
Sample notebooks and prompts for evaluating large language models (LLMs) and generative AI.

LLM-eval-survey
The official repository for the survey paper "A Survey on Evaluation of Large Language Models".

Evals
A framework for evaluating LLMs and LLM systems, paired with an open-source registry of benchmarks.
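
For a sense of how an eval is wired up, the sketch below writes samples in the JSONL format consumed by the framework's built-in exact-match eval class; the eval name, file paths, and questions are assumptions made for illustration, and the registry YAML entry itself is only described in the comments.

```python
# Hedged sketch: generate samples.jsonl for an exact-match style eval.
# Each sample is {"input": [chat messages], "ideal": "expected answer"}.
# The eval name, directory, and questions below are illustrative.
import json
from pathlib import Path

samples = [
    {"input": [{"role": "system", "content": "Answer with only the number."},
               {"role": "user", "content": "What is 2 + 2?"}],
     "ideal": "4"},
    {"input": [{"role": "system", "content": "Answer with only the number."},
               {"role": "user", "content": "What is 7 * 6?"}],
     "ideal": "42"},
]

out = Path("my_arithmetic/samples.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# After pointing a registry YAML entry for the built-in match eval at this
# file, the eval is launched from the CLI, e.g.:
#   oaieval gpt-3.5-turbo my-arithmetic
```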

Giskard
Open-source framework for evaluating and testing AI and LLM systems for performance, bias, and security issues.
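
The sketch below runs Giskard's scan over a toy text-generation model, assuming the giskard Python package and a stand-in predict function; note that some scan detectors call an external LLM judge, so a model API key may be required to produce the full report.

```python
# Minimal sketch: scan a toy text-generation model for common LLM issues.
# The echo-style predict function stands in for a real LLM call.
import pandas as pd
import giskard

def predict(df: pd.DataFrame) -> list[str]:
    # Return one generated string per input row.
    return ["I cannot answer that." for _ in df["question"]]

model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="toy-assistant",
    description="A toy assistant used to demo the scan API.",
    feature_names=["question"],
)
dataset = giskard.Dataset(
    pd.DataFrame({"question": ["What is the capital of France?"]}),
    target=None,
)

report = giskard.scan(model, dataset)  # probes for bias, harmful output, injection, ...
report.to_html("scan_report.html")     # persist the findings for review
```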

Evalchemy
A unified toolkit for automatic evaluations of large language models (LLMs).

LLM-Bias-Evaluation
A study evaluating geopolitical and cultural biases in large language models through dual-layered assessments.

Evaluation-Multimodal-LLMs-Survey
A comprehensive survey on benchmarks for Multimodal Large Language Models (MLLMs).