
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
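
As a rough illustration of what a benchmark in such a registry consumes, the sketch below writes evaluation samples in the JSONL chat format that match-style evals compare model output against; the file name and sample content here are placeholders, not part of any shipped eval.

```python
import json

# Each JSONL line pairs a chat-style "input" with an "ideal" answer that a
# match-style eval checks the model's completion against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Complete the phrase as concisely as possible."},
            {"role": "user", "content": "Once upon a "},
        ],
        "ideal": "time",
    },
]

# Placeholder path: registered evals normally keep their sample data inside the registry.
with open("my_eval_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```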

Open-source framework for evaluating and testing AI and LLM systems for performance, bias, and security issues.

Phoenix is an open-source AI observability platform for experimentation, evaluation, and troubleshooting.
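
For orientation, a minimal sketch of starting the local Phoenix app from Python, assuming the `arize-phoenix` package is installed; instrumented LLM calls can then send traces to it for inspection.

```python
import phoenix as px

# Launch the local Phoenix observability UI; the returned session exposes the
# URL where traces, evaluations, and experiments can be inspected.
session = px.launch_app()
print(f"Phoenix UI running at {session.url}")
```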

A unified toolkit for the automated evaluation of large language models (LLMs).

An open-source project for comparing two LLMs head-to-head on a given prompt, with a focus on backend integration.

A comprehensive survey on benchmarks for Multimodal Large Language Models (MLLMs).

A serverless tool for converting various file types to Markdown using Cloudflare Workers and AI.

A Chinese legal dialogue language model designed to provide professional and reliable answers to legal questions.

Automatable GenAI Scripting: assemble prompts for LLMs programmatically using JavaScript.

A custom AI assistant platform to speed up your work.

Build production-ready AI agents in both Python and TypeScript.

A powerful framework for building realtime voice AI agents.