FlashMLA is an efficient MLA (Multi-head Latent Attention) decoding kernel optimized for Hopper GPUs, delivering significant performance improvements.
A GitHub repository exploring LLMs as coding tutors with a focus on dialogue tutoring agents.
HeadInfer is a memory-efficient inference framework for large language models that reduces GPU memory consumption.
A knowledge-sharing platform on large language models, covering job-interview preparation and general understanding.
LLM API management & key redistribution system for various AI models, supporting unified API access and easy deployment.
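The "key redistribution" idea above — many issued downstream keys mapped onto a pool of real upstream provider keys, with quotas enforced at the boundary — can be sketched as a small routing table. This is an illustrative sketch only; the provider names, key strings, and quota scheme are assumptions, not taken from the repository.

```python
# Toy sketch of API key redistribution. All names and keys are illustrative.

# Upstream providers and their real credentials.
UPSTREAMS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key": "sk-real-openai"},
    "anthropic": {"base_url": "https://api.anthropic.com", "key": "sk-real-anthropic"},
}

# Keys issued to users; each carries a request quota and one upstream target.
ISSUED_KEYS = {
    "user-key-1": {"upstream": "openai", "quota": 100},
    "user-key-2": {"upstream": "anthropic", "quota": 50},
}

def route(api_key):
    """Resolve a downstream key to (base_url, real_key), enforcing its quota."""
    entry = ISSUED_KEYS.get(api_key)
    if entry is None or entry["quota"] <= 0:
        raise PermissionError("invalid or exhausted key")
    entry["quota"] -= 1  # simple per-request accounting
    upstream = UPSTREAMS[entry["upstream"]]
    return upstream["base_url"], upstream["key"]
```

A real gateway would sit in front of HTTP traffic and rewrite the Authorization header with the resolved upstream key; the table lookup above is the core of that dispatch.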
Wan2.1 is an open suite of advanced video generative models, enabling innovative video creation and editing.
Faster Whisper transcription with CTranslate2.
A high-throughput and memory-efficient inference and serving engine for LLMs.
Replace OpenAI GPT with another LLM in your app by changing a single line of code.
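The "single line" being changed is typically the API base URL: any OpenAI-compatible server accepts the same chat-completions request body, so swapping providers means changing one string. A minimal stdlib sketch of that pattern (the local endpoint and model names are assumptions for illustration):

```python
import json

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-style chat-completions request as (url, JSON body)."""
    url = f"{base_url.rstrip('/')}/chat/completions"
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    )
    return url, body

# Swapping OpenAI for a local OpenAI-compatible server is one changed line:
url_a, body_a = build_chat_request("https://api.openai.com/v1", "gpt-4o-mini", "Hi")
url_b, body_b = build_chat_request("http://localhost:8000/v1", "my-local-llm", "Hi")
```

The request schema stays identical in both cases, which is what makes the drop-in replacement possible.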
The free, open-source alternative to OpenAI, Claude and others. Self-hosted and local-first.
SGLang is a fast serving framework for large language models and vision language models.