llama-swap
llama-swap is a lightweight, transparent proxy server designed for automatic model swapping with llama.cpp or any local OpenAI-compatible server. Written in Go, it is easy to install and configure, requiring only a single binary and a simple YAML configuration file.
Key Features:
- Automatic Model Swapping: Starts the correct upstream server based on the `model` field in each request, stopping the previously running one as needed.
- Simple Configuration: Uses a single YAML file for configuration, making it user-friendly.
- Multiple Model Support: Can handle multiple models simultaneously through profiles.
- Docker Support: Easily deployable using Docker, with pre-built images available.
- Health Monitoring: Includes health checks and logging capabilities for monitoring server status.
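The features above center on the YAML configuration file. A minimal sketch of what such a configuration might look like is shown below; the model names, file paths, and timeout value are hypothetical, and the exact schema (key names, the `${PORT}` placeholder) should be checked against the project's README:

```yaml
# Hypothetical llama-swap configuration sketch.
models:
  "qwen-coder":
    # Command llama-swap runs to start this model's upstream server.
    cmd: llama-server --port ${PORT} -m /models/qwen-coder.gguf
  "llama-8b":
    cmd: llama-server --port ${PORT} -m /models/llama-8b.gguf

# Profiles group models that may be kept running together.
profiles:
  coding:
    - "qwen-coder"
    - "llama-8b"
```

A request naming `qwen-coder` would cause the proxy to launch that command, wait for the server to become healthy, and then forward the request.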
Benefits:
- Flexibility: Works with any OpenAI-compatible server, not just llama-server.
- Performance Optimization: Supports speculative decoding (e.g., pairing a small draft model with a larger target model) for improved inference speed.
- Resource Management: Provides control over system resources and automatically unloads idle models after a configurable timeout.
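The timeout-based unloading can be expressed per model in the configuration. The snippet below is a sketch under the same assumptions as before (hypothetical model name and path; the `ttl` key and its unit should be verified against the project's documentation):

```yaml
models:
  "llama-8b":
    cmd: llama-server --port ${PORT} -m /models/llama-8b.gguf
    # Hypothetical: unload the model after 300 seconds with no requests,
    # freeing VRAM/RAM until the next request for it arrives.
    ttl: 300
```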
Highlights:
- Supports various OpenAI API endpoints including completions, chat completions, embeddings, and more.
- Easy to deploy on bare metal or via Docker, with pre-built binaries available for multiple operating systems.
- Community-driven with active contributions and updates.
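Since the proxy speaks the standard OpenAI API, clients address it like any OpenAI-compatible endpoint. The sketch below builds a chat-completion request body; the base URL and model name are assumptions to adjust for your setup, and the `curl` call is left commented out because it requires a running llama-swap instance:

```shell
# Hypothetical base URL and model name; adjust to your deployment.
BASE=http://localhost:8080
BODY='{"model":"qwen-coder","messages":[{"role":"user","content":"Say hi"}]}'

# Confirm the payload is valid JSON before sending it.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"

# Uncomment once llama-swap is running; the "model" field tells the proxy
# which upstream server to start (or reuse) before forwarding the request.
# curl -s "$BASE/v1/chat/completions" -H 'Content-Type: application/json' -d "$BODY"
```

The same pattern applies to the other endpoints mentioned above (completions, embeddings, etc.): only the path and body shape change, while routing is always driven by the `model` field.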