3FS: High-Performance Distributed File System
3FS (Fire-Flyer File System) is a high-performance distributed file system tailored for AI training and inference workloads. It leverages modern SSDs and RDMA networks to provide a shared storage layer that simplifies the development of distributed applications.
Key Features:
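- Performance and Usability: Uses a disaggregated architecture that combines the throughput of thousands of SSDs and the network bandwidth of hundreds of storage nodes, so applications can access storage without regard to data locality.
- Strong Consistency: Implements Chain Replication with Apportioned Queries (CRAQ), which keeps replicas strongly consistent and makes application code easier to reason about.
- File Interfaces: Provides standard file interfaces through stateless metadata services backed by a transactional key-value store.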
- Diverse Workloads: Supports data preparation, dataloaders, and high-throughput parallel checkpointing.
- KVCache for Inference: Offers a cost-effective alternative to DRAM-based caching, with high throughput and significantly larger capacity.
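
To make the CRAQ bullet above more concrete, here is a minimal in-memory sketch of the general protocol idea. It is not 3FS's implementation: the `Node` and `Chain` classes and their method names are hypothetical, the chain lives in a single process, and writes to a key are assumed to commit in version order. A write is held as "dirty" on every replica before the tail, and becomes "clean" once the tail commits it. A read can be served by any replica, which answers locally if its newest version is clean and otherwise asks the tail which version is committed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One replica in the chain (hypothetical, in-memory only)."""
    name: str
    versions: dict = field(default_factory=dict)   # key -> {version: value}
    committed: dict = field(default_factory=dict)  # key -> newest clean version


class Chain:
    def __init__(self, names):
        # The sketch assumes a head and a distinct tail.
        assert len(names) >= 2
        self.nodes = [Node(n) for n in names]
        self.next_version = {}

    @property
    def tail(self):
        return self.nodes[-1]

    def begin_write(self, key, value):
        """Forward a write from the head to every replica before the tail.

        Those replicas now hold the new version as dirty (uncommitted).
        """
        version = self.next_version.get(key, 0) + 1
        self.next_version[key] = version
        for node in self.nodes[:-1]:
            node.versions.setdefault(key, {})[version] = value
        return version

    def commit(self, key, version):
        """Deliver the write to the tail, then acknowledge back up the chain.

        Assumes commits for a key arrive in version order.
        """
        value = self.nodes[-2].versions[key][version]
        self.tail.versions.setdefault(key, {})[version] = value
        for node in reversed(self.nodes):
            node.committed[key] = version
            # Older versions are no longer needed once this one is clean.
            node.versions[key] = {
                v: x for v, x in node.versions[key].items() if v >= version
            }

    def read(self, key, node_index):
        """Serve a read from any replica (the apportioned query)."""
        node = self.nodes[node_index]
        stored = node.versions.get(key)
        if not stored:
            return None
        latest = max(stored)
        if node.committed.get(key) == latest:
            return stored[latest]  # clean: answer locally
        # Dirty: ask the tail which version is committed and return that one.
        return stored.get(self.tail.committed.get(key))


chain = Chain(["head", "middle", "tail"])
chain.commit("a", chain.begin_write("a", "old"))
pending = chain.begin_write("a", "new")       # in flight, not yet at the tail
print([chain.read("a", i) for i in range(3)])
chain.commit("a", pending)
print([chain.read("a", i) for i in range(3)])
```

While the second write is in flight, every replica still returns the last committed value, which is the strong-consistency property the bullet refers to. A real deployment would also have to handle node failures, chain reconfiguration, and network messages, none of which are modeled here.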
Benefits:
- Simplifies the development of distributed applications.
- Efficiently manages large volumes of data and intermediate outputs.
- Optimizes the LLM inference process, reducing redundant computations.
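
The last benefit rests on a simple idea: if the key/value data computed for a token prefix is saved to shared storage, a later request with the same prefix can load it instead of recomputing it. The sketch below illustrates that idea with ordinary files only. It does not use any 3FS API: `FileKVCache`, `kv_for_prefix`, the `.kv` file naming, and the use of `pickle` are all assumptions made for illustration, and `root` is simply any directory (for example, a mount point of a shared file system).

```python
import hashlib
import pickle
from pathlib import Path


class FileKVCache:
    """Hypothetical prefix-keyed cache that stores each entry as one file."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, token_ids):
        # Derive a stable file name from the token prefix.
        digest = hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()
        return self.root / f"{digest}.kv"

    def get(self, token_ids):
        path = self._path(token_ids)
        if not path.exists():
            return None
        with path.open("rb") as f:
            return pickle.load(f)

    def put(self, token_ids, kv):
        path = self._path(token_ids)
        tmp = path.with_suffix(".tmp")
        with tmp.open("wb") as f:
            pickle.dump(kv, f)
        # Rename so readers do not observe a half-written entry. This is
        # atomic on local POSIX file systems; other file systems may differ.
        tmp.replace(path)


def kv_for_prefix(cache, token_ids, compute):
    """Return cached data for a prefix, computing and storing it on a miss."""
    kv = cache.get(token_ids)
    if kv is None:
        kv = compute(token_ids)  # the expensive step the cache avoids
        cache.put(token_ids, kv)
    return kv
```

Calling `kv_for_prefix` twice with the same `token_ids` invokes `compute` only once. A production cache would additionally need eviction, handling of concurrent writers to the same entry, and a serialization format suited to tensors.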
Highlights:
- Achieved an aggregate read throughput of approximately 6.6 TiB/s in stress tests.
- Sorted 110.5 TiB of data in under 31 minutes in the GraySort benchmark.
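
For a sense of scale, the two figures above can be turned into comparable rates with a little arithmetic. The sketch below uses only the numbers stated in the highlights. Because the sort time is given as an upper bound (under 31 minutes), the result is a lower bound on average sort throughput. The helper names are hypothetical.

```python
def min_throughput(total, max_duration):
    """Lower bound on average throughput when duration is an upper bound."""
    return total / max_duration


def tib_to_tb(tib):
    """Convert binary tebibytes (2**40 bytes) to decimal terabytes (10**12 bytes)."""
    return tib * 2**40 / 10**12


# GraySort highlight: 110.5 TiB in under 31 minutes.
print(f"sort: at least {min_throughput(110.5, 31):.2f} TiB/min")  # 3.56 TiB/min

# Read stress test highlight: about 6.6 TiB/s, expressed in decimal units.
print(f"read: about {tib_to_tb(6.6):.2f} TB/s")                   # 7.26 TB/s
```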
For more information, visit the GitHub repository.