Kreuzberg
Kreuzberg is a Python library for efficient text extraction from a range of document formats, including PDFs, images, and office documents. It provides a unified interface for both synchronous and asynchronous text extraction, so the same API fits simple scripts and async applications alike.
Key Features:
- Wide Format Support: Extract text from PDF, DOCX, RTF, TXT, EPUB, and other formats.
- Multiple OCR Engines: Supports Tesseract, EasyOCR, and PaddleOCR, so you can choose the engine best suited to your documents.
- Local Processing: No need for external API calls or cloud dependencies, ensuring privacy and speed.
- Resource Efficient: Lightweight processing without GPU requirements.
- Metadata Extraction: Retrieve document metadata alongside the extracted text.
- Table Extraction: Extract tables from documents using the GMFT library.
- Modern Python: Built with async/await, type hints, and a functional-first approach.
Benefits:
- Simple and Hassle-Free: A clean API that works out of the box, without complex configuration.
- Open Source: Released under the MIT license, encouraging contributions and community involvement.
Getting Started:
To install Kreuzberg, use the following command:
pip install kreuzberg
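Once installed, basic usage might look like the sketch below. It is a minimal illustration, not the official quick start: it assumes that kreuzberg exposes an async `extract_file` function and a blocking `extract_file_sync` function, and that the returned result carries the extracted text in a `content` attribute. The helper names `extract_text_sync` and `extract_texts`, and the `extractor` parameter, are hypothetical and exist only for this example. Check the documentation for the exact API of your installed version.

```python
import asyncio
from pathlib import Path
from typing import Any, Awaitable, Callable, Dict, Iterable, Optional, Union

PathLike = Union[str, Path]


def extract_text_sync(path: PathLike) -> str:
    """Blocking extraction of a single file (hypothetical helper)."""
    # Assumption: kreuzberg provides `extract_file_sync` and the result
    # exposes the extracted text as `.content`.
    from kreuzberg import extract_file_sync

    return extract_file_sync(path).content


async def extract_texts(
    paths: Iterable[PathLike],
    extractor: Optional[Callable[[Path], Awaitable[Any]]] = None,
) -> Dict[str, str]:
    """Extract several files concurrently; returns {file name: text}."""
    if extractor is None:
        # Imported lazily so the helper can also be driven by a stub.
        # Assumption: kreuzberg provides an async `extract_file`.
        from kreuzberg import extract_file as extractor

    files = [Path(p) for p in paths]
    # Schedule all extractions at once and wait for every result.
    results = await asyncio.gather(*(extractor(f) for f in files))
    # Assumption: each result exposes the extracted text as `.content`.
    return {f.name: r.content for f, r in zip(files, results)}
```

In a script, a single call such as `extract_text_sync("report.pdf")` is usually enough. Inside an async application, `await extract_texts(["report.pdf", "scan.png"])` lets the extractions overlap instead of running one after another. The file names here are placeholders.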
For comprehensive documentation, visit our GitHub Pages site.