Kreuzberg

A text extraction library supporting PDFs, images, office documents and more.

Introduction

Kreuzberg is a Python library designed for efficient text extraction from various document formats, including PDFs, images, and office documents. It provides a unified interface for both synchronous and asynchronous text extraction, making it versatile for different use cases.
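The two calling styles can be sketched as follows. This is a minimal illustration, not the library's reference usage: it assumes the package exposes an async `extract_file` and a blocking `extract_file_sync`, and that each returns a result object with a `content` attribute holding the text. The helper names `extract_text_sync` and `extract_texts_async` and the file names are hypothetical. The extractor is passed in as an argument so the helpers can be exercised without the library installed.

```python
import asyncio
from typing import Any, Awaitable, Callable


def extract_text_sync(path: str, extract_sync: Callable[[str], Any]) -> str:
    """Blocking style: return the extracted text of one document."""
    return extract_sync(path).content


async def extract_texts_async(
    paths: list[str], extract: Callable[[str], Awaitable[Any]]
) -> list[str]:
    """Async style: extract several documents concurrently."""
    results = await asyncio.gather(*(extract(p) for p in paths))
    return [r.content for r in results]


def main() -> None:
    # Assumption: these names exist in the installed kreuzberg version.
    # Requires `pip install kreuzberg` and real files at these paths.
    from kreuzberg import extract_file, extract_file_sync

    print(extract_text_sync("report.pdf", extract_file_sync)[:200])
    texts = asyncio.run(extract_texts_async(["a.pdf", "b.docx"], extract_file))
    for text in texts:
        print(len(text))
```

Call `main()` once the package is installed. The blocking helper suits scripts and notebooks, while the async helper lets a service process several files without stalling its event loop.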

Key Features:
  • Wide Format Support: Extract text from PDFs, DOCX, RTF, TXT, EPUB, and more.
  • Multiple OCR Engines: Supports Tesseract, EasyOCR, and PaddleOCR for optimal text recognition.
  • Local Processing: No need for external API calls or cloud dependencies, ensuring privacy and speed.
  • Resource Efficient: Lightweight processing without GPU requirements.
  • Metadata Extraction: Retrieve document metadata alongside the extracted text.
  • Table Extraction: Utilize the GMFT library for extracting tables from documents.
  • Modern Python: Built with async/await, type hints, and a functional-first approach.
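
As a rough sketch of the metadata feature listed above, the snippet below summarizes one extraction result. It assumes the result object carries `content`, `mime_type`, and a mapping-like `metadata` attribute; these attribute names should be checked against the project documentation. The `summarize` helper is hypothetical and only reads whatever fields are present.

```python
from typing import Any


def summarize(result: Any) -> dict[str, Any]:
    """Collect the text length and metadata from one extraction result."""
    # Fall back to an empty dict if the result has no metadata attribute.
    metadata = dict(getattr(result, "metadata", None) or {})
    return {
        "mime_type": getattr(result, "mime_type", None),
        "characters": len(result.content),
        "metadata": metadata,
    }


def main() -> None:
    # Assumption: extract_file_sync exists in the installed kreuzberg version.
    # Requires `pip install kreuzberg` and a real file at this path.
    from kreuzberg import extract_file_sync

    print(summarize(extract_file_sync("report.pdf")))
```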

Benefits:
  • Simple and Hassle-Free: Clean API that just works without complex configuration.
  • Open Source: Released under the MIT license, encouraging contributions and community involvement.

Getting Started:

To install Kreuzberg, use the following command:

pip install kreuzberg

For comprehensive documentation, see the project's documentation site, hosted on GitHub Pages.
