NiuTrans/Classical-Modern
NiuTrans/Classical-Modern is a parallel corpus of Classical Chinese (古文) and Modern Chinese texts. The project aims to provide a rich resource for researchers and developers working on Chinese language processing.
Key Features:
- Extensive Corpus: Covers 327 classical works written in Classical Chinese.
- Parallel Data: Offers sentence-level aligned bilingual data, with a total of 972,467 sentence pairs.
- Structured Organization: Texts are organized by chapters and sections, making it easy to navigate and access specific works.
- Data Processing Scripts: Provides data processing scripts to support reproducibility and ease of use.
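As a rough illustration of how sentence-level aligned data can be consumed, the sketch below pairs two plain-text files line by line. It assumes one sentence per line, with line i of the Classical file aligned to line i of the Modern file. The file names `source.txt` and `target.txt`, the function `read_pairs`, and the two sample sentences are hypothetical stand-ins written for this example; check the repository for the actual file names and layout before adapting it.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def read_pairs(source_path: Path, target_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (classical, modern) pairs from two line-aligned UTF-8 files.

    Assumption: both files hold one sentence per line and have the same
    number of lines. This layout is a guess, not taken from the corpus docs.
    """
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        src_lines = [line.strip() for line in src]
        tgt_lines = [line.strip() for line in tgt]

    # Refuse to pair files that cannot be aligned one-to-one.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )

    for classical, modern in zip(src_lines, tgt_lines):
        # Skip pairs where either side is empty.
        if classical and modern:
            yield classical, modern


# Demo on throwaway files; the sentences are illustrative, not corpus data.
with tempfile.TemporaryDirectory() as tmp:
    source_file = Path(tmp) / "source.txt"  # hypothetical file name
    target_file = Path(tmp) / "target.txt"  # hypothetical file name
    source_file.write_text("学而时习之\n温故而知新\n", encoding="utf-8")
    target_file.write_text(
        "学了之后时常温习\n温习旧知识从而得到新的理解\n", encoding="utf-8"
    )
    pairs = list(read_pairs(source_file, target_file))

for classical, modern in pairs:
    print(f"{classical}\t{modern}")
```

Raising an error on a line-count mismatch, rather than silently truncating with `zip`, makes misaligned files visible early. That matters for parallel data, where a single dropped line shifts every pair after it.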
Benefits:
- Research Resource: Ideal for linguists, historians, and AI researchers interested in Chinese language studies.
- Open Source: Freely available and open to contributions and improvements from the community.
- Detailed Documentation: Includes comprehensive documentation on data sources and processing methods.
Highlights:
- The corpus is sourced from the internet and spans a wide range of texts.
- Data is organized to preserve the original order of the Classical texts.
- Contributions from community members enhance the quality and breadth of the corpus.