Kepler Systems

We've released multilingual journalism datasets from the People's Archive of Rural India (PARI).

8 Languages

7,650 Total Articles

19.9M Total Tokens

51.5 MB Total Size

Why This Matters

Indian languages are among the world's most widely spoken yet technologically underrepresented. Hindi alone has over 600 million speakers, and Telugu, Tamil, and Marathi each have more than 70 million, yet these languages lack the high-quality datasets needed to train modern AI systems.

These PARI datasets help address this critical gap by providing:

Authentic voices: Real journalism capturing rural India's stories, not synthetic or translated content
Cultural context: Content grounded in the lived experiences of rural communities
Linguistic diversity: Coverage across multiple scripts (Devanagari, Perso-Arabic, Gurmukhi, Tamil, Telugu, Gujarati, Latin)
Quality over quantity: Professionally written, edited content from respected journalists

What is PARI?

The People's Archive of Rural India documents the lives, cultures, and stories of rural India through professional journalism. These stories, written in multiple Indian languages, capture rural life often overlooked in mainstream media.

Dataset Overview

We've compiled and processed journalism articles across 8 languages:

Hindi (हिन्दी) - 2.1M tokens, 976 articles
Urdu (اردو) - 2.4M tokens, 963 articles
Punjabi (ਪੰਜਾਬੀ) - 4.3M tokens, 978 articles
Marathi (मराठी) - 2.1M tokens, 974 articles
Tamil (தமிழ்) - 2.2M tokens, 975 articles
Telugu (తెలుగు) - 2.2M tokens, 743 articles
English - 1.7M tokens, 965 articles
Gujarati (ગુજરાતી) - 2.9M tokens, 976 articles
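
The figures above can be turned into a rough tokens-per-article comparison. The following is a minimal sketch that only uses the numbers listed here; the names `STATS` and `tokens_per_article` are illustrative, and because the token counts are rounded to the nearest 0.1M, the averages are approximate.

```python
# Per-language figures copied from the list above: (tokens in millions, articles).
STATS = {
    "Hindi": (2.1, 976),
    "Urdu": (2.4, 963),
    "Punjabi": (4.3, 978),
    "Marathi": (2.1, 974),
    "Tamil": (2.2, 975),
    "Telugu": (2.2, 743),
    "English": (1.7, 965),
    "Gujarati": (2.9, 976),
}


def tokens_per_article(tokens_millions, articles):
    """Approximate average number of tokens per article."""
    return tokens_millions * 1_000_000 / articles


# Sum of the rounded per-language token counts, in millions.
total_tokens_m = sum(tokens for tokens, _ in STATS.values())
print(f"Total tokens: {total_tokens_m:.1f}M")

# Languages ordered from the highest to the lowest average.
ranked = sorted(
    STATS.items(),
    key=lambda item: tokens_per_article(*item[1]),
    reverse=True,
)
for language, (tokens, articles) in ranked:
    average = tokens_per_article(tokens, articles)
    print(f"{language:<9}{average:>8.0f} tokens/article")
```

Differences between languages may reflect how the tokenizer (not specified here) handles each script as much as differences in article length.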

[Figures not shown: articles distribution by language; token distribution; datasets by language]

Sample Data

Explore the actual content from one of our datasets.

[Preview not shown: Gujarati dataset, article titles and content]

Dataset Features

Each dataset includes:

Clean Text: HTML entity decoding and tag removal
Script-Specific Normalization: Language-appropriate text processing
Parquet Format: Efficient storage and fast loading
Structured Data: Title and full article content
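
The cleaning steps listed above might look roughly like the following. This is a minimal sketch under assumptions, not the pipeline used to build these datasets: the function name `clean_text`, the regex-based tag stripping, and the choice of NFC as the Unicode normalization form are all illustrative, and real script-specific normalization is likely to involve more than this.

```python
import html
import re
import unicodedata

# Matches anything that looks like an HTML tag, e.g. <p>, </div>, <br/>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Strip HTML tags, decode entities, and normalize Unicode and whitespace."""
    # Remove tags first, so that escaped markup such as "&lt;b&gt;" survives
    # as literal text. Replacing with a space keeps block-level tags from
    # gluing neighboring words together (at the cost of splitting words
    # that contain inline tags).
    without_tags = TAG_RE.sub(" ", raw)
    # Decode entities such as &amp; and &nbsp;.
    decoded = html.unescape(without_tags)
    # NFC composes base characters with their combining marks where a
    # composed form exists (assumed normalization form).
    normalized = unicodedata.normalize("NFC", decoded)
    # Collapse all runs of whitespace, including non-breaking spaces.
    return " ".join(normalized.split())


print(clean_text("<p>Tom &amp; Jerry</p>"))
print(clean_text("<h1>ગુજરાતી</h1>&nbsp;<p>PARI</p>"))
```

Tags are removed before entities are decoded so that text which was deliberately escaped in the source is not mistaken for markup.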

Use Cases

These datasets are designed for:

Text generation and language modeling
Cross-lingual research
Cultural and linguistic analysis
Training multilingual AI models
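
For the language modeling use case, long articles usually need to be split into fixed-size pieces before training. Here is a minimal sketch of overlapping sliding-window chunking. The function name `chunk_text` and the `size` and `overlap` parameters are hypothetical, and the windows are measured in Unicode code points, so in Indic scripts a boundary can fall between a base character and its combining marks; a real pipeline would more likely chunk by tokens.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once a window has reached the end of the text, so no trailing
    # chunk consists only of already-covered overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


# Illustrative record shaped like a dataset row (title and text fields).
article = {"title": "Example", "text": "abcdefghij"}
for chunk in chunk_text(article["text"], size=4, overlap=2):
    print(chunk)
```

The overlap gives each chunk some context from the previous one, which matters when a sentence straddles a window boundary.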

License & Usage

PARI content is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).

You're free to share and distribute with attribution. Commercial use and derivatives require permission from PARI.

Terms of Service
Copyright Details
Contact: contact@ruralindiaonline.org

We encourage you to:

Cite PARI and this dataset when publishing research or building applications
Support PARI's mission by becoming a volunteer or donating to fund their work
Respect the journalism and maintain proper attribution to the original content creators

Get Started

All datasets are available on Hugging Face, subject to the license terms above. Visit the individual dataset pages for more details and download links.

Usage Example

Here's how to use these datasets with the Hugging Face datasets library:

from datasets import load_dataset

# Load the dataset (example: Hindi)
dataset = load_dataset("keplersystems/PAARI-Hindi")

# Print the title and the first 200 characters of the first three articles
for article in dataset["train"].select(range(3)):
    print(f"Title: {article['title']}")
    print(f"Text: {article['text'][:200]} ...")

Citation

If you use these datasets in your research, please cite:

@misc{pari-datasets-2025,
    title={PARI Multilingual Datasets},
    author={People's Archive of Rural India and Kepler Systems},
    howpublished={\url{https://huggingface.co/keplersystems}},
    year={2025}
}

Acknowledgments

These datasets are derived from the journalism of the People's Archive of Rural India (PARI) and Rural India Online. We thank all the journalists, photographers, and contributors who created this invaluable content documenting rural Indian life and culture.

PARI's mission to document and preserve stories from rural India has created a unique resource that not only serves journalism but also enables AI research and development for underrepresented languages and communities.

Introducing Curated PARI Datasets: 8 Indian Languages for AI Research