We've released multilingual journalism datasets from the People's Archive of Rural India (PARI).
Why This Matters
Indian languages are among the world's most widely spoken yet technologically underrepresented. Hindi alone has over 600 million speakers, and Telugu, Tamil, and Marathi each have more than 70 million, yet these languages lack the high-quality datasets needed to train modern AI systems.
These PARI datasets help address this critical gap by providing:
- Authentic voices: Real journalism capturing rural India's stories, not synthetic or translated content
- Cultural context: Content grounded in the lived experiences of rural communities
- Linguistic diversity: Coverage across multiple scripts (Devanagari, Arabic, Gurmukhi, Tamil, Telugu, Gujarati)
- Quality over quantity: Professionally written, edited content from respected journalists
What is PARI?
The People's Archive of Rural India documents the lives, cultures, and stories of rural India through professional journalism. These stories, written in multiple Indian languages, capture rural life often overlooked in mainstream media.
Dataset Overview
We've compiled and processed journalism articles across 8 languages:
- Hindi (हिन्दी) - 2.1M tokens, 976 articles
- Urdu (اردو) - 2.4M tokens, 963 articles
- Punjabi (ਪੰਜਾਬੀ) - 4.3M tokens, 978 articles
- Marathi (मराठी) - 2.1M tokens, 974 articles
- Tamil (தமிழ்) - 2.2M tokens, 975 articles
- Telugu (తెలుగు) - 2.2M tokens, 743 articles
- English - 1.7M tokens, 965 articles
- Gujarati (ગુજરાતી) - 2.9M tokens, 976 articles
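As a rough illustration of how these figures relate, the sketch below computes totals and average tokens per article from the numbers listed above. The token counts are rounded to the nearest 0.1M in the list, so the averages are approximate; the names `STATS` and `tokens_per_article` are illustrative and not part of the datasets.

```python
# Figures copied from the list above: (approximate tokens, articles).
# Token counts are rounded to the nearest 0.1M, so averages are approximate.
STATS = {
    "Hindi": (2_100_000, 976),
    "Urdu": (2_400_000, 963),
    "Punjabi": (4_300_000, 978),
    "Marathi": (2_100_000, 974),
    "Tamil": (2_200_000, 975),
    "Telugu": (2_200_000, 743),
    "English": (1_700_000, 965),
    "Gujarati": (2_900_000, 976),
}


def tokens_per_article(stats):
    """Return {language: average tokens per article}, rounded to whole tokens."""
    return {
        lang: round(tokens / articles)
        for lang, (tokens, articles) in stats.items()
    }


total_tokens = sum(tokens for tokens, _ in STATS.values())
total_articles = sum(articles for _, articles in STATS.values())
averages = tokens_per_article(STATS)
longest = max(averages, key=averages.get)

print(f"Total: {total_tokens:,} tokens across {total_articles:,} articles")
for lang, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<9}{avg:>6,} tokens/article")
```

Under these rounded figures, Punjabi has the highest average and English the lowest. The list does not say which tokenizer produced the counts, so cross-language comparisons should be read with care.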
[Figure: Articles distribution by language]
[Figure: Token distribution by language]
Datasets by Language
Sample Data
Explore the actual content from one of our datasets. Here's a preview of the Gujarati dataset showing article titles and content:
Dataset Features
Each dataset includes:
- Clean Text: HTML entity decoding and tag removal
- Script-Specific Normalization: Language-appropriate text processing
- Parquet Format: Efficient storage and fast loading
- Structured Data: Title and full article content
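The cleaning steps listed above might look roughly like the following. This is a minimal sketch, not the pipeline actually used to build the datasets: the function name `clean_text` is hypothetical, and Unicode NFC normalization is only an assumed stand-in for the script-specific normalization step, which is not described in detail here.

```python
import html
import re
import unicodedata

# Matches anything that looks like an HTML tag, e.g. <p> or </div>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and normalize Unicode and whitespace."""
    # Remove tags first so that escaped markup such as "&lt;b&gt;" survives as text.
    text = TAG_RE.sub(" ", raw)
    # Decode entities such as "&amp;".
    text = html.unescape(text)
    # Assumption: NFC stands in for the script-specific normalization step.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace to single spaces and trim the ends.
    return " ".join(text.split())


print(clean_text("<p>Rural &amp; urban stories</p>"))
```

Removing tags before decoding entities is a deliberate choice in this sketch: decoding first would turn escaped text like `&lt;b&gt;` into a real tag, which the tag-removal step would then delete.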
Use Cases
These datasets are designed for:
- Text generation and language modeling
- Cross-lingual research
- Cultural and linguistic analysis
- Training multilingual AI models
License & Usage
PARI content is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
You may share and redistribute the content with attribution for non-commercial purposes. Commercial use and derivative works require permission from PARI.
- Terms of Service
- Copyright Details
- Contact: contact@ruralindiaonline.org
We encourage you to:
- Cite PARI and this dataset when publishing research or building applications
- Support PARI's mission by becoming a volunteer or donating to fund their work
- Respect the journalism and maintain proper attribution to the original content creators
Get Started
All datasets are available on Hugging Face under the CC BY-NC-ND 4.0 terms described above. Visit the individual dataset pages for more details and download links.
Usage Example
Here's how to use these datasets with the Hugging Face datasets library:
from datasets import load_dataset

# Load the dataset (example: Hindi)
dataset = load_dataset("keplersystems/PAARI-Hindi")

# Access examples
for article in dataset["train"]:
    print(f"Title: {article['title']}")
    print(f"Text: {article['text'][:200]} ...")
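The loop above prints every article in the split. For a quick look, a small helper that stops after a few records can be handy. The sketch below is one way to do that; `preview` is a hypothetical helper, not part of the `datasets` library, and it assumes each record is a dict with `title` and `text` fields as in the example above. It runs on stand-in records so that it works without downloading anything.

```python
from itertools import islice


def preview(articles, n=3, width=200):
    """Return up to n short previews from an iterable of article dicts."""
    lines = []
    for article in islice(articles, n):
        text = article["text"]
        # Only mark truncation when the text was actually cut.
        snippet = text[:width] + (" ..." if len(text) > width else "")
        lines.append(f"Title: {article['title']}\nText: {snippet}")
    return lines


# Stand-in records so the sketch runs without network access.
sample = [
    {"title": "First story", "text": "A short article."},
    {"title": "Second story", "text": "x" * 500},
]

for entry in preview(sample, n=2, width=20):
    print(entry)
```

With a loaded dataset, `preview(dataset["train"])` should behave the same way, provided the records expose the same two fields.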
Citation
If you use these datasets in your research, please cite:
@misc{pari-datasets-2025,
  title={PARI Multilingual Datasets},
  author={{People's Archive of Rural India} and {Kepler Systems}},
  howpublished={\url{https://huggingface.co/keplersystems}},
  year={2025}
}
Acknowledgments
These datasets are derived from the journalism work of the People's Archive of Rural India (PARI) and Rural India Online. We thank all the journalists, photographers, and contributors who created this invaluable content documenting rural Indian life and culture.
PARI's mission to document and preserve stories from rural India has created a unique resource that not only serves journalism but also enables AI research and development for underrepresented languages and communities.