Building AI that understands and works with Indian languages is harder than it looks. Here's what we've learned.

The English Bias

Most AI models are trained predominantly on English text. For Indian languages, this creates several problems:

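  • Tokenization Issues: Tokenizers built mostly from English text split Indian-language words into far more tokens than comparable English words
  • Cultural Context: Models miss cultural nuances and references
  • Data Scarcity: High-quality training data is limited
  • Script Complexity: Multiple writing systems, each with its own combining vowel signs and diacritical marks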
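
To make the tokenization point concrete, here is a minimal sketch that compares how many UTF-8 bytes an English word and a Hindi word occupy. It does not measure any particular model's tokenizer. It assumes a byte-level tokenizer, which starts from raw bytes and only merges them into larger tokens when it has seen enough text in that script. The helper name `utf8_cost` is made up for this illustration.

```python
def utf8_cost(text: str) -> dict:
    """Count code points and UTF-8 bytes for a string.

    A byte-level tokenizer starts from the bytes, so a higher byte count
    means more work before the text is merged into compact tokens.
    """
    n_code_points = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {
        "code_points": n_code_points,
        "utf8_bytes": n_bytes,
        "bytes_per_code_point": n_bytes / n_code_points,
    }


english = utf8_cost("hello")   # 5 code points, 5 bytes
hindi = utf8_cost("नमस्ते")      # 6 code points, 18 bytes

print(english)
print(hindi)
```

Every Devanagari code point takes three bytes in UTF-8, so before any merges are learned the Hindi greeting costs more than three times as many base units as the English one.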

Why It Matters

India has:

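  • 22 languages recognized in the Eighth Schedule of the Constitution
  • More than 19,500 languages and dialects reported as mother tongues in the 2011 Census
  • About 1.4 billion people, many of whom prefer their native language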
  • Rich literary and cultural traditions in each language

Yet AI advancement has largely left these languages behind.

Our Approach

1. Data Collection

We focus on:

  • High-quality, verified content
  • Diverse sources (literature, journalism, social media)
  • Proper licensing and attribution
  • Community involvement

2. Preprocessing

  • Script-specific normalization
  • Maintaining linguistic nuances
  • Handling mixed-script text
  • Preserving cultural context
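
As a rough sketch of what normalization and mixed-script handling can look like, the example below uses only Python's standard `unicodedata` module. It is not our production pipeline. The function names `script_of` and `split_by_script` are hypothetical, and taking the first word of a character's Unicode name as its script is a heuristic we assume here, since the standard library does not expose the Unicode Script property directly.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Rough script label: the first word of the character's Unicode name."""
    # Letters and combining marks carry a script; spaces, digits and
    # punctuation are lumped together as COMMON in this sketch.
    if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
        return "COMMON"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # the character has no Unicode name
        return "UNKNOWN"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Normalize to NFC, then group consecutive characters by script label."""
    text = unicodedata.normalize("NFC", text)
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=script_of)
    ]


runs = split_by_script("नमस्ते world")
print(runs)
# [('DEVANAGARI', 'नमस्ते'), ('COMMON', ' '), ('LATIN', 'world')]
```

Normalizing first matters because the same visible text can be encoded in more than one way. For example, the precomposed Devanagari letter U+0958 and the sequence U+0915 followed by the nukta U+093C both become the two-character sequence under NFC, so downstream steps see one consistent form.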

3. Evaluation

We look beyond perplexity scores and also evaluate:

  • Cultural appropriateness
  • Linguistic accuracy
  • Real-world usability
  • Community feedback
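
One reason perplexity alone can mislead is that it is measured per token, so it depends on how the text was tokenized. The sketch below shows the standard perplexity formula next to a per-character alternative. The log-probabilities are made-up numbers for illustration, not results from any of our models, and `bits_per_character` is a hypothetical helper name.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Per-token perplexity: exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_character(token_logprobs: list[float], text: str) -> float:
    """Total negative log-likelihood in bits, divided by the number of code points."""
    total_nats = -sum(token_logprobs)
    return total_nats / math.log(2) / len(text)


text = "abcd"  # placeholder four-character text

# Two hypothetical tokenizations of the same text with the same total likelihood.
coarse = [math.log(0.25)] * 2  # 2 tokens
fine = [math.log(0.5)] * 4     # 4 tokens

print(perplexity(coarse), perplexity(fine))
print(bits_per_character(coarse, text), bits_per_character(fine, text))
```

Both tokenizations assign the text the same total probability, yet the one with more tokens reports a lower per-token perplexity. Normalizing by characters gives the same score for both, which makes comparisons across languages and tokenizers fairer.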

Challenges We Face

Technical

  • Limited compute resources
  • Lack of standard benchmarks
  • Tokenization inefficiency
  • Cross-script handling

Social

  • Data availability and licensing
  • Community trust and involvement
  • Balancing tradition with innovation
  • Ensuring inclusive representation

Progress So Far

We've released:

  • 8 PAARI language datasets
  • Poetry Llama for Urdu
  • UrduShers and UrduGhazals datasets

But this is just the beginning.

Looking Forward

Our goals:

  1. Expand to more Indian languages
  2. Build specialized domain models
  3. Create accessible tools
  4. Foster an open-source community
  5. Ensure ethical and inclusive AI

Join Us

Building truly multilingual AI requires:

  • Linguists and cultural experts
  • Machine learning researchers
  • Data contributors
  • Community feedback
  • Computational resources

Interested in contributing? Contact us at contact@kepler.systems