The Challenge of Building Multilingual AI

November 26, 2025

Why Indian language AI matters and the unique challenges we face in creating inclusive language models.

Building AI that understands and works with Indian languages is harder than it looks. Here's what we've learned.

The English Bias

Most AI models are trained predominantly on English text. This creates several problems:

Tokenization Issues: Tokenizers built mostly from English text split Indian-language words into many more tokens, which raises cost and shrinks the usable context window
Cultural Context: Models miss cultural nuances and context
Data Scarcity: High-quality training data is limited
Script Complexity: Multiple writing systems and diacritical marks
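
To make the tokenization point concrete, here is a minimal sketch that uses UTF-8 byte counts as a rough proxy for tokenizer cost. This is an assumption for illustration only: real tokenizers differ, but byte-level vocabularies fall back to raw bytes when they have few merges for a script, so the byte count approximates the worst case. The function name `utf8_bytes_per_char` and the sample words are hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Sample greetings, written as escapes so the code points are unambiguous.
samples = {
    "English": "hello",
    "Hindi (Devanagari)": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "Urdu (Arabic script)": "\u0633\u0644\u0627\u0645",
}

for label, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    print(f"{label}: {len(word)} code points, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(word):.1f} bytes per code point")
```

Devanagari code points take three bytes each in UTF-8 and Arabic-script code points take two, while ASCII letters take one, so a tokenizer with little Indian-language coverage starts from a longer sequence before any merges apply.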

Why It Matters

India has:

22 languages recognized in the Eighth Schedule of the Constitution
19,500+ mother tongues and dialects reported in the 2011 census
1.4 billion people, many of whom prefer their native language
Rich literary and cultural traditions in each language

Yet AI advancement has largely left these languages behind.

Our Approach

1. Data Collection

We focus on:

High-quality, verified content
Diverse sources (literature, journalism, social media)
Proper licensing and attribution
Community involvement

2. Proper Preprocessing

Script-specific normalization
Maintaining linguistic nuances
Handling mixed-script text
Preserving cultural context
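
As one illustration of script-specific normalization and mixed-script handling, the sketch below uses only Python's standard `unicodedata` module. It is not our production pipeline: the helper names `normalize_text` and `scripts_in` are hypothetical, and reading the script from the first word of a character's Unicode name is a heuristic, not an official script property.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC so equivalent spellings share one code point sequence."""
    return unicodedata.normalize("NFC", text)


def scripts_in(text: str) -> set[str]:
    """Heuristically list the scripts in text, using Unicode character names.

    Assumption: the first word of a character's name ("DEVANAGARI",
    "LATIN", "ARABIC", ...) identifies its script. Generic combining
    marks would be reported as "COMBINING", so treat this as a rough check.
    """
    scripts = set()
    for ch in text:
        # Keep letters (L*) and marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


# Devanagari QA can be stored as one code point or as KA + NUKTA.
precomposed = "\u0958"
decomposed = "\u0915\u093c"
print(normalize_text(precomposed) == normalize_text(decomposed))

# Romanized Hindi mixed with Devanagari in one string.
print(scripts_in("mera \u0928\u093e\u092e"))
```

Normalizing before deduplication or tokenization keeps visually identical words from being counted as different strings, and the script check flags code-mixed text for separate handling.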

3. Evaluation

Beyond perplexity scores, we assess:

Cultural appropriateness
Linguistic accuracy
Real-world usability
Community feedback
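
For reference, perplexity is the exponential of the average negative log-probability a model assigns to each token. The sketch below shows that calculation under the assumption that per-token natural-log probabilities are already available; the function name `perplexity` and its input format are illustrative, not tied to any specific library.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_neg_logprob)


# A model that gives every token probability 0.5 is, on average,
# choosing between two equally likely options.
print(perplexity([math.log(0.5)] * 4))
```

The number reflects only how well the model predicts text. It is also computed per token, so it is not directly comparable across tokenizers or languages, which is one reason the criteria above matter.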

Challenges We Face

Technical

Limited compute resources
Lack of standard benchmarks
Tokenization inefficiency
Cross-script handling

Social

Data availability and licensing
Community trust and involvement
Balancing tradition with innovation
Ensuring inclusive representation

Progress So Far

We've released:

8 PAARI language datasets
Poetry Llama for Urdu
UrduShers and UrduGhazals datasets

But this is just the beginning.

Looking Forward

Our goals:

Expand to more Indian languages
Build specialized domain models
Create accessible tools
Foster an open-source community
Ensure ethical and inclusive AI

Join Us

Building truly multilingual AI requires:

Linguists and cultural experts
Machine learning researchers
Data contributors
Community feedback
Computational resources

Interested in contributing? Contact us at contact@kepler.systems