A community language data initiative

Every Language Deserves a Digital Future

Donate text, speech, and documents in Pashto, Urdu, Punjabi, Sindhi, Balochi, Hindko, Saraiki, Brahui, or Kashmiri, and help build the speech recognition, translation, and AI systems these languages have been waiting for.

Donate Data · Learn More

Training AI for Pakistani languages · Every dialect welcome

Our Mission

Languages spoken by hundreds of millions are nearly invisible to modern AI.

The languages of Pakistan and its neighboring regions carry centuries of poetry, scholarship, and oral tradition, yet most have almost no presence in the datasets that train today's AI. Awaz is where speakers, teachers, writers, and researchers come together to build that missing data.

Supported Languages

Nine languages today — yours can be next

Pashto

Khyber Pakhtunkhwa, Balochistan, Afghanistan

پښتو

Urdu

Pakistan (national)

اردو

Punjabi

Punjab

پنجابی

Sindhi

Sindh

سنڌي

Balochi

Balochistan, Iran, Afghanistan

بلۏچی

Hindko

Khyber Pakhtunkhwa, Punjab

ہندکو

Saraiki

South Punjab

سرائیکی

Brahui

Balochistan

براہوئی

Kashmiri

Kashmir

کٲشُر

More languages are added over time — and yours could be next.

Why It Matters

Languages thrive when machines can speak them too

Hundreds of millions of speakers, barely heard by AI

Speech and translation systems underperform in low-resource languages because high-quality training data barely exists. Every donated sentence narrows that gap.

Preservation through digitization

Manuscripts, folk literature, and regional dialects risk fading away. Digitizing them protects cultural memory while making it usable for research.

Technology that serves communities

Better language AI means education tools, accessibility software, and information access in the languages people actually speak at home.

How Data Donation Works

From your voice to a language's future — four steps

1. Create your account
Sign up with Google or email. Your contributions stay linked to your profile.
2. Pick your language
Choose from nine supported languages. Every dialect and accent is valuable.
3. Contribute in minutes
Type, record in your browser, or drag and drop files. No technical knowledge needed.
4. Power language AI
Contributions become organized, documented datasets for training speech recognition, OCR, translation, and large language models.

Contribution Categories

Four ways to strengthen the commons

Text Data

Sentences, articles, stories, poetry, proverbs, everyday writing.

Examples: Folk tales, news articles, social posts, school essays
Impact: Trains language models and translation systems.
Formats: Typed · TXT · CSV · XLSX · PDF · DOCX

Voice Data

Your voice speaking your language — with an optional transcription.

Examples: Read sentences, retold stories, dialect samples
Impact: Powers speech recognition (ASR) and natural text-to-speech.
Formats: Browser recording · MP3 · WAV · M4A · OGG

Documents

Books, manuscripts, newspapers, scanned pages, educational material.

Examples: Digitized books, magazines, letters, archives
Impact: Enables OCR and recovers hard-to-find written heritage.
Formats: PDF · DOCX · TXT · CSV · XLSX · ZIP

Datasets

Structured collections — parallel corpora, word lists, labeled data.

Examples: Translation pairs, lexicons, annotated corpora
Impact: Gives researchers ready-to-use building blocks for NLP.
Formats: CSV · JSON · JSONL · XLSX · ZIP
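
For contributors preparing files in bulk, the format lists above can be turned into a quick local check. The following is only a sketch: the extension sets are copied from the "Formats" lines in this section, while the names `ACCEPTED_EXTENSIONS` and `categories_for` are hypothetical and do not describe how the upload form itself validates files.

```python
from pathlib import Path

# Extensions per contribution category, copied from the "Formats" lines above.
# "Typed" and "Browser recording" are in-page inputs, not files, so they are omitted.
# Assumption: this mapping is illustrative; the site's own validation may differ.
ACCEPTED_EXTENSIONS = {
    "text": {".txt", ".csv", ".xlsx", ".pdf", ".docx"},
    "voice": {".mp3", ".wav", ".m4a", ".ogg"},
    "documents": {".pdf", ".docx", ".txt", ".csv", ".xlsx", ".zip"},
    "datasets": {".csv", ".json", ".jsonl", ".xlsx", ".zip"},
}


def categories_for(filename: str) -> list[str]:
    """Return the categories whose listed formats include this file's extension."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    return [name for name, exts in ACCEPTED_EXTENSIONS.items() if ext in exts]


if __name__ == "__main__":
    for name in ["folk_tales.txt", "story.WAV", "pairs.jsonl", "scan.tiff"]:
        print(name, "->", categories_for(name) or "not listed")
```

A file whose extension matches several categories, such as a TXT or XLSX file, still needs a human decision about where it belongs; the check only reports which categories list that format.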

Community Impact

What your contribution makes possible

Speech recognition
Voice assistants, automatic subtitles, and hands-free tools for low-literacy users — in nine languages and counting.
Machine translation
Bridges between regional languages, Urdu, and English — for health, law, and education.
OCR & digitization
Searchable archives of books and manuscripts that today exist only on paper.
Language models
Chat, writing, and learning tools that are fluent in the languages people live in.

“When a language enters the digital world, its speakers enter with it.”

Every sentence, every minute of speech, and every scanned page helps train AI that can finally understand the languages of Pakistan.

FAQ

Common questions

Who can contribute?

Anyone. Native speakers, heritage speakers, learners, teachers, writers, and researchers are all welcome. Every dialect and accent makes the data stronger.

Which languages are supported?

Pashto, Urdu, Punjabi, Sindhi, Balochi, Hindko, Saraiki, Brahui, and Kashmiri at launch — with more languages added over time.

What happens to the data I donate?

Your contributions are stored securely and organized by language. The long-term goal is clean, documented datasets used to train AI for speech, OCR, translation, and language-model research.

Do I need to speak the language perfectly?

No. Real language — with regional accents, dialect words, and informal style — is exactly what AI systems need to learn.

Can I upload material I didn't write?

Only if you have the right to share it — your own work, public-domain material, or content whose owner gave permission. Note the source in the upload form so it can be attributed correctly.

Is my personal information shared?

No. Your account details stay private. Any data used for training will only ever include the language data itself, never personal details.

Your language. Your voice. Your contribution.

It takes two minutes to make your first donation — and every one counts.

Create Your Account
