A community language data initiative
Every Language Deserves a Digital Future
Donate text, speech, and documents in Pashto, Urdu, Punjabi, Sindhi, Balochi, Hindko, Saraiki, Brahui, or Kashmiri, and help build the speech recognition, translation, and AI systems these languages have been waiting for.
Training AI for Pakistani languages · Every dialect welcome
Our Mission
Languages spoken by hundreds of millions are nearly invisible to modern AI.
The languages of Pakistan and its neighboring regions carry centuries of poetry, scholarship, and oral tradition — yet most have almost no presence in the datasets that train today's AI. Awaz is where speakers, teachers, writers, and researchers come together to build the data needed to train AI for these languages.
Supported Languages
Nine languages today — yours can be next
Pashto
Khyber Pakhtunkhwa, Balochistan, Afghanistan
پښتو
Urdu
Pakistan (national)
اردو
Punjabi
Punjab
پنجابی
Sindhi
Sindh
سنڌي
Balochi
Balochistan, Iran, Afghanistan
بلۏچی
Hindko
Khyber Pakhtunkhwa, Punjab
ہندکو
Saraiki
South Punjab
سرائیکی
Brahui
Balochistan
براہوئی
Kashmiri
Kashmir
کٲشُر
More languages are added over time.
Why It Matters
Languages thrive when machines can speak them too
Hundreds of millions of speakers, barely heard by AI
Speech and translation systems underperform in low-resource languages because high-quality training data barely exists. Every donated sentence narrows that gap.
Preservation through digitization
Manuscripts, folk literature, and regional dialects risk fading away. Digitizing them protects cultural memory while making it usable for research.
Technology that serves communities
Better language AI means education tools, accessibility software, and information access in the languages people actually speak at home.
How Data Donation Works
From your voice to a language's future — four steps
1. Create your account
Sign up with Google or email. Your contributions stay linked to your profile.
2. Pick your language
Choose from nine supported languages. Every dialect and accent is valuable.
3. Contribute in minutes
Type, record in your browser, or drag and drop files. No technical knowledge needed.
4. Power language AI
Contributions become organized, documented datasets for training speech recognition, OCR, translation, and large language models.
Contribution Categories
Four ways to strengthen the commons
Text Data
Sentences, articles, stories, poetry, proverbs, everyday writing.
- Examples
- Folk tales, news articles, social posts, school essays
- Impact
- Trains language models and translation systems.
- Formats
- Typed · TXT · CSV · XLSX · PDF · DOCX
Voice Data
Your voice speaking your language — with an optional transcription.
- Examples
- Read sentences, retold stories, dialect samples
- Impact
- Powers speech recognition (ASR) and natural text-to-speech.
- Formats
- Browser recording · MP3 · WAV · M4A · OGG
Documents
Books, manuscripts, newspapers, scanned pages, educational material.
- Examples
- Digitized books, magazines, letters, archives
- Impact
- Enables OCR and recovers hard-to-find written heritage.
- Formats
- PDF · DOCX · TXT · CSV · XLSX · ZIP
Datasets
Structured collections — parallel corpora, word lists, labeled data.
- Examples
- Translation pairs, lexicons, annotated corpora
- Impact
- Gives researchers ready-to-use building blocks for NLP.
- Formats
- CSV · JSON · JSONL · XLSX · ZIP
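For contributors preparing files ahead of time, the format lists above can be turned into a quick local check. The sketch below is only an illustration, not part of the Awaz platform: `categories_for` and `write_pairs_jsonl` are hypothetical helper names, and the record fields (`source_lang`, `target_lang`, `source`, `target`, `attribution`) are an assumed layout for translation pairs, not a required schema. Only the extension lists come from the categories above.

```python
import json
from pathlib import Path

# Accepted file formats per category, copied from the format lists above.
ACCEPTED_FORMATS = {
    "Text Data": {".txt", ".csv", ".xlsx", ".pdf", ".docx"},
    "Voice Data": {".mp3", ".wav", ".m4a", ".ogg"},
    "Documents": {".pdf", ".docx", ".txt", ".csv", ".xlsx", ".zip"},
    "Datasets": {".csv", ".json", ".jsonl", ".xlsx", ".zip"},
}


def categories_for(filename: str) -> list[str]:
    """Return the categories whose format list includes this file's extension."""
    ext = Path(filename).suffix.lower()
    return [name for name, exts in ACCEPTED_FORMATS.items() if ext in exts]


def write_pairs_jsonl(pairs: list[dict], path: str) -> int:
    """Write one JSON object per line (JSONL) and return the number of records."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            # ensure_ascii=False keeps Arabic-script text readable in the file.
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return len(pairs)


# Hypothetical record layout: these field names are illustrative assumptions.
pairs = [
    {
        "source_lang": "Urdu",
        "target_lang": "English",
        "source": "آپ کا نام کیا ہے؟",
        "target": "What is your name?",
        "attribution": "own work",
    },
]
```

Calling `write_pairs_jsonl(pairs, "pairs.jsonl")` produces a one-record file, and `categories_for("pairs.jsonl")` then returns `["Datasets"]`, since JSONL appears only in that category's format list.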
Community Impact
What your contribution makes possible
Speech recognition
Voice assistants, automatic subtitles, and hands-free tools for low-literacy users — in nine languages and counting.
Machine translation
Bridges between regional languages, Urdu, and English — for health, law, and education.
OCR & digitization
Searchable archives of books and manuscripts that today exist only on paper.
Language models
Chat, writing, and learning tools that are fluent in the languages people live in.
“When a language enters the digital world, its speakers enter with it.”
Every sentence, every minute of speech, and every scanned page helps train AI that can finally understand the languages of Pakistan.
FAQ
Common questions
Who can contribute?
Anyone. Native speakers, heritage speakers, learners, teachers, writers, and researchers are all welcome. Every dialect and accent makes the data stronger.
Which languages are supported?
Pashto, Urdu, Punjabi, Sindhi, Balochi, Hindko, Saraiki, Brahui, and Kashmiri at launch — with more languages added over time.
What happens to the data I donate?
Your contributions are stored securely and organized by language. The long-term goal is clean, documented datasets used to train AI for speech, OCR, translation, and language-model research.
Do I need to speak the language perfectly?
No. Real language — with regional accents, dialect words, and informal style — is exactly what AI systems need to learn.
Can I upload material I didn't write?
Only if you have the right to share it — your own work, public-domain material, or content whose owner gave permission. Note the source in the upload form so it can be attributed correctly.
Is my personal information shared?
No. Your account details stay private. Any data used for training will only ever include the language data itself, never personal details.
Your language. Your voice. Your contribution.
It takes two minutes to make your first donation — and every one counts.
Create Your Account