Training a 3B-Parameter TTS Model on a Language That Didn't Have One

When I started, Urdu didn't have a text-to-speech model worth using. The state of the art was either a 1990s formant synthesizer that sounded like a robot reading the Quran on a tape recorder, or a handful of small models trained on a few hours of news audio — wooden, monotone, and unable to handle the simplest poetic register. Eighteen months later, Orpheus Urdu TTS is a 3B-parameter model that has crossed 179,000 downloads on HuggingFace and gets used in production by people I have never met. This is the story of how that happened — including the parts that did not work.

The Problem with Urdu TTS

The first lie I had to unlearn was that "low-resource" means "small dataset." It does not. Low-resource means the entire infrastructure is missing. You do not have a forced aligner. You do not have a phoneme set anyone agrees on. You do not have a normalization pipeline that handles English loanwords gracefully, and Urdu has thousands of them.

Urdu is written in Nastaliq, a calligraphic style of the Perso-Arabic script that flows diagonally, with letters that change shape based on their neighbors and ligatures that span entire words. It is fundamentally context-sensitive. The same word can have three valid orthographic forms, and the model has to learn that they sound identical. Then there is code-switching: real Urdu speakers drop English words mid-sentence ("main aaj meeting mein hoon"), and any TTS that can't handle this is useless outside a classroom.

I built three pipelines before I realized the problem wasn't the model. It was that no one had ever written down what "clean Urdu text" actually means.

The Dataset

The dataset took eight months. That is longer than the model training took. I started with every public source I could find: Common Voice (tiny, noisy), Pakistani news broadcasts (great audio, terrible licensing), audiobooks from public-domain literature, parliamentary proceedings, and a long tail of YouTube content I had to filter aggressively for quality.

The cleaning pipeline ended up being four stages:

Voice activity detection with a fine-tuned Silero model to chop out silence and music beds.
Speaker diarization to keep clips that were a single speaker for ≥ 4 seconds.
Whisper-large transcript verification against the source text, with a CER threshold of 0.08 — anything noisier got rejected.
Manual spot-checks on a stratified random sample of 500 clips per source.
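
The filtering logic in stages two and three can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: the Clip record and its field names are hypothetical, the CER is a plain character-level Levenshtein ratio, and treating the 0.08 threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


@dataclass
class Clip:
    # Hypothetical record; the field names are illustrative only.
    source_text: str    # text the speaker was supposed to read
    asr_text: str       # transcript produced by the ASR model
    duration_s: float   # clip length after VAD trimming
    num_speakers: int   # speaker count reported by diarization


def keep_clip(clip: Clip, max_cer: float = 0.08, min_duration_s: float = 4.0) -> bool:
    """Keep single-speaker clips that are long enough and match their source text."""
    return (clip.num_speakers == 1
            and clip.duration_s >= min_duration_s
            and cer(clip.source_text, clip.asr_text) <= max_cer)
```

In practice both texts would go through the same normalization before comparison, otherwise spelling variants inflate the CER and good clips get rejected.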

What I ended up with was 208 hours of clean Urdu audio, 37 distinct speakers, and a normalization spec that handles dates, numbers, English loanwords, and the worst offender — Urdu poetry, which breaks every rule about pronunciation timing.
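
To make the idea of a normalization spec concrete, here is a small sketch of one rule such a spec might contain: folding Arabic-keyboard code points into the Urdu letters they usually stand for, so that visually identical spellings become byte-identical. The three mappings below are a common convention in Urdu text processing, but they are an assumption here, not the actual spec, which also covers dates, numbers, and loanwords.

```python
import unicodedata

# Assumed variant map: Arabic code points that often appear in Urdu text
# in place of the Urdu-specific letters.
VARIANTS = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH  -> ARABIC LETTER HEH GOAL
}
_TABLE = str.maketrans(VARIANTS)


def normalize_urdu(text: str) -> str:
    """Apply NFC, fold the variant letters above, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_TABLE)
    return " ".join(text.split())


# Two spellings of the same word (kitab, "book") that differ only in the kaf.
arabic_kaf = "\u0643\u062A\u0627\u0628"
urdu_kaf = "\u06A9\u062A\u0627\u0628"
same_after_folding = normalize_urdu(arabic_kaf) == normalize_urdu(urdu_kaf)
```

Diacritics are deliberately left alone in this sketch, since for TTS they can carry pronunciation information.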

The Architecture

I chose Orpheus 3B because it gave me three things at once: a strong English prior to bootstrap from, a discrete audio codec (SNAC) that handles low-resource languages well, and an autoregressive backbone that I already understood how to fine-tune. The alternative would have been training a VITS or YourTTS model from scratch, which I had tried and which had not worked — too little data, too much variance.

The modifications were smaller than I expected:

Extended the tokenizer with a custom Urdu-aware BPE trained on 12M cleaned sentences. The original tokenizer butchered Urdu into per-character fragments, which destroyed quality.
Added speaker conditioning embeddings for the 37 voices in the training set, with a fallback "neutral" embedding for inference.
Kept the SNAC codec frozen. Re-training the codec was tempting but every ablation showed it hurt more than it helped at this scale.

That was it. The model is fundamentally Orpheus — what changed was the data and the tokenizer. I think this is the lesson nobody emphasizes enough: most of the work in low-resource speech is upstream of the model.
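
The speaker-conditioning fallback is easy to picture as a lookup table with one reserved row. The sketch below is framework-free and purely illustrative: the row layout (row 0 as neutral), the embedding width, and the function names are assumptions, and only the count of 37 voices comes from the text above.

```python
import random

NUM_SPEAKERS = 37   # voices in the training set, from the text
NEUTRAL_INDEX = 0   # assumed layout: row 0 is the neutral fallback
EMBED_DIM = 8       # illustrative width only; the real dimension is not stated


def build_table(num_speakers: int = NUM_SPEAKERS, dim: int = EMBED_DIM, seed: int = 0):
    """Create one small random vector per speaker, plus the neutral row."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_speakers + 1)]


def speaker_row(speaker_id) -> int:
    """Map a 1-based speaker id to a table row; anything else falls back to neutral."""
    if isinstance(speaker_id, int) and 1 <= speaker_id <= NUM_SPEAKERS:
        return speaker_id
    return NEUTRAL_INDEX


def lookup(table, speaker_id=None):
    """Return the conditioning vector for a speaker, or the neutral vector."""
    return table[speaker_row(speaker_id)]
```

In a real model the table would be a trainable embedding layer; the point of the sketch is only that inference without a speaker id still resolves to a well-defined vector.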

Training

I had two A100s. That is it. A 3B-parameter autoregressive model on 208 hours of audio is not a small training job, and I was paying for it out of pocket. The first lesson was brutal: do not iterate on a 3B model. I ran 47 ablations on a 350M sandbox version before I touched the full model. Every hyperparameter, every data-mix ratio, every loss schedule — all of it was decided at small scale first.

Then there was Karachi. Karachi has a relationship with electricity that can be described as occasional. The longest single power cut during the training run was eleven hours. I lost two checkpoints to UPS failures. After the second one, I rewrote my training loop to checkpoint every 200 steps instead of every epoch, and to write to two physical disks simultaneously. The training script became, slowly, a small piece of infrastructure software in its own right.
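
A minimal sketch of that checkpointing pattern, assuming the checkpoint is already serialized to bytes. The function names and file naming scheme are hypothetical, and this version writes the two copies one after the other rather than truly in parallel. The key idea is that each copy is written to a temporary file, flushed to disk, and only then renamed into place, so a power cut mid-write cannot leave a half-written file under the final name.

```python
import os
import tempfile
from pathlib import Path

CHECKPOINT_EVERY = 200  # steps, from the text


def should_checkpoint(step: int, every: int = CHECKPOINT_EVERY) -> bool:
    """True on every multiple of `every`, excluding step 0."""
    return step > 0 and step % every == 0


def write_atomic(path: Path, payload: bytes) -> None:
    """Write to a temp file in the target directory, fsync, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def save_checkpoint(payload: bytes, step: int, roots) -> list:
    """Write the same checkpoint under each root directory (one per physical disk)."""
    written = []
    for root in roots:
        target = Path(root) / f"step_{step:08d}.ckpt"
        write_atomic(target, payload)
        written.append(target)
    return written
```

Inside a training loop this would be called as `if should_checkpoint(step): save_checkpoint(blob, step, [disk_a, disk_b])`, where `disk_a` and `disk_b` are mount points on separate drives.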

The hardest bug to find was a numerical instability that only triggered after 18 hours of training, on a single specific batch where speaker 23 happened to read a poem with seven consecutive long vowels. It took a week. The fix was three lines.

Results

I evaluated against three baselines: the original Coqui Urdu TTS, Microsoft's Azure Urdu voice, and a re-implementation of the YourTTS Urdu paper from 2022. The Mean Opinion Scores (MOS) below come from 24 native speakers, listening in random order with blinded sources:

Coqui Urdu: 2.41 · YourTTS-Urdu: 3.08 · Azure: 3.71 · Orpheus Urdu: 4.18

The number that mattered more was the code-switch MOS — sentences with mixed Urdu and English. Azure dropped to 2.9 on these. Orpheus held at 4.0. That gap is everything, because real Urdu speakers code-switch constantly, and a TTS that can't handle it is a TTS no one will use.
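
For anyone wanting to run a similar listening test, the blinded, randomized setup can be sketched as follows. This is an illustration of the general technique rather than the evaluation code used here: the function names, the opaque label format, and the per-listener seeding are all assumptions.

```python
import random
from statistics import mean

SYSTEMS = ["Coqui Urdu", "YourTTS-Urdu", "Azure", "Orpheus Urdu"]


def blinded_playlist(sentence_ids, listener_id: int, systems=SYSTEMS):
    """Return (opaque_label, system, sentence_id) trials in a listener-specific order.

    Only the opaque label is shown to the listener; the system name stays
    with the experimenter as the answer key.
    """
    rng = random.Random(listener_id)  # reproducible shuffle per listener
    trials = [(system, sid) for system in systems for sid in sentence_ids]
    rng.shuffle(trials)
    return [(f"clip_{i:03d}", system, sid)
            for i, (system, sid) in enumerate(trials)]


def mos_by_system(ratings):
    """ratings: iterable of (system, score on a 1-5 scale). Returns {system: mean score}."""
    buckets = {}
    for system, score in ratings:
        buckets.setdefault(system, []).append(score)
    return {system: mean(scores) for system, scores in buckets.items()}
```

Running `mos_by_system` separately on the code-switched sentences gives the second, more telling number.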

Then I open-sourced it. The model is on HuggingFace, weights and inference code, no gatekeeping. 179K downloads later, the most rewarding messages I have received are from accessibility researchers building screen readers, and from a teacher in rural Sindh using it to make audio lessons for kids who can't read yet.

What's Next

Three things, in order:

Sindhi. I have started collecting Sindhi audio with the Haveli dataset. The same pipeline, a fraction of the data. We will see how far transfer learning gets us.
Pashto. Different writing system, different prosody, almost no public data. This will be the hard one.
Scaling Urdu to 500 hours. The 208h dataset is the floor, not the ceiling. With more speakers, more registers, and more emotional range, the next version of Orpheus Urdu should be able to read poetry the way it deserves to be read.

The broader point I keep coming back to: the field's frontier is not always the biggest model or the cleverest architecture. Sometimes it is the language that nobody trained on, in the country that nobody put on the map of compute. That frontier is wide open. There is a lot of room.

If you want to collaborate on Pakistani-language speech research, the contact links are on the home page. Reach out.