How to Fine-Tune Whisper on Your Own Audio Data
A practical guide to fine-tuning OpenAI's Whisper on your own labeled audio: data prep, training setup, evaluation, and common pitfalls.
How much data you actually need
For a focused fine-tune, 20 to 100 hours of clean labeled audio in your target domain is enough to see meaningful gains. Below 10 hours a fine-tune can still help with narrow vocabulary, but you risk overfitting. Above a few hundred hours the marginal benefit shrinks unless you are also expanding into new accents, environments, or speaker populations. The return per hour of audio is steepest in the first 50 hours, which makes a small but well-curated dataset extremely valuable.
Data preparation
Whisper expects 16 kHz mono audio in segments of up to 30 seconds. Resample, downmix, and segment your audio to match. Transcripts should be normalized consistently with how you want the model to output text, including punctuation and casing decisions. Do not feed the model conflicting normalization conventions. Hold back 5 to 10 percent of speakers as a validation set, not just 5 to 10 percent of clips, so you can measure speaker-level generalization rather than memorization.
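The downmix-resample-segment step can be sketched in plain NumPy. This is illustrative only: the linear-interpolation resampler is a stand-in, and a real pipeline would use ffmpeg, librosa, or torchaudio for resampling; `prepare_audio` is a hypothetical helper name.

```python
import numpy as np

TARGET_SR = 16_000   # Whisper expects 16 kHz audio
MAX_SEGMENT_S = 30   # in segments of at most 30 seconds


def prepare_audio(samples: np.ndarray, sr: int) -> list[np.ndarray]:
    """Downmix to mono, resample to 16 kHz, split into <=30 s chunks.

    Naive linear-interpolation resampling for illustration; use a
    proper resampler (ffmpeg/librosa/torchaudio) in production.
    """
    if samples.ndim == 2:  # (n_samples, channels) -> mono
        samples = samples.mean(axis=1)
    if sr != TARGET_SR:
        duration = len(samples) / sr
        n_out = int(duration * TARGET_SR)
        old_t = np.linspace(0.0, duration, num=len(samples), endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        samples = np.interp(new_t, old_t, samples)
    chunk = MAX_SEGMENT_S * TARGET_SR
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]
```

Segmenting on silence boundaries rather than fixed 30-second windows usually gives cleaner transcript alignment, but the fixed-window version above is the simplest correct starting point.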

Training setup
Fine-tune the smallest Whisper variant that gives you acceptable baseline accuracy. Smaller models train faster, fit in less memory, and overfit less aggressively. Use a low learning rate, somewhere in the 1e-5 to 1e-6 range, for a few epochs. Freeze the encoder if your data is small or noisy. Watch the validation loss and stop early when it stops improving. Whisper rewards short, careful runs more than long brute-force ones.
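The freeze-and-stop-early advice can be sketched as follows. The `patience` and `min_delta` values are assumptions to tune, not Whisper-specific defaults, and `should_stop` is a hypothetical helper.

```python
# Freezing the encoder with Hugging Face transformers looks roughly like:
#   for p in model.model.encoder.parameters():
#       p.requires_grad = False


def should_stop(val_losses, patience=3, min_delta=1e-3):
    """Stop when validation loss has not improved by at least
    min_delta within the last `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss outside the window
    recent_best = min(val_losses[-patience:])   # best loss inside the window
    return recent_best > best_before - min_delta
```

In practice the same behavior comes for free from your trainer's early-stopping callback; the point is to wire one up rather than train for a fixed large number of epochs.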

Evaluation that means something
Word error rate (WER) on a held-out set is the headline number, but it can be misleading. Break it down by speaker, by accent, by clip length, by signal-to-noise ratio, and by vocabulary domain. A 1 point average WER drop that hides a 5 point regression on a key user group is a regression. Also evaluate on out-of-distribution audio, like a different show or environment, to estimate how the model will perform when reality drifts from your training set.
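Sliced WER needs nothing more than a word-level edit distance and a group key. The sketch below uses hypothetical names (`word_errors`, `wer_by_group`); real projects often use a library such as jiwer instead.

```python
from collections import defaultdict


def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Return (word-level edit distance, reference word count)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # rolling row of the DP table
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                      # deletion
                       d[j - 1] + 1,                  # insertion
                       prev + (r[i - 1] != h[j - 1])) # substitution
            prev = cur
    return d[-1], len(r)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) triples.
    Returns WER per group, pooled over words rather than averaged per clip."""
    totals = defaultdict(lambda: [0, 0])
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        totals[group][0] += e
        totals[group][1] += n
    return {g: e / max(n, 1) for g, (e, n) in totals.items()}
```

Pooling errors over words (rather than averaging per-clip WER) keeps short clips from dominating the metric; group by whatever axis matters, such as speaker ID or SNR bucket.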
Where the data comes from
You can record your own, scrape it from your existing recordings, or buy licensed audio from a marketplace. The licensed path is usually the fastest way to get a curated, transcript-aligned, rights-cleared corpus that is ready to feed into a fine-tune script. We supply that kind of data ourselves, and customers who use it generally hit their target metrics with less engineering effort than they expected.

Frequently asked questions
Which Whisper size should I fine-tune?
Start with small or medium. Fine-tuning large is heavier and rarely necessary unless you have plenty of data and very specific accuracy needs.
Do I need a GPU for fine-tuning Whisper?
Yes. Even the small variants benefit from a GPU, and medium and above effectively require one.
Can I fine-tune on synthetic TTS data?
You can, and it can help vocabulary coverage, but it is no substitute for real audio in real conditions.
How long does a fine-tune typically take?
A few hours to a few days, depending on data size and Whisper variant. The data preparation usually takes longer than the training itself.
How do I avoid catastrophic forgetting?
Mix some general-domain data into your fine-tuning corpus, use a low learning rate, and keep training short.
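The mixing step can be sketched as below. The 20 percent general-domain fraction is an assumption to tune, and `mixed_training_set` is a hypothetical helper, not part of any Whisper tooling.

```python
import random


def mixed_training_set(domain_data, general_data, general_frac=0.2, seed=0):
    """Return a shuffled training list where roughly `general_frac` of the
    samples come from a general-domain corpus, to limit forgetting."""
    rng = random.Random(seed)
    # solve g / (len(domain) + g) = general_frac for g
    n_general = int(len(domain_data) * general_frac / (1 - general_frac))
    mixed = list(domain_data)
    mixed += rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed
```

A common source for the general-domain slice is the data Whisper already handles well, such as public read-speech corpora, so the model keeps seeing examples of its original distribution.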
Looking to license speech data?
Studio-grade conversational audio with aligned transcripts, full speaker metadata, and a documented chain of consent for every file. Get a sample within 48 hours of signing an NDA.
Request a sample →