SPEECH RECOGNITION · ASR TRAINING DATA

Train ASR models on the speech your users actually produce.

Read-aloud datasets like LibriSpeech are great for benchmarks and useless for production. Your users don't read from a script — they interrupt each other, they trail off, they switch languages mid-sentence, they record from a kitchen with the dishwasher running. We license real conversational audio with word-level transcripts, full speaker metadata, and explicit commercial training rights.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified diarization · Commercial training rights
Compare

Why our data outperforms read-aloud datasets for ASR.

Criteria compared across aipodcast conversational, read-aloud open datasets, and generic crowd vendors:

Natural conversational pacing & overlap
Disfluencies (uh, um, false starts, restarts)
Multi-speaker turn-taking
Studio-grade acoustic baseline
Word-level aligned transcripts
Speaker diarization labels
Rich speaker metadata
Commercial training rights (read the license)
Per-file provenance
Specs

What we deliver for ASR.

Audio

Format
48 kHz / 24-bit WAV
Resample
16 kHz on request
Channels
Multi-track per speaker
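Because 48 kHz is an exact multiple of the 16 kHz most ASR front-ends expect, downsampling is a clean integer decimation. A minimal stdlib-only sketch (the file name is hypothetical, and a production pipeline would low-pass filter before decimating to avoid aliasing):

```python
import math
import wave

SRC_RATE, DST_RATE = 48000, 16000

# Write a 1-second 440 Hz tone as 48 kHz / 24-bit mono WAV (stand-in for a delivered file).
samples = [int(0.5 * 8388607 * math.sin(2 * math.pi * 440 * n / SRC_RATE))
           for n in range(SRC_RATE)]
with wave.open("tone48k.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(3)          # 3 bytes per sample = 24-bit
    w.setframerate(SRC_RATE)
    w.writeframes(b"".join(s.to_bytes(3, "little", signed=True) for s in samples))

# Read the 24-bit frames back and decimate 48 kHz -> 16 kHz (factor of exactly 3).
with wave.open("tone48k.wav", "rb") as r:
    raw = r.readframes(r.getnframes())
src = [int.from_bytes(raw[i:i + 3], "little", signed=True)
       for i in range(0, len(raw), 3)]
dst = src[::3]
print(len(src), len(dst))  # 48000 16000
```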

Transcripts

Timestamps
Word-level
Diarization
Per-speaker labels
Disfluencies
Retained or stripped, at your option

Formats

Recommended
JSON for ASR pipelines
Also
WebVTT, SRT, TextGrid
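To illustrate what word-level JSON can drive downstream, here is a sketch against a made-up schema (field names such as `word`, `start`, `speaker`, and `disfluency` are assumptions, not the actual delivery format) that applies the "stripped" disfluency option and groups words by speaker:

```python
import json

# Hypothetical word-level transcript payload -- schema is illustrative only.
payload = json.dumps({
    "audio": "ep042_ch1.wav",
    "words": [
        {"word": "so",   "start": 0.12, "end": 0.25, "speaker": "S1"},
        {"word": "um",   "start": 0.31, "end": 0.48, "speaker": "S1", "disfluency": True},
        {"word": "yeah", "start": 0.40, "end": 0.62, "speaker": "S2"},
    ],
})

doc = json.loads(payload)
# Drop flagged disfluencies, then collect each speaker's remaining words in time order.
clean = [w for w in doc["words"] if not w.get("disfluency")]
turns = {}
for w in clean:
    turns.setdefault(w["speaker"], []).append(w["word"])
print(turns)  # {'S1': ['so'], 'S2': ['yeah']}
```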

Speaker metadata

Demographics
Age range, gender, L1, accent region
Environment
Mic model, room treatment notes

Acoustic metadata

Mic
Per-file
Sample rate
Per-file
Bit depth
Per-file

Optional

Phoneme
Phoneme-level alignment
QA
Human-verified pass
Use cases

Common ASR use cases we support.

Foundation pretraining

Large multi-language conversational corpora as a counterweight to LibriSpeech and Common Voice.

Accent & dialect expansion

Targeted collection in the regional accents where your coverage is thin. Any language with podcast infrastructure.

Domain adaptation

Interview-style, panel-style, scripted-dialogue, narrative monologue — sample by scenario.

Diarization training

Multi-speaker audio with verified speaker boundaries and per-channel separation where available.
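One thing verified speaker boundaries enable is measuring overlapped speech, a quantity read-aloud corpora lack entirely. A sweep-line sketch over hypothetical RTTM-style segments (the tuples below are illustrative, not a delivered file):

```python
# Hypothetical diarization segments: (speaker, start_sec, end_sec).
segments = [("S1", 0.0, 4.2), ("S2", 3.8, 7.5), ("S1", 7.1, 9.0)]

def overlap_seconds(segs):
    """Total time during which two or more speakers are active (sweep line)."""
    events = []
    for _, start, end in segs:
        events += [(start, 1), (end, -1)]
    events.sort()  # ends (-1) sort before starts (+1) at equal times
    active, last_t, total = 0, 0.0, 0.0
    for t, delta in events:
        if active >= 2:
            total += t - last_t
        active += delta
        last_t = t
    return total

print(round(overlap_seconds(segments), 2))  # 0.8
```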

WER benchmarking

Held-out evaluation sets that look like your production traffic, not like read-aloud reference corpora.
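WER itself is just word-level edit distance divided by reference length; a self-contained sketch you could run over such a held-out set:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words / reference word count."""
    r, h = ref.split(), hyp.split()
    # Single-row dynamic program over the hypothesis.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (rw != hw))   # substitution or match
            prev, d[j] = d[j], cur
    return d[-1] / max(len(r), 1)

print(wer("turn the lights off", "turn the light off"))  # 0.25
```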

Code-switching

Multi-language speakers in the same conversation, where the catalog supports it.

Want a representative ASR sample?

30 minutes of audio + transcripts + metadata, delivered within 48 hours of a signed NDA. Run it through your pipeline before you talk to us.