Train ASR models on the speech your users actually produce.
Read-aloud datasets like LibriSpeech are great for benchmarks and useless for production. Your users don't read from a script — they interrupt each other, they trail off, they switch languages mid-sentence, they record from a kitchen with the dishwasher running. We license real conversational audio with word-level transcripts, full speaker metadata, and explicit commercial training rights.
Why our data outperforms read-aloud datasets for ASR.
| Capability | aipodcast conversational | Read-aloud open datasets | Generic crowd vendors |
|---|---|---|---|
| Natural conversational pacing & overlap | ✓ | ✕ | ~ |
| Disfluencies (uh, um, false starts, restarts) | ✓ | ✕ | ~ |
| Multi-speaker turn-taking | ✓ | ✕ | ~ |
| Studio-grade acoustic baseline | ✓ | ~ | ~ |
| Word-level aligned transcripts | ✓ | ~ | ~ |
| Speaker diarization labels | ✓ | ✕ | ~ |
| Rich speaker metadata | ✓ | ✕ | ~ |
| Commercial training rights | ✓ | ~ Read the license | ~ |
| Per-file provenance | ✓ | ✕ | ✕ |
What we deliver for ASR.
Audio
- Format: 48 kHz / 24-bit WAV
- Resampling: 16 kHz on request (a downsampling sketch follows this list)
- Channels: multi-track, one channel per speaker
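If you take delivery at 48 kHz and your acoustic front end expects 16 kHz, the downsample is a few lines. A minimal sketch, assuming soundfile and scipy are installed; the file names are placeholders, not our delivery layout:

```python
# 48 kHz -> 16 kHz downsampling sketch (file names are placeholders, not the delivery layout).
import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("episode_0001_speaker_a.wav")  # 48 kHz / 24-bit source track
assert rate == 48_000

# Polyphase resampling: 48 kHz / 3 = 16 kHz. axis=0 handles multi-channel files per channel.
audio_16k = resample_poly(audio, up=1, down=3, axis=0)
sf.write("episode_0001_speaker_a_16k.wav", audio_16k, 16_000, subtype="PCM_16")
```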
Transcripts
- Timestamps: word-level
- Diarization: per-speaker labels
- Disfluencies: retained or stripped, your choice
Formats
- Recommended: JSON for ASR pipelines (a loading sketch follows this list)
- Also available: WebVTT, SRT, TextGrid
Speaker metadata
- Demographics: age range, gender, L1, accent region
- Environment: mic model, room treatment notes
Acoustic metadata
- Mic model: per file
- Sample rate: per file
- Bit depth: per file
Optional
- Phoneme-level alignment
- Human-verified QA pass
Common ASR use cases we support.
Foundation pretraining
Large multi-language conversational corpora as a counterweight to LibriSpeech and Common Voice.
Accent & dialect expansion
Targeted collection in the regional accents where your coverage is weakest. Any language with podcast infrastructure.
Domain adaptation
Interview-style, panel-style, scripted-dialogue, narrative monologue — sample by scenario.
Diarization training
Multi-speaker audio with verified speaker boundaries and per-channel separation where available.
WER benchmarking
Held-out evaluation sets that look like your production traffic, not like read-aloud reference corpora.
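If you want a quick way to score a held-out set, one common choice is the jiwer package; this is a sketch with made-up reference/hypothesis pairs, not part of the delivery:

```python
# Corpus-level WER scoring sketch using jiwer (an illustrative choice, not part of the delivery).
import jiwer

references = ["yeah so um we shipped the model on friday",
              "can you hear me okay"]
hypotheses = ["yeah so we shipped the model on friday",
              "can you hear me ok"]

# jiwer.wer computes word error rate over the paired lists.
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```

Whether disfluencies count against your model depends on your normalization; that is exactly why we let you take transcripts with them retained or stripped.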
Code-switching
Multi-language speakers in the same conversation, where the catalog supports it.
Want a representative ASR sample?
30 minutes of audio + transcripts + metadata, delivered within 48 hours of a signed NDA. Run it through your pipeline before you talk to us.