ASR training data that actually transcribes the room.
Automatic speech recognition models live or die on transcript quality and acoustic diversity. We deliver both — multi-speaker conversational audio with sub-100ms word alignment, real-room acoustics, and rich metadata, all consented for training.
Built for the work.
Word-level alignment
Sub-100ms time-aligned transcripts in JSON, CTM, TextGrid, SRT, or VTT. Casing and punctuation preserved (sample below).
Speaker diarization
Clean speaker turns with stable per-speaker IDs across long-form interviews, panels, and round-tables. RTTM included.
Acoustic diversity
Shure SM7B, Rode NT1, MKH 416, Zoom H6, AirPods, and PSTN call legs. Treated rooms and bad bedrooms in the same shard.
Long-form context
Hours of continuous conversation, not 10-second clips — the gap most ASR datasets leave wide open.
Domain coverage
Tech, business, health, finance, entertainment, sports, politics, science. Specify the mix you need by hours.
Transcription QA
Machine-aligned by default, human-reviewed on demand. Benchmark sets ship at <1% WER.
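To make the word-level alignment concrete, here is an illustrative JSON shape. The field names are representative rather than a fixed schema, since custom formats are available on request:

```json
{
  "speaker": "spk_0412",
  "words": [
    {"word": "Welcome", "start": 0.320, "end": 0.710},
    {"word": "back,", "start": 0.710, "end": 1.050},
    {"word": "everyone.", "start": 1.120, "end": 1.680}
  ]
}
```

Note that casing and punctuation survive on the word tokens themselves.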
Where this data ends up in production.
Call-center transcription
Multi-speaker, narrowband-friendly audio for agent-customer ASR with disfluencies and crosstalk preserved.
Meeting notes & summaries
Long-form panel and round-table audio that mirrors Zoom, Meet, and in-room conference acoustics.
Podcast indexing
Episode-length audio with chapter-aware transcripts — train search and recommendation against the medium itself.
Video captioning
Broadcast-loudness audio normalized to EBU R128, ready for caption pipelines and live-event ASR (loudness check sketched below).
Accessibility
WCAG-grade caption training data with named-entity tags, speaker labels, and non-speech event annotation.
Voice agents
Conversational turn-taking and interruption data for full-duplex agent stacks built on Whisper or USM.
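If you want to verify the EBU R128 loudness target on a delivered file, a minimal sketch using pyloudnorm and soundfile (both assumed installed; the filename is hypothetical):

```python
import soundfile as sf
import pyloudnorm as pyln

# Measure integrated loudness per ITU-R BS.1770, the basis of EBU R128.
data, rate = sf.read("ep042_panel.wav")
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

print(f"integrated loudness: {loudness:.1f} LUFS (R128 target: -23.0)")
```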
From email to first manifest.
Sample request
Tell us the model, target locales, and hours. We return a 30-minute representative sample with audio, alignment, and diarization within 48 hours.
Mutual NDA
Standard one-page mutual. We have signed with frontier labs you have heard of; the legal review is short.
MSA + data licence
Perpetual commercial training licence, named contact for life, written speaker release on every voice in the shard.
First delivery
Pilot shard (typically 10–25 hrs) with full manifest, SHA-256 per file, alignment, diarization, and consent receipts.
Manifest & provenance
Per-file lineage: speaker ID, recording date, mic chain, room, consent version, jurisdiction. Audit-ready out of the box (sample entry below).
Ongoing delivery
Monthly increments, revocation SLA, locale expansion, and a named human on Slack — not a ticket queue.
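A sketch of what one manifest entry could look like; the field names and values here are illustrative, not a committed schema:

```json
{
  "file": "shard01/ep042_panel.wav",
  "sha256": "<64-hex digest>",
  "duration_s": 3187.4,
  "speakers": ["spk_0412", "spk_0413"],
  "recorded": "2024-03-18",
  "mic_chain": "Shure SM7B > Zoom H6",
  "room": "untreated_bedroom",
  "consent_version": "v2.1",
  "jurisdiction": "US-CA",
  "alignment": "shard01/ep042_panel.words.json",
  "diarization": "shard01/ep042_panel.rttm"
}
```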
Common questions.
What is ASR training data?
Paired speech audio and text transcripts used to train automatic speech recognition models. Quality, diversity, and alignment accuracy determine downstream WER far more than raw hour count.
Which models is this data designed for?
Whisper-large fine-tunes, Conformer, USM, NVIDIA NeMo, wav2vec2, HuBERT, and any encoder-decoder ASR architecture. We deliver in the manifest formats those training pipelines expect — JSONL for Whisper, NeMo manifest for NeMo, TSV for fairseq.
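For example, a NeMo-style manifest is one JSON object per line, keyed on audio path, duration, and transcript (the path and text here are made up for illustration):

```json
{"audio_filepath": "shard01/ep042_panel.wav", "duration": 3187.4, "text": "Welcome back, everyone. My guest today needs no introduction."}
```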
How many hours do I need to fine-tune Whisper?
For domain adaptation, 50–200 hours of in-domain audio is the typical sweet spot. For a new low-resource locale from scratch, plan on 500+ hours. We will help you size it on the sample call.
Do you provide diarization?
Yes. Every multi-speaker file ships with stable per-speaker IDs in RTTM and JSON, including overlap regions. You can drop it straight into pyannote training.
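As an illustration, loading a shipped RTTM with pyannote (assuming pyannote.database is installed; the filename and IDs are hypothetical):

```python
from pyannote.database.util import load_rttm

# An RTTM line carries: type, file ID, channel, onset, duration,
# then the speaker ID, with <NA> in the unused columns, e.g.:
#   SPEAKER ep042_panel 1 12.480 4.310 <NA> <NA> spk_0412 <NA> <NA>
annotations = load_rttm("ep042_panel.rttm")  # {file_id: pyannote.core.Annotation}

for segment, _, speaker in annotations["ep042_panel"].itertracks(yield_label=True):
    print(f"{speaker}: {segment.start:.2f}s to {segment.end:.2f}s")
```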
What transcript format do you use?
Word-level JSON with start/end timestamps, casing, and punctuation by default. CTM, TextGrid, SRT, VTT, and custom formats on request. Transcripts include disfluencies — we do not silently clean them out.
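The CTM variant is the standard NIST column format, one word per line: file ID, channel, start time, duration, word, and an optional confidence. An illustrative line:

```
ep042_panel 1 12.480 0.390 Welcome 0.99
```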
How accurate are the transcripts?
Machine-aligned with sub-100ms word boundary accuracy on clean audio. Optional human review brings benchmark sets to under 1% WER on clean speech.
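WER is the standard edit-distance rate: (substitutions + deletions + insertions) divided by the number of reference words. You can reproduce a score with jiwer (assumed installed):

```python
from jiwer import wer

reference = "welcome back to the show everyone"
hypothesis = "welcome back to show everyone"  # one deletion out of six words

# wer = (S + D + I) / N = (0 + 1 + 0) / 6 ≈ 0.167
print(wer(reference, hypothesis))
```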
What about long-form context and real-room acoustics?
This is the gap most ASR datasets leave open. We deliberately ship 20–90 minute continuous conversations and mix studio, untreated room, remote-guest, and field recordings so the model does not collapse the moment it leaves a clean clip.
Is this data legally safe to train on?
Yes. 100% written speaker consent, perpetual commercial training licence, named consent contact for life, full per-speaker provenance, and a written revocation SLA. We are the only supplier with that whole stack.
Can I get a sample?
Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, alignment, diarization, and metadata within 48 hours of NDA.
Want a representative sample?
30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.