SOLUTIONS

ASR training data that actually transcribes the room.

Automatic speech recognition models live or die on transcript quality and acoustic diversity. We deliver both — multi-speaker conversational audio with sub-100ms word alignment, real-room acoustics, and rich metadata, all consented for training.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified consent · Commercial training rights
350+
Hours of conversational audio
2,400+
Consented speakers
<100ms
Word boundary alignment
48 kHz / 24-bit
WAV masters
§ 01 — What you get

Built for the work.

Word-level alignment

Sub-100ms time-aligned transcripts in JSON, CTM, TextGrid, SRT or VTT. Casing and punctuation preserved.
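As a rough sketch of what a word-level aligned JSON transcript can look like, the snippet below loads a hypothetical entry and checks that word boundaries are monotonic. The field names (`words`, `word`, `start`, `end`) are illustrative, not a published schema; times are in seconds.

```python
import json

# Hypothetical word-level alignment entry -- field names are illustrative.
# The sub-100ms claim means each start/end lands within 0.1 s of the true
# word boundary in the audio.
transcript = json.loads("""
{
  "words": [
    {"word": "Welcome",   "start": 0.512, "end": 0.918},
    {"word": "back,",     "start": 0.918, "end": 1.204},
    {"word": "everyone.", "start": 1.260, "end": 1.893}
  ]
}
""")

# Sanity check: timestamps are monotonic and words do not overlap.
words = transcript["words"]
for prev, cur in zip(words, words[1:]):
    assert prev["end"] <= cur["start"] + 1e-6

print(" ".join(w["word"] for w in words))
```

Note that casing and punctuation survive in the `word` field itself, which is what lets the same alignment drive both caption rendering and forced-alignment training.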

Speaker diarization

Clean speaker turns with stable per-speaker IDs across long-form interviews, panels, and round-tables. RTTM included.
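RTTM is the NIST-defined diarization interchange format: one `SPEAKER` record per turn, with onset and duration in seconds and a stable speaker name in the eighth field. A minimal reader, with made-up file and speaker names for illustration:

```python
# Minimal RTTM reader. Each record has the shape:
# SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
def read_rttm(lines):
    turns = []
    for line in lines:
        f = line.split()
        if not f or f[0] != "SPEAKER":
            continue
        turns.append({
            "file": f[1],
            "onset": float(f[3]),
            "duration": float(f[4]),
            "speaker": f[7],
        })
    return turns

sample = [
    "SPEAKER ep001 1 12.340 4.210 <NA> <NA> spk_host <NA> <NA>",
    "SPEAKER ep001 1 16.550 9.870 <NA> <NA> spk_guest_1 <NA> <NA>",
]
turns = read_rttm(sample)
print(turns[1]["speaker"], turns[1]["onset"])
```

Because speaker names stay stable across a long-form file, `spk_host` in minute 3 and `spk_host` in minute 80 are the same voice, which is what diarization training needs.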

Acoustic diversity

Shure SM7B, Rode NT1, MKH 416, Zoom H6, AirPods, and PSTN call legs. Treated rooms and bad bedrooms in the same shard.

Long-form context

Hours of continuous conversation, not 10-second clips — the gap most ASR datasets leave wide open.

Domain coverage

Tech, business, health, finance, entertainment, sports, politics, science. Specify the mix you need by hours.

Transcription QA

Machine-aligned by default, human-reviewed on demand. Benchmark sets ship at <1% WER.

§ 02 — By model architecture

How much do you actually need?

| Model | Recommended hours | Format we ship | Notes |
|---|---|---|---|
| Whisper-large fine-tune | 50–200 hrs in-domain | 16 kHz mono FLAC + JSONL | Word timestamps, language tag, prompt field |
| Whisper from scratch | 680k+ hrs (don’t) | Manifest only | Use the open weights and fine-tune instead |
| Conformer / NeMo | 300–1,000 hrs | 16 kHz WAV + NeMo manifest | Char and BPE tokenizer ready |
| USM-style universal | 2,000+ hrs across locales | 48 kHz WAV + Parquet | Multilingual sharding by locale |
| wav2vec2 / HuBERT | 500+ hrs unlabeled + 10–100 hrs labeled | 16 kHz WAV + TSV | Self-supervised pretrain split included |
| Streaming / RNN-T | 200–500 hrs low-latency | 16 kHz WAV + force-aligned CTM | Chunked, no future-context leakage |
| Diarization (pyannote) | 100+ hrs multi-speaker | RTTM + 48 kHz WAV | Overlapped speech labeled |
§ 03 — Use cases

Where this data ends up in production.

Call-center transcription

Multi-speaker, narrowband-friendly audio for agent-customer ASR with disfluencies and crosstalk preserved.

Meeting notes & summaries

Long-form panel and round-table audio that mirrors Zoom, Meet, and in-room conference acoustics.

Podcast indexing

Episode-length audio with chapter-aware transcripts — train search and recommendation against the medium itself.

Video captioning

Broadcast-loudness audio normalized to EBU R128, ready for caption pipelines and live-event ASR.

Accessibility

WCAG-grade caption training data with named-entity tags, speaker labels, and non-speech event annotation.

Voice agents

Conversational turn-taking and interruption data for full-duplex agent stacks built on Whisper or USM.

§ 04 — How engagement works

From email to first manifest.

01

Sample request

Tell us the model, target locales, and hours. We return a 30-minute representative sample with audio, alignment, and diarization within 48 hours.

02

Mutual NDA

Standard one-page mutual. We have signed with frontier labs you have heard of; the legal review is short.

03

MSA + data licence

Perpetual commercial training licence, named contact for life, written speaker release on every voice in the shard.

04

First delivery

Pilot shard (typically 10–25 hrs) with full manifest, SHA-256 per file, alignment, diarization, and consent receipts.
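Per-file SHA-256 means the receiving side can verify a shard before anything enters the training pipeline. A sketch, assuming the manifest uses the common `sha256sum`-style layout of `<hex digest>  <path>` per line (the exact layout is an assumption, not a documented format):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk)
            if not block:
                return h.hexdigest()
            h.update(block)

def verify(manifest_path):
    """Return the list of files whose digest does not match the manifest."""
    bad = []
    with open(manifest_path) as fh:
        for line in fh:
            digest, path = line.strip().split(None, 1)
            if sha256_of(path) != digest:
                bad.append(path)
    return bad  # empty list means the shard verified clean
```

Running `verify()` on delivery gives an audit-ready record that the bytes trained on are the bytes that were consented and shipped.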

05

Manifest & provenance

Per-file lineage: speaker ID, recording date, mic chain, room, consent version, jurisdiction. Audit-ready out of the box.

06

Ongoing delivery

Monthly increments, revocation SLA, locale expansion, and a named human on Slack — not a ticket queue.

§ 05 — FAQ

Common questions.

What is ASR training data?

Paired speech audio and text transcripts used to train automatic speech recognition models. Quality, diversity, and alignment accuracy determine downstream WER far more than raw hour count.

Which models is this data designed for?

Whisper-large fine-tunes, Conformer, USM, NVIDIA NeMo, wav2vec2, HuBERT, and any encoder-decoder ASR architecture. We deliver in the manifest formats those training pipelines expect — JSONL for Whisper, NeMo manifest for NeMo, TSV for fairseq.
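For concreteness, here are illustrative manifest lines for two of those pipelines. The NeMo shape (`audio_filepath` / `duration` / `text`) and the fairseq wav2vec2 TSV (root directory on the first line, then `path<TAB>sample_count`) follow those projects' documented conventions; the file paths and text are invented for the example.

```python
import json

# One line of a NeMo-style ASR manifest (JSONL: one object per utterance).
nemo_line = json.dumps({
    "audio_filepath": "shard_000/ep001_seg042.wav",
    "duration": 14.72,
    "text": "so the alignment budget is under a hundred milliseconds",
})

# A two-line fairseq wav2vec2 TSV: root dir, then relative path + samples.
fairseq_tsv = "\n".join([
    "/data/asr/shard_000",       # root directory on the first line
    "ep001_seg042.wav\t235520",  # relative path, sample count at 16 kHz
])
```

Whisper fine-tuning scripts vary in the JSONL keys they expect, so check the loader you are targeting before settling on a schema.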

How many hours do I need to fine-tune Whisper?

For domain adaptation, 50–200 hours of in-domain audio is the typical sweet spot. For a new low-resource locale from scratch, plan on 500+ hours. We will help you size it on the sample call.

Do you provide diarization?

Yes. Every multi-speaker file ships with stable per-speaker IDs in RTTM and JSON, including overlap regions. You can drop it straight into pyannote training.

What transcript format do you use?

Word-level JSON with start/end timestamps, casing, and punctuation by default. CTM, TextGrid, SRT, VTT, and custom formats on request. Transcripts include disfluencies — we do not silently clean them out.

How accurate are the transcripts?

Machine-aligned with sub-100ms word boundary accuracy on clean audio. Optional human review brings benchmark sets to under 1% WER on clean speech.

What about long-form context and real-room acoustics?

This is the gap most ASR datasets leave open. We deliberately ship 20–90 minute continuous conversations and mix studio, untreated room, remote-guest, and field recordings so the model does not collapse the moment it leaves a clean clip.

Is this data legally safe to train on?

Yes. 100% written speaker consent, perpetual commercial training licence, named consent contact for life, full per-speaker provenance, and a written revocation SLA. We are the only supplier with that whole stack.

Can I get a sample?

Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, alignment, diarization, and metadata within 48 hours of NDA.

Want a representative sample?

30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.