Train voice agents that sound like people, not IVR menus.
Most “conversational” datasets are scripted reads with two voice actors taking polite turns. Real conversation has interruptions, overlapping speech, backchannels, false starts, repair sequences, and pacing that varies with topic. We license multi-speaker dialogue from real podcasters who do this every day for a living.
What we deliver.
Multi-speaker conversations
Verified turn boundaries and per-speaker labels. Two-, three-, and four-way dialogue available.
Natural turn-taking
Overlap regions preserved, not edited out. The social signal that makes voice agents feel human.
Backchannel events
“Uh-huh,” “right,” “mm-hmm” — annotated where available.
Repair sequences
False starts, restarts, and self-corrections preserved, not cleaned up to look pretty.
Topic & domain metadata
Sample by scenario for retrieval-augmented agent training.
Multi-track audio
One channel per speaker where the original recording supports it, which gives diarization training clean per-speaker reference audio.
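As a rough illustration of what preserved overlap and per-speaker turn boundaries make possible, the sketch below finds overlap regions from a list of labeled turns. The record layout (`Turn` with `speaker`, `start`, `end` in seconds) and the example values are assumptions made for this sketch, not the actual delivery format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """Hypothetical turn record: a speaker label plus start/end times in seconds."""
    speaker: str
    start: float
    end: float


def overlap_regions(turns: List[Turn]) -> List[Tuple[str, str, float, float]]:
    """Return (speaker_a, speaker_b, start, end) for every pair of turns
    from different speakers whose time spans intersect."""
    ordered = sorted(turns, key=lambda t: t.start)
    regions = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # turns are sorted by start, so nothing later overlaps a
            if a.speaker != b.speaker:
                regions.append((a.speaker, b.speaker, b.start, min(a.end, b.end)))
    return regions


# Illustrative values only, not taken from a real file.
example = [
    Turn("host", 0.0, 4.2),
    Turn("guest", 3.8, 7.5),   # starts before the host finishes
    Turn("host", 5.0, 5.4),    # short backchannel inside the guest's turn
    Turn("guest", 8.0, 9.0),
]
regions = overlap_regions(example)
for speaker_a, speaker_b, start, end in regions:
    print(f"{speaker_a} / {speaker_b}: {start:.1f}s to {end:.1f}s")
```

In a dataset where overlaps have been edited out, a check like this returns nothing, which is one quick way to tell scripted or cleaned audio from natural dialogue.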
Common voice-agent use cases we support.
Foundation training
- For: Full-duplex speech agents
- Why: The kind that can interrupt and be interrupted
Turn-taking modeling
- For: Backchannel & gap detection
- Why: Social signals that make agents feel human
Domain adaptation
- Interview: Support agents
- Panel: Multi-party agents
- Narrative: Monologue agents
Latency benchmarking
- Source: Natural conversations
- Reference: Real human-baseline turn latencies
Evaluation corpora
- For: End-to-end voice agent quality scoring
- Held-out: Designed to mirror production traffic
Diarization
- Source: Multi-track audio
- Labels: Verified speaker boundaries per file
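For the latency benchmarking use case above, a human baseline could be estimated from verified turn boundaries along these lines. This is a minimal sketch: the `(speaker, start, end)` tuple layout and the sample values are assumptions for illustration, and the gap definition (next turn's start minus previous turn's end, at speaker changes only) is one common convention rather than a prescribed method.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical turn records: (speaker, start_seconds, end_seconds).
Turn = Tuple[str, float, float]


def turn_gaps(turns: List[Turn]) -> List[float]:
    """Gap in seconds at each speaker change: the next turn's start minus the
    previous turn's end. Negative values mean the next speaker started
    before the previous one finished (overlap)."""
    ordered = sorted(turns, key=lambda t: t[1])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev[0] != nxt[0]:  # only count transitions between different speakers
            gaps.append(round(nxt[1] - prev[2], 3))
    return gaps


# Illustrative values only, not measured data.
sample = [
    ("a", 0.0, 2.0),
    ("b", 2.25, 5.0),
    ("a", 4.5, 6.0),   # starts while "b" is still talking
    ("b", 6.5, 8.0),
]
gaps = turn_gaps(sample)
baseline = median(gaps)
print(f"gaps: {gaps}, median gap: {baseline:.2f}s")
```

A real analysis would likely filter backchannels out before measuring gaps, since a short "mm-hmm" is not a change of floor, and would report the full distribution rather than a single median.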
Why this is hard to source elsewhere.
Most “conversational” datasets in the open and commercial markets fall into one of these traps. We don't.
| Source | What's wrong with it |
|---|---|
| Two-speaker scripted reads | Clean but unnatural — no interruptions, no overlap, no real pacing |
| Telephone customer service recordings | Natural but legally unusable — no consent for AI training |
| Single-speaker podcasts | Wrong shape — no dialogue dynamics |
| YouTube-scraped interviews | Legally unusable — no consent, no provenance, no contactable speakers |
| aipodcast multi-speaker conversations | Real dialogue, signed releases, contactable speakers, per-file provenance — what you actually need |
Want a multi-speaker sample with diarization?
Get a representative conversational sample with full diarization within 48 hours of a signed NDA. Real overlap, real backchannels, real repairs.