SOLUTIONS

Conversational AI data that sounds like people, not scripts.

Conversational AI needs more than tidy turns. It needs interruptions, overlap, backchannels, topic drift, register shifts, and disfluencies — the things most datasets silently scrub out. Our catalogue is built from working podcasts where all of that happens naturally, with verified consent from every speaker.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified consent · Commercial training rights
350+
Hours of multi-speaker dialogue
2–6
Speakers per recording
30–120 min
Continuous turn context
Preserved
Disfluencies, overlap, backchannels
§ 01 — What you get

Built for the work.

Multi-speaker dialogue

Real interviews, panels, and round-tables with two to six speakers per recording.

Natural turn-taking

Backchannels, hedges, interruptions, and overlap captured in context with timestamped boundaries.

Disfluencies preserved

Ums, uhs, false starts, repetitions, self-corrections, and laughter — kept in the transcript and tagged. Audio LLMs need them.

Speaker metadata

Per-speaker language, dialect, age range, and role (host, guest, expert, caller).

Long sessions

30–120 minute continuous conversations — not isolated turns. Context windows your model can actually use.

Aligned transcripts

Word-level alignment, RTTM diarization, overlap regions, and backchannel tags included by default.
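
To make the alignment deliverable concrete, here is a minimal sketch of consuming a word-level record with backchannel and disfluency tags. The field names (word, start, speaker, tags) and the tag vocabulary are illustrative assumptions, not the shipped schema:

```python
import json

# Hypothetical word-level records; real field names and tags may differ.
sample = json.loads("""
[
  {"word": "so",    "start": 12.41, "end": 12.55, "speaker": "spk01", "tags": []},
  {"word": "um",    "start": 12.60, "end": 12.84, "speaker": "spk01", "tags": ["disfluency:filler"]},
  {"word": "the",   "start": 12.90, "end": 13.01, "speaker": "spk01", "tags": ["disfluency:false_start"]},
  {"word": "mhm",   "start": 13.02, "end": 13.20, "speaker": "spk02", "tags": ["backchannel", "overlap"]},
  {"word": "thing", "start": 13.05, "end": 13.34, "speaker": "spk01", "tags": []}
]
""")

# Pull out every backchannel with its speaker and timestamp.
for w in sample:
    if "backchannel" in w["tags"]:
        print(f'{w["speaker"]} backchannels "{w["word"]}" at {w["start"]:.2f}s')
```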

§ 02 — Conversation spec

What's in the manifest.

Field | Coverage | Format | Notes
Speakers per recording | 2–6 | RTTM + JSON | Stable IDs across episodes
Recording length | 30–120 min (some 180+) | 48 kHz / 24-bit WAV | Continuous, not chunked
Overlap rate | 5–18% typical | Region tags | Filterable by overlap density
Backchannels | Tagged inline | Word-level JSON | mhm, yeah, right, wow
Disfluencies | Preserved + tagged | Word-level JSON | um, uh, false start, repair
Topic shifts | Annotated | Segment-level | Useful for dialog state tracking
Register | Casual / formal / technical | Per-segment tag | Filter for code-switching across registers
Consent & provenance | 100% written, every speaker | Per-file SHA-256 | Named consent contact for life
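
As a sketch of how the filterable fields above might be used, assuming a JSON manifest with per-recording rollups (field names like overlap_rate, registers, and num_speakers are hypothetical, not the shipped schema):

```python
import json
from pathlib import Path

def select_recordings(manifest_path, min_overlap=0.10, register="technical"):
    """Filter a hypothetical manifest for overlap-dense recordings in a given register."""
    entries = json.loads(Path(manifest_path).read_text())
    return [
        e for e in entries
        if e["overlap_rate"] >= min_overlap   # spec says 5-18% is typical
        and register in e["registers"]        # per-segment tags rolled up per recording
        and 2 <= e["num_speakers"] <= 6
    ]

# Example: a crosstalk-heavy technical shard.
# shard = select_recordings("manifest.json", min_overlap=0.12)
```
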
§ 03 — What real conversation looks like

The things scripted data never captures.

Turn-taking

Real conversational floor management — gaps, latching, smooth handoffs, and the rare clean turn.

Overlap

5–18% of speech overlaps. Your full-duplex agent has to handle it. Ours is tagged by region.

Disfluencies

“So, um, the — the thing is” is how humans talk. We keep every um and uh, tagged.

Backchannels

Mhm. Yeah. Right. Wow. The acknowledgments that keep dialogue alive. Tagged inline.

Topic shifts

Annotated segment boundaries so dialog-state models learn how humans actually pivot subjects.

Register shifts

Casual to technical and back inside the same conversation. The thing audio LLMs are weakest at.

§ 04 — How engagement works

From email to first manifest.

01

Sample request

Tell us your model and target overlap rate. We return a 30-minute sample with full disfluency and overlap tagging within 48 hours.

02

Mutual NDA

Standard one-page mutual.

03

MSA + data licence

Perpetual commercial training licence, named consent contact for life, written speaker release on every voice.

04

First delivery

Pilot shard with 30–120 min recordings, RTTM, word-level JSON, disfluency and overlap tags, register labels.

05

Manifest & provenance

Per-file lineage: speakers, recording date, mic chains, room, consent version, jurisdiction, and a SHA-256 you can verify yourself (see the sketch after these steps).

06

Ongoing delivery

Monthly increments, locale expansion, custom overlap targets, written revocation SLA.
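
A minimal sketch of the checksum verification mentioned in step 05, assuming the manifest is JSON with a files array carrying path and sha256 per entry (those names are illustrative):

```python
import hashlib
import json
from pathlib import Path

def verify_shard(manifest_path: str) -> list[str]:
    """Recompute each file's SHA-256 and return any paths that mismatch the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    mismatched = []
    for entry in manifest["files"]:  # field names are illustrative
        # Fine for a sketch; stream in chunks for very large WAVs.
        digest = hashlib.sha256(Path(entry["path"]).read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            mismatched.append(entry["path"])
    return mismatched

# A clean delivery should verify end to end:
# assert not verify_shard("delivery_001/manifest.json")
```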

§ 05 — FAQ

Common questions.

What is a conversational AI dataset?

A dataset of natural multi-speaker dialogue used to train models that understand or generate conversation — dialog systems, audio-in/audio-out LLMs, meeting summarisers, and full-duplex voice agents.

Are interruptions, overlap, and backchannels labelled?

Yes. Diarization captures overlap regions; backchannels (mhm, yeah, right) are tagged; interruptions are marked at the turn boundary. Overlap is preserved, not edited out.
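
RTTM is a standard plain-text diarization format (speaker turns carry onset and duration in the fourth and fifth columns), so overlap regions can also be recomputed independently of the shipped tags. A minimal sweep-line sketch:

```python
def overlap_regions(rttm_path):
    """Derive overlapped-speech intervals (2+ active speakers) from RTTM turns."""
    events = []  # (time, +1) at each turn onset, (time, -1) at each turn end
    with open(rttm_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == "SPEAKER":
                onset, dur = float(parts[3]), float(parts[4])
                events.append((onset, +1))
                events.append((onset + dur, -1))
    events.sort()
    regions, active, start = [], 0, None
    for t, delta in events:
        active += delta
        if active >= 2 and start is None:
            start = t                   # overlap begins: a second speaker joins
        elif active < 2 and start is not None:
            regions.append((start, t))  # overlap ends
            start = None
    return regions
```

Comparing a recomputation like this against the shipped region tags is a quick sanity check on any delivery.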

Are disfluencies preserved?

Yes. Ums, uhs, false starts, repetitions, self-corrections, and laughter are preserved in the transcript and tagged. Most datasets silently clean these out — we do not, because audio LLMs need them.

How long are recordings?

Typical recordings run 30–120 minutes of continuous dialogue. Some go to 180+. Long-context conversation is the gap most datasets leave open.

How many speakers per recording?

Two to six speakers per recording. Two-person interviews dominate; panels and round-tables are available for crosstalk-heavy training.

Is this suitable for audio-in/audio-out LLMs?

Yes. This is the cleanest source of long-form, full-duplex, naturalistic dialogue with consent — exactly what GPT-4o-class audio LLMs and Moshi-style full-duplex stacks need.

Can I filter by topic, register, or speaker role?

Yes. Recordings are tagged by domain (tech, health, finance, sports, politics, culture, science), register (casual, formal, technical), and per-speaker role (host, guest, expert, caller).

Is the dataset suitable for evaluation?

Yes. Held-out benchmark slices with human-reviewed transcripts are available, balanced for speaker diversity and overlap rate.

Can I get a sample?

Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, alignment, diarization, and disfluency tags within 48 hours of NDA.

Want a representative sample?

30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.