SOLUTIONS

Conversational AI data that sounds like people, not scripts.

Conversational AI needs more than tidy turns. It needs interruptions, overlap, backchannels, topic drift, register shifts, and disfluencies — the things most datasets silently scrub out. Our catalogue is built from working podcasts where all of that happens naturally, with verified consent from every speaker.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified consent · Commercial training rights
350+
Hours of multi-speaker dialogue
2–6
Speakers per recording
30–120 min
Continuous turn context
Preserved
Disfluencies, overlap, backchannels
§ 01 — What you get

Built for the work.

Multi-speaker dialogue

Real interviews, panels, and round-tables with two to six speakers per recording.

Natural turn-taking

Backchannels, hedges, interruptions, and overlap captured in context with timestamped boundaries.

Disfluencies preserved

Ums, uhs, false starts, repetitions, self-corrections, and laughter — kept in the transcript and tagged. Audio LLMs need them.

Speaker metadata

Per-speaker language, dialect, age range, and role (host, guest, expert, caller).

Long sessions

30–120 minute continuous conversations — not isolated turns. Context windows your model can actually use.

Aligned transcripts

Word-level alignment, RTTM diarization, overlap regions, and backchannel tags included by default.
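
To make the alignment deliverable concrete, here is a minimal sketch of consuming a word-level record with backchannel and disfluency tags. The field names (word, start, speaker, tags) and the tag vocabulary are illustrative assumptions, not the shipped schema:

```python
import json

# Hypothetical word-level records; real field names and tags may differ.
sample = json.loads("""
[
  {"word": "so",    "start": 12.41, "end": 12.55, "speaker": "spk01", "tags": []},
  {"word": "um",    "start": 12.60, "end": 12.84, "speaker": "spk01", "tags": ["disfluency:filler"]},
  {"word": "the",   "start": 12.90, "end": 13.01, "speaker": "spk01", "tags": ["disfluency:false_start"]},
  {"word": "mhm",   "start": 13.02, "end": 13.20, "speaker": "spk02", "tags": ["backchannel", "overlap"]},
  {"word": "thing", "start": 13.05, "end": 13.34, "speaker": "spk01", "tags": []}
]
""")

# Pull out every backchannel with its speaker and timestamp.
for w in sample:
    if "backchannel" in w["tags"]:
        print(f'{w["speaker"]} backchannels "{w["word"]}" at {w["start"]:.2f}s')
```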

§ 02 — Conversation spec

What's in the manifest.

Field | Coverage | Format | Notes
Speakers per recording | 2–6 | RTTM + JSON | Stable IDs across episodes
Recording length | 30–120 min (some 180+) | 48 kHz / 24-bit WAV | Continuous, not chunked
Overlap rate | 5–18% typical | Region tags | Filterable by overlap density
Backchannels | Tagged inline | Word-level JSON | mhm, yeah, right, wow
Disfluencies | Preserved + tagged | Word-level JSON | um, uh, false start, repair
Topic shifts | Annotated | Segment-level | Useful for dialog state tracking
Register | Casual / formal / technical | Per-segment tag | Filter for code-switching across registers
Consent & provenance | 100% written, every speaker | Per-file SHA-256 | Named consent contact for life
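
As a sketch of how the filterable fields above might be used, assuming a JSON manifest with per-recording rollups (field names like overlap_rate, registers, and num_speakers are hypothetical, not the shipped schema):

```python
import json
from pathlib import Path

def select_recordings(manifest_path, min_overlap=0.10, register="technical"):
    """Filter a hypothetical manifest for overlap-dense recordings in a given register."""
    entries = json.loads(Path(manifest_path).read_text())
    return [
        e for e in entries
        if e["overlap_rate"] >= min_overlap   # spec says 5-18% is typical
        and register in e["registers"]        # per-segment tags rolled up per recording
        and 2 <= e["num_speakers"] <= 6
    ]

# Example: a crosstalk-heavy technical shard.
# shard = select_recordings("manifest.json", min_overlap=0.12)
```
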
§ 03 — What real conversation looks like

The things scripted data never captures.

Turn-taking

Real conversational floor management — gaps, latching, smooth handoffs, and the rare clean turn.

Overlap

5–18% of speech overlaps. Your full-duplex agent has to handle it. Ours is tagged by region.

Disfluencies

“So, um, the — the thing is” is how humans talk. We keep every um and uh, tagged.

Backchannels

Mhm. Yeah. Right. Wow. The acknowledgments that keep dialogue alive. Tagged inline.

Topic shifts

Annotated segment boundaries so dialog-state models learn how humans actually pivot subjects.

Register shifts

Casual to technical and back inside the same conversation. The thing audio LLMs are weakest at.

§ 04 — How engagement works

From email to first manifest.

01

Sample request

Tell us your model and target overlap rate. We return a 30-minute sample with full disfluency and overlap tagging within 48 hours.

02

Mutual NDA

Standard one-page mutual.

03

MSA + data licence

Perpetual commercial training licence, named consent contact for life, written speaker release on every voice.

04

First delivery

Pilot shard with 30–120 min recordings, RTTM, word-level JSON, disfluency and overlap tags, register labels.

05

Manifest & provenance

Per-file lineage: speakers, recording date, mic chains, room, consent version, jurisdiction, and a SHA-256 you can verify yourself (see the sketch after these steps).

06

Ongoing delivery

Monthly increments, locale expansion, custom overlap targets, written revocation SLA.
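
A minimal sketch of the checksum verification mentioned in step 05, assuming the manifest is JSON with a files array carrying path and sha256 per entry (those names are illustrative):

```python
import hashlib
import json
from pathlib import Path

def verify_shard(manifest_path: str) -> list[str]:
    """Recompute each file's SHA-256 and return any paths that mismatch the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    mismatched = []
    for entry in manifest["files"]:  # field names are illustrative
        # Fine for a sketch; stream in chunks for very large WAVs.
        digest = hashlib.sha256(Path(entry["path"]).read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            mismatched.append(entry["path"])
    return mismatched

# A clean delivery should verify end to end:
# assert not verify_shard("delivery_001/manifest.json")
```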

§ 05 — FAQ

Common questions.

What is a conversational AI dataset?

A dataset of natural multi-speaker dialogue used to train models that understand or generate conversation — dialog systems, audio-in/audio-out LLMs, meeting summarisers, and full-duplex voice agents.

Are interruptions, overlap, and backchannels labelled?

Yes. Diarization captures overlap regions; backchannels (mhm, yeah, right) are tagged; interruptions are marked at the turn boundary. Overlap is preserved, not edited out.
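
RTTM is a standard plain-text diarization format (speaker turns carry onset and duration in the fourth and fifth columns), so overlap regions can also be recomputed independently of the shipped tags. A minimal sweep-line sketch:

```python
def overlap_regions(rttm_path):
    """Derive overlapped-speech intervals (2+ active speakers) from RTTM turns."""
    events = []  # (time, +1) at each turn onset, (time, -1) at each turn end
    with open(rttm_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == "SPEAKER":
                onset, dur = float(parts[3]), float(parts[4])
                events.append((onset, +1))
                events.append((onset + dur, -1))
    events.sort()
    regions, active, start = [], 0, None
    for t, delta in events:
        active += delta
        if active >= 2 and start is None:
            start = t                   # overlap begins: a second speaker joins
        elif active < 2 and start is not None:
            regions.append((start, t))  # overlap ends
            start = None
    return regions
```

Comparing a recomputation like this against the shipped region tags is a quick sanity check on any delivery.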

Are disfluencies preserved?

Yes. Ums, uhs, false starts, repetitions, self-corrections, and laughter are preserved in the transcript and tagged. Most datasets silently clean these out — we do not, because audio LLMs need them.

How long are recordings?

Typical recordings run 30–120 minutes of continuous dialogue. Some go to 180+. Long-context conversation is the gap most datasets leave open.

How many speakers per recording?

Two to six speakers per recording. Two-person interviews dominate; panels and round-tables are available for crosstalk-heavy training.

Is this suitable for audio-in/audio-out LLMs?

Yes. This is the cleanest source of long-form, full-duplex, naturalistic dialogue with consent — exactly what GPT-4o-class audio LLMs and Moshi-style full-duplex stacks need.

Can I filter by topic, register, or speaker role?

Yes. Recordings are tagged by domain (tech, health, finance, sports, politics, culture, science), register (casual, formal, technical), and per-speaker role (host, guest, expert, caller).

Is the dataset suitable for evaluation?

Yes. Held-out benchmark slices with human-reviewed transcripts are available, balanced for speaker diversity and overlap rate.

Can I get a sample?

Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, alignment, diarization, and disfluency tags within 48 hours of NDA.

Want a representative sample?

30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.