Studio-grade source audio for the next generation of TTS.
TTS quality is bottlenecked by its source audio. We license consistent, professionally recorded audio with full prosodic range, per-speaker multi-track recordings, and consent terms that explicitly permit training generative speech models, including for commercial deployment.
What makes audio good for TTS training.
Read-aloud datasets are predictable but flat. YouTube-scraped audio is varied but inconsistent. Our podcast-sourced audio is varied and consistent, because professional podcasters record in the same room with the same setup over hundreds of hours.
Consistent room tone
Same studio, same mic, same gain staging across hundreds of hours per speaker. The unglamorous stuff that determines TTS quality.
Prosodic variety within a speaker
Long-form interviews naturally produce a wide range of intonation, emphasis, and emotional register — something a script-read corpus can't match.
Multi-track separation
Per-speaker WAV channels where the original recording supports it. Makes single-speaker fine-tuning trivial.
Full breath & silence
Real human silences and breath patterns, left in rather than edited out for podcast distribution.
No audible processing artifacts
Sourced from raw or lightly-processed masters, not from compressed distribution copies.
Verified speaker metadata
Per-speaker age range, gender, native language (L1), accent, and mic model, not just “Speaker_034.”
What we deliver for TTS.
Audio
- Format: 48 kHz / 24-bit WAV
- Tracks: Multi-track per speaker available
Speakers per dataset
- Single: Single-speaker corpora for voice modeling
- Multi: Multi-speaker corpora for general TTS pretraining
Hours per speaker
- Top contributors: Several hundred hours per speaker available
- Typical: Tens of hours per speaker minimum
Prosodic variety
- Source: Long-form interviews naturally span emotional register
- Range: Suitable for expressive TTS training
Metadata
- Per speaker: Age range, gender, L1, accent region
- Recording: Mic model, environment, episode count
Languages
- Available: English (US, UK, AU)
- Coming: Spanish, French, German, Japanese, more
- Custom: Any language with podcast infrastructure
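For teams that want to sanity-check a delivery against the audio spec above, here is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_format` and the constants are hypothetical, chosen for illustration; they are not part of any delivery tooling. It assumes plain PCM WAV files; files written with WAVE_FORMAT_EXTENSIBLE headers may need Python 3.12 or later to open.

```python
import wave

# Values taken from the delivery spec: 48 kHz / 24-bit WAV.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 3  # 24 bits = 3 bytes per sample


def check_wav_format(path):
    """Return a list of mismatches against the spec; an empty list means a match.

    Hypothetical helper for illustration only.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate != EXPECTED_RATE_HZ:
            problems.append(
                f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz"
            )
        if width != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(
                f"bit depth is {width * 8}-bit, expected 24-bit"
            )
    return problems
```

Calling `check_wav_format("speaker_track.wav")` on each delivered file (the filename here is a placeholder) and flagging any non-empty result is usually enough to catch a resampled or down-converted copy before it reaches a training pipeline.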
Read this carefully.
Training a TTS model that produces a synthetic voice substantially similar to an identifiable speaker is voice cloning — a separate product line with separate consent and elevated pricing. We will not let you do this under a standard TTS license.
Standard TTS license permits
- Multi-speaker neural TTS pretraining
- Expressive TTS training
- Cross-lingual TTS training
- Per-speaker fine-tuning of brand voices the customer already owns
- Evaluation set construction
Standard TTS license does NOT permit
- Producing a synthetic voice substantially similar to a catalog speaker
- Commercial use of any cloned voice
- Bypassing watermarking requirements
- Right-of-publicity-adjacent use cases
Want a representative TTS sample?
A TTS-ready sample with multi-track separation, full prosodic variety, and verified speaker metadata, delivered within 48 hours of a signed NDA.