SPEECH SYNTHESIS · TTS TRAINING DATA

Studio-grade source audio for the next generation of TTS.

TTS quality is bottlenecked by the source. We license consistent, professionally recorded audio with full prosodic range, multi-track per-speaker mixdowns, and consent terms that explicitly permit training generative speech models, including for commercial deployment.

48 kHz / 24-bit WAV · Multi-track per speaker · Consistent room tone · Full prosodic range

Why podcast audio is the right shape

What makes audio good for TTS training.

Read-aloud datasets are predictable but flat. YouTube-scraped audio is varied but inconsistent. Our podcast-sourced audio is varied and consistent, because professional podcasters record in the same room with the same setup over hundreds of hours.

Consistent room tone

Same studio, same mic, same gain staging across hundreds of hours per speaker. The unglamorous stuff that determines TTS quality.

Prosodic variety within a speaker

Long-form interviews naturally produce a wide range of intonation, emphasis, and emotional register — something a script-read corpus can't match.

Multi-track separation

Per-speaker WAV channels where the original recording supports it. Makes single-speaker fine-tuning trivial.
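
As a rough illustration of how per-speaker channels might be pulled apart for single-speaker fine-tuning, here is a minimal sketch using only Python's standard `wave` module. It assumes a delivery where each speaker occupies one channel of a single interleaved PCM WAV file; that layout, the function name `split_channels`, and the output naming are illustrative assumptions, not a description of the actual delivery format. Some 24-bit files use the WAVE_FORMAT_EXTENSIBLE header, which the `wave` module only reads from Python 3.12 onward.

```python
import wave


def split_channels(src_path, out_prefix):
    """Write one mono WAV per channel of src_path and return the output paths.

    Hypothetical helper: assumes one speaker per channel (an assumption).
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()  # bytes per sample; 3 for 24-bit
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    frame_size = n_channels * sample_width
    paths = []
    for ch in range(n_channels):
        # Collect this channel's bytes from every interleaved frame.
        mono = bytearray()
        for offset in range(ch * sample_width, len(frames), frame_size):
            mono += frames[offset:offset + sample_width]

        path = f"{out_prefix}_ch{ch}.wav"
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(sample_width)
            dst.setframerate(frame_rate)
            dst.writeframes(bytes(mono))
        paths.append(path)
    return paths
```

Calling `split_channels("episode.wav", "episode")` on a two-channel file would write `episode_ch0.wav` and `episode_ch1.wav`. The samples are copied byte for byte, so bit depth and sample rate are preserved.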

Full breath & silence

Real human silences and breath patterns, left intact rather than edited out for podcast distribution.

No audible processing artifacts

Sourced from raw or lightly processed masters, not from compressed distribution copies.

Verified speaker metadata

Age range, gender, first language (L1), accent, and mic model for each speaker, not just “Speaker_034.”
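
To make the metadata fields concrete, here is a hedged sketch of what a per-speaker record could look like, plus a check that flags incomplete records before training. The class name, field names, and helper are hypothetical and chosen for illustration; the actual delivery schema is not specified here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SpeakerMetadata:
    """Hypothetical per-speaker record; field names are assumptions."""
    speaker_id: str
    age_range: str
    gender: str
    l1: str         # first language
    accent: str
    mic_model: str


def missing_fields(record: dict) -> list:
    """Return the names of expected fields that are absent or empty."""
    return [f.name for f in fields(SpeakerMetadata) if not record.get(f.name)]
```

A record that carries only an ID, such as `{"speaker_id": "Speaker_034"}`, would be reported as missing every other field, which is the case the description above is contrasting against.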

Specs

What we deliver for TTS.

Audio

Format: 48 kHz / 24-bit WAV
Tracks: Multi-track per speaker available

Speakers per dataset

Single: Single-speaker corpora for voice modeling
Multi: Multi-speaker corpora for general TTS pretraining

Hours per speaker

Top contributors: Several hundred hours per speaker available
Typical: Tens of hours per speaker minimum

Prosodic variety

Source: Long-form interviews naturally span emotional register
Range: Suitable for expressive TTS training

Metadata

Per speaker: Age range, gender, L1, accent region
Recording: Mic model, environment, episode count

Languages

Available: English (US, UK, AU)
Coming: Spanish, French, German, Japanese, more
Custom: Any language with podcast infrastructure
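
The audio format in the specs above (48 kHz, 24-bit WAV) is easy to verify on receipt. The following is a minimal sketch using Python's standard `wave` module; the function name `check_format` and the constants are illustrative assumptions, and it only inspects the WAV header, not the audio content.

```python
import wave

EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 3  # 24 bits per sample


def check_format(path):
    """Return a list of header mismatches; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"bit depth is {wav.getsampwidth() * 8}-bit")
    return problems
```

Running this over every file in a delivery and collecting the non-empty results gives a quick list of anything that was resampled or truncated along the way.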

Voice cloning is a different product

Read this carefully.

Training a TTS model that produces a synthetic voice substantially similar to an identifiable speaker is voice cloning, a separate product line with separate consent terms and elevated pricing. We will not let you do this under a standard TTS license.

Standard TTS license permits

  • Multi-speaker neural TTS pretraining
  • Expressive TTS training
  • Cross-lingual TTS training
  • Per-speaker fine-tuning of brand voices the customer already owns
  • Evaluation set construction

Standard TTS license does NOT permit

  • Producing a synthetic voice substantially similar to a catalog speaker
  • Commercial use of any cloned voice
  • Bypassing watermarking requirements
  • Right-of-publicity-adjacent use cases

See voice cloning →

Want a representative TTS sample?

A TTS-ready sample with multi-track separation, full prosodic variety, and verified speaker metadata, delivered within 48 hours of a signed NDA.