SPEECH SYNTHESIS · TTS TRAINING DATA

Studio-grade source audio for the next generation of TTS.

TTS quality is bottlenecked by the source. We license consistent, professionally recorded audio with full prosodic range, multi-track per-speaker mixdowns, and consent terms that explicitly permit training generative speech models, including for commercial deployment.

Request a TTS sample → · See pricing

48 kHz / 24-bit WAV · Multi-track per speaker · Consistent room tone · Full prosodic range

Why podcast audio is the right shape

What makes audio good for TTS training.

Read-aloud datasets are predictable but flat. YouTube-scraped audio is varied but inconsistent. Our podcast-sourced audio is varied and consistent, because professional podcasters record in the same room with the same setup over hundreds of hours.

Consistent room tone

Same studio, same mic, same gain staging across hundreds of hours per speaker. The unglamorous stuff that determines TTS quality.

Prosodic variety within a speaker

Long-form interviews naturally produce a wide range of intonation, emphasis, and emotional register — something a script-read corpus can't match.

Multi-track separation

Per-speaker WAV channels where the original recording supports it. Makes single-speaker fine-tuning trivial.

Full breath & silence

Real human silences and breath patterns, left in rather than edited out for podcast distribution.

No audible processing artifacts

Sourced from raw or lightly processed masters, not from compressed distribution copies.

Verified speaker metadata

Per-speaker age range, gender, first language (L1), accent, and mic model, not just “Speaker_034.”

Specs

What we deliver for TTS.

Audio

Format: 48 kHz / 24-bit WAV
Tracks: Multi-track per speaker available
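
As a quick sanity check on delivered files, the format above can be confirmed from the WAV header. The following is a minimal sketch, assuming plain PCM WAV files readable by Python's standard `wave` module; the function name `check_wav_spec` and the constants are illustrative, not part of any delivery tooling. Note that some Python versions reject files that use the WAVE_FORMAT_EXTENSIBLE header, which is common for 24-bit audio, so a `wave.Error` here does not necessarily mean the file is out of spec.

```python
import wave

# Values taken from the spec above: 48 kHz sample rate, 24-bit (3-byte) samples.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 3


def check_wav_spec(path):
    """Return a list of mismatches; an empty list means the header matches the spec."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate != EXPECTED_RATE_HZ:
            problems.append(
                f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz"
            )
        if width != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(
                f"sample width is {width * 8}-bit, "
                f"expected {EXPECTED_SAMPLE_WIDTH_BYTES * 8}-bit"
            )
    return problems
```

Calling `check_wav_spec("speaker_track.wav")` (a hypothetical file name) returns an empty list when both the sample rate and bit depth match. This only inspects the header; it says nothing about room tone, processing artifacts, or other qualities of the audio itself.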

Speakers per dataset

Single: Single-speaker corpora for voice modeling
Multi: Multi-speaker corpora for general TTS pretraining

Hours per speaker

Top contributors: Several hundred hours per speaker available
Typical: Tens of hours per speaker minimum

Prosodic variety

Source: Long-form interviews naturally span emotional register
Range: Suitable for expressive TTS training

Metadata

Per speaker: Age range, gender, L1, accent region
Recording: Mic model, environment, episode count

Languages

Available: English (US, UK, AU)
Coming: Spanish, French, German, Japanese, and more
Custom: Any language with podcast infrastructure

Voice cloning is a different product

Read this carefully.

Training a TTS model that produces a synthetic voice substantially similar to an identifiable speaker is voice cloning — a separate product line with separate consent and elevated pricing. We will not let you do this under a standard TTS license.

Standard TTS license permits

Multi-speaker neural TTS pretraining
Expressive TTS training
Cross-lingual TTS training
Per-speaker fine-tuning of brand voices the customer already owns
Evaluation set construction

Standard TTS license does NOT permit

Producing a synthetic voice substantially similar to a catalog speaker
Commercial use of any cloned voice
Bypassing watermarking requirements
Right-of-publicity-adjacent use cases

Want a representative TTS sample?

A TTS-ready sample with multi-track separation, full prosodic variety, and verified speaker metadata — delivered after a quick scoping call.

Request a sample → or email jaeden@fiund.com