How to Build a Custom Voice AI Dataset From Scratch
A practical walkthrough of designing, recording, and validating a custom voice AI dataset for training or fine-tuning your own models.
Define the model first, then the data
The biggest mistake teams make is collecting audio before they know what the model needs to do. Start by writing a one-page spec that names the target task, the deployment environment, the languages and accents you must support, the vocabulary scope, and the minimum performance bar you need to clear. Everything that follows, from microphone choice to script design to speaker recruitment, should fall out of that spec. A TTS dataset has different needs than an ASR dataset, and a wake-word dataset needs different audio than a meeting-transcription dataset. The spec keeps you honest.
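That one-page spec can also live next to the collection code as a machine-readable config, so scripts can check it instead of relying on memory. A minimal sketch — every field name and value here is illustrative, not a standard:

```python
# Minimal dataset spec kept alongside the collection pipeline.
# Field names and example values are illustrative, not a standard.
DATASET_SPEC = {
    "task": "asr",                      # e.g. asr | tts | wake_word
    "deployment": "mobile, far-field",  # where the model will run
    "languages": ["en-US", "en-IN"],    # languages and accents to cover
    "vocabulary_scope": "medical dictation, ~12k domain terms",
    "min_performance": {"wer": 0.12},   # bar to clear on the held-out set
}

def check_spec(spec: dict) -> list[str]:
    """Return the required fields that are missing or empty."""
    required = ["task", "deployment", "languages",
                "vocabulary_scope", "min_performance"]
    return [field for field in required if not spec.get(field)]
```

Running `check_spec` in CI keeps a half-filled spec from silently driving a collection round.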
Design the script or prompt set
If you are building a read-aloud corpus, the script matters more than the speakers. It needs to balance phonetic coverage so every relevant phoneme appears in enough contexts, vocabulary breadth so the model sees enough words, and prosody variety so it learns the rhythm of natural speech. Public sentence sets like the Harvard sentences or the CMU Arctic prompts are reasonable starting points but rarely sufficient. For a domain-specific corpus, generate prompts that exercise the actual vocabulary your model will see in production. For a conversational corpus, prepare topic prompts rather than scripts, and let the speakers actually talk.
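Coverage-driven prompt selection can be automated as a greedy set cover: repeatedly pick the candidate sentence that adds the most unseen units. A real pipeline would run a grapheme-to-phoneme tool and cover phonemes; this dependency-free sketch uses character bigrams as a stand-in:

```python
# Greedy prompt selection for coverage. A production pipeline would cover
# phonemes via a grapheme-to-phoneme tool; character bigrams stand in here
# so the sketch stays dependency-free.
def bigrams(text: str) -> set[str]:
    t = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    return {t[i:i + 2] for i in range(len(t) - 1)}

def select_prompts(candidates: list[str], budget: int) -> list[str]:
    """Pick up to `budget` prompts, greedily maximizing new bigram coverage."""
    covered: set[str] = set()
    chosen: list[str] = []
    pool = list(candidates)
    for _ in range(budget):
        best = max(pool, key=lambda s: len(bigrams(s) - covered), default=None)
        if best is None or not (bigrams(best) - covered):
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= bigrams(best)
        pool.remove(best)
    return chosen
```

The same loop works unchanged once `bigrams` is swapped for a phoneme extractor, and it doubles as a coverage report: the units still missing after selection are the ones your script needs more of.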

Recruit speakers with intention
Speaker selection is where most custom datasets quietly fail. You need enough distinct voices to prevent overfitting and enough demographic and accent variety to match your target population. Treat recruiting like hiring: write a clear ad, screen for the criteria you actually need, pay fairly, and keep records of consent. Always get a written agreement that includes AI training as a permitted use. If you cut corners on consent, you will pay for it later and possibly publicly.

Set up the recording environment
For studio-grade recordings, you do not need a commercial studio, but you do need a quiet room, a decent condenser microphone with a pop filter, and a consistent setup so every session sounds the same. Use 48 kHz, 24-bit recording, monitor levels during the session, and keep a slate at the start of every clip with the speaker ID and prompt ID. For in-the-wild data, capture with the same kinds of devices your end users will use, in the kinds of environments where they will use them. Document everything: device, software, location type, and any notable conditions.
Validate, label, and iterate
Recording is only half the work. The other half is segmenting, transcribing, validating, and packaging. Plan a quality assurance pass on at least a representative sample, ideally every clip. Track word error rate against your transcripts, audio quality metrics like SNR, and speaker coverage statistics. Hold back at least 5 percent of speakers as an evaluation set so you can measure generalization. Then train a baseline, look at where the model fails, and feed those failures back into the next round of data collection. A good custom dataset is built in passes, not in one heroic week.
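Word error rate is just word-level edit distance divided by reference length, so it is worth knowing what your tooling computes. A self-contained sketch (production pipelines typically also normalize case and punctuation before scoring, which is omitted here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis inserts more words than the reference contains, which is itself a useful signal about a misbehaving segmenter.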

Frequently asked questions
How many hours do I need for a custom TTS voice?
Modern neural TTS can produce a usable voice from 30 to 60 minutes of clean studio audio per speaker. A polished, expressive voice usually wants 5 to 20 hours.
How many hours for a custom ASR domain?
For fine-tuning an existing strong ASR model on a new domain, 20 to 100 hours often produces measurable gains. Training from scratch is a different and much larger undertaking.
Can I mix speakers across recording environments?
Yes, and you should, as long as you track the environment in metadata so you can analyze any biases that show up in evaluation.
Should I use professional voice actors?
For TTS, often yes. For ASR meant to handle real users, no. You want the speakers to sound like the people the model will eventually serve.
How do I handle consent for minors or sensitive populations?
With extra care. Get parental or guardian consent in writing, document the purpose, and consider whether you really need that population in the dataset at all.
Looking to license speech data?
Studio-grade conversational audio with aligned transcripts, full speaker metadata, and a documented chain of consent for every file. Get a sample within 48 hours of a signed NDA.
Request a sample →