You can create a voice in two ways:
Create a voice from a description
Upload or record your voice
The method you choose depends on whether you want to generate a new voice style or reproduce a specific speaker.
Create a voice from a description
Select Create a voice from a description to generate a voice based on a short written description of the speaker.
This method is useful for quickly exploring and testing different voice styles. The system generates several sample voices that you can preview before creating the voice.
Because the voice is generated from text, the quality and accuracy of the result depend on how clearly the description defines the speaker.
If you need a voice based on a specific person or want a highly consistent voice across projects, consider creating a voice using Upload Audio or Record Audio instead.
Writing effective voice descriptions
The description determines how the generated voice will sound. It defines the speaker’s characteristics such as language, accent, personality, and speaking style.
Clear descriptions usually produce better results. Including details such as gender, age range, tone, pacing, or emotional style helps the system generate a voice that matches your intent.
Short descriptions can also work well when you need a neutral or general-purpose voice. For example, a simple prompt like “confident female training instructor” may already produce a suitable result.
Choose the level of detail based on your goal. A distinctive character voice may require more description, while a standard narrator may only need a few key attributes.
How to structure a voice description
For more predictable results, it helps to include a few key elements in your description.
Start by specifying the language and regional accent of the speaker. Then define the gender and approximate age range.
Next, describe the persona or role of the speaker, along with a few words that capture the emotional tone of the voice.
Finally, add one or two short sentences explaining how the voice should sound and be delivered. You can describe the tone, pacing, clarity, or speaking style.
Example:
A native Spanish speaker with a neutral Latin American accent. Female, around 35–45.
Persona: corporate trainer. Tone: confident, supportive, professional.
Warm and clear voice with steady pacing and precise articulation. Speaks in an instructional style that emphasizes key points while remaining approachable.
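If you write many descriptions, the structure above can be captured in a small template so that each element is always filled in and appears in the same order. The following Python sketch is illustrative only: the VoiceDescription class, its field names, and the to_prompt method are hypothetical helpers and not part of the product. It simply assembles the text that you would paste into the description field.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceDescription:
    """Hypothetical container for the elements of a voice description."""

    language: str
    accent: str
    gender: str
    age_range: str
    persona: str
    tone: List[str]
    delivery: str

    def to_prompt(self) -> str:
        # Order follows the structure described above: language and accent,
        # then gender and age range, then persona and tone, then delivery.
        return (
            f"A native {self.language} speaker with a {self.accent} accent. "
            f"{self.gender}, around {self.age_range}.\n"
            f"Persona: {self.persona}. Tone: {', '.join(self.tone)}.\n"
            f"{self.delivery}"
        )


# Rebuilds the example description shown above from its parts.
example = VoiceDescription(
    language="Spanish",
    accent="neutral Latin American",
    gender="Female",
    age_range="35–45",
    persona="corporate trainer",
    tone=["confident", "supportive", "professional"],
    delivery=(
        "Warm and clear voice with steady pacing and precise articulation. "
        "Speaks in an instructional style that emphasizes key points "
        "while remaining approachable."
    ),
)
print(example.to_prompt())
```

Keeping the elements as separate fields makes it easy to change one attribute, such as the accent or tone, and compare the generated samples.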
Description tips
Avoid describing audio effects or recording artifacts, such as reverb, echo, or a phone or tape sound
Clearly specify the language and regional accent
Focus on describing the speaker and delivery style, rather than technical audio terms
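As a rough aid for the first tip, a description can be scanned for audio-effect words before it is submitted. The sketch below is a minimal, assumed approach: the find_effect_terms function is hypothetical, and the word list contains only the four terms named in the tip. A simple word match like this can produce false positives (for example, "phone" in "phone support agent"), so treat the result as a prompt to reread the description, not as a rule.

```python
import re
from typing import List

# Terms taken from the tip above; extend the list to suit your own needs.
EFFECT_TERMS = ("reverb", "echo", "phone", "tape")


def find_effect_terms(description: str) -> List[str]:
    """Return the effect terms that appear as whole words in the description."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return [term for term in EFFECT_TERMS if term in words]


flagged = find_effect_terms("Warm voice with light reverb, like a phone call")
if flagged:
    print("Consider removing:", ", ".join(flagged))
```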
Upload or record your voice
Use this method to reproduce a specific speaker from a short audio sample. Instant voice cloning is powerful, but the quality of the result depends heavily on the quality of the input.
A clean, natural recording will produce a voice that sounds authentic and stable. A poor recording will limit the outcome, no matter how advanced the model is.
Minimum requirements
To achieve reliable results, make sure the source audio meets the following criteria:
30 to 90 seconds of clean speech
No background noise
No music
No reverb or echo
Natural tone of voice, not shouting or whispering
The goal is to capture how the speaker normally sounds in everyday conversation.
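Of the criteria above, only the length can be checked reliably without listening to the recording. The following Python sketch shows one way to verify it before uploading, assuming the source audio is an uncompressed WAV file. The function names and the file name are hypothetical, and the check says nothing about noise, music, or reverb, which still need to be judged by ear.

```python
import wave

# Duration limits taken from the minimum requirements above.
MIN_SECONDS = 30.0
MAX_SECONDS = 90.0


def wav_duration_seconds(path: str) -> float:
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def duration_in_range(seconds: float) -> bool:
    """Check whether a duration falls within the 30 to 90 second range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS


if __name__ == "__main__":
    # "my_voice_sample.wav" is a placeholder file name.
    try:
        length = wav_duration_seconds("my_voice_sample.wav")
    except FileNotFoundError:
        print("Sample file not found")
    else:
        status = "OK" if duration_in_range(length) else "outside 30-90 s"
        print(f"{length:.1f} s: {status}")
```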
Ideal recording setup
For best results, record in a controlled environment:
A quiet room with soft surfaces such as curtains, sofas, or carpets
A good USB microphone instead of a laptop’s built-in mic
Modern smartphones also have surprisingly good microphones and can produce excellent results when used properly.
Mobile recording
Recording on a mobile device is fine if:
The room is quiet
There is no echo or background noise
The speaker talks clearly and naturally
The recording is made in a voice memo app at the highest available quality
A short, clean recording in a quiet space will generally outperform a longer recording in a noisy environment. Prioritize clarity over length, and natural delivery over performance.