Voice creation

You can create a voice in two ways:

Create a voice from a description
Upload or record your voice

The method you choose depends on whether you want to generate a new voice style or reproduce a specific speaker.

Create a voice from a description

Select Create a voice from a description to generate a voice based on a short written description of the speaker.

This method is useful for quick exploration and testing different voice styles. The system generates several sample voices that you can preview before creating the voice.

Because the voice is generated from text, the quality and accuracy of the result depend on how clearly the description defines the speaker.

If you need a voice based on a specific person or want a highly consistent voice across projects, consider creating a voice using Upload Audio or Record Audio instead.

Writing effective voice descriptions

The description determines how the generated voice will sound. It defines the speaker’s characteristics, such as language, accent, personality, and speaking style.

Clear descriptions usually produce better results. Including details such as gender, age range, tone, pacing, or emotional style helps the system generate a voice that matches your intent.

Short descriptions can also work well when you need a neutral or general-purpose voice. For example, a simple prompt like “confident female training instructor” may already produce a suitable result.

Choose the level of detail based on your goal. A distinctive character voice may require more description, while a standard narrator may only need a few key attributes.

How to structure a voice description

For more predictable results, it helps to include a few key elements in your description.

Start by specifying the language and regional accent of the speaker. Then define the gender and approximate age range.

Next, describe the persona or role of the speaker, along with a few words that capture the emotional tone of the voice.

Finally, add one or two short sentences explaining how the voice should sound and be delivered. You can describe the tone, pacing, clarity, or speaking style.

Example:

A native Spanish speaker with a neutral Latin American accent. Female, around 35–45.

Persona: corporate trainer. Tone: confident, supportive, professional.

Warm and clear voice with steady pacing and precise articulation. Speaks in an instructional style that emphasizes key points while remaining approachable.
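If you write many descriptions, it can help to assemble them from the same fields each time so that no element is forgotten. The following is a minimal Python sketch of that idea. The VoiceDescription class and its field names are hypothetical, chosen only to mirror the structure above; they are not part of the product.

```python
from dataclasses import dataclass


@dataclass
class VoiceDescription:
    """Hypothetical container for the elements recommended above."""

    language: str
    accent: str
    gender: str
    age_range: str
    persona: str
    tone: list[str]
    delivery: str

    def render(self) -> str:
        # Order follows the guidance: language and accent, gender and age,
        # persona and tone, then one or two sentences on delivery.
        paragraphs = [
            f"A native {self.language} speaker with a {self.accent} accent. "
            f"{self.gender}, around {self.age_range}.",
            f"Persona: {self.persona}. Tone: {', '.join(self.tone)}.",
            self.delivery,
        ]
        return "\n\n".join(paragraphs)


trainer = VoiceDescription(
    language="Spanish",
    accent="neutral Latin American",
    gender="Female",
    age_range="35–45",
    persona="corporate trainer",
    tone=["confident", "supportive", "professional"],
    delivery="Warm and clear voice with steady pacing and precise articulation.",
)
print(trainer.render())
```

The rendered text is what you would paste into the description field. Keeping the fields separate makes it easy to vary one attribute, such as the accent, while leaving the rest unchanged.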

Description tips

Avoid describing audio effects such as reverb, echo, or a telephone or tape sound
Clearly specify the language and regional accent
Focus on describing the speaker and delivery style, rather than technical audio terms
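The first tip can be checked automatically before a description is submitted. The sketch below is one possible approach in Python, assuming a small, hand-maintained list of effect terms taken from the tip above. The find_effect_terms function and the EFFECT_TERMS list are illustrative names, not a product feature.

```python
import re

# Hypothetical term list based on the tips above; extend it as needed.
EFFECT_TERMS = ("reverb", "echo", "phone", "tape")


def find_effect_terms(description: str) -> list[str]:
    """Return the audio-effect terms that appear as whole words."""
    found = []
    for term in EFFECT_TERMS:
        # \b keeps "phone" from matching inside words such as "telephone".
        if re.search(rf"\b{re.escape(term)}\b", description, re.IGNORECASE):
            found.append(term)
    return found
```

An empty result means none of the listed terms were found. It does not guarantee that the description is free of other technical audio wording.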

Upload or record your voice

Uploading or recording audio creates the voice through instant voice cloning. This method is powerful, but the quality of the result depends heavily on the quality of the input.

A clean, natural recording will produce a voice that sounds authentic and stable. A poor recording will limit the outcome, no matter how advanced the model is.

Minimum requirements

To achieve reliable results, make sure the source audio meets the following criteria:

30 to 90 seconds of clean speech
No background noise
No music
No reverb or echo
Natural tone of voice, not shouting or whispering

The goal is to capture how the speaker normally sounds in everyday conversation.
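The length requirement is the one criterion that is easy to verify before uploading. The following Python sketch reads the duration of a WAV file with the standard library's wave module and compares it against the 30 to 90 second range. The function names are illustrative, and the check assumes an uncompressed WAV file; it says nothing about noise, music, or reverb, which still need to be judged by ear.

```python
import wave

MIN_SECONDS = 30.0
MAX_SECONDS = 90.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def duration_in_range(seconds: float) -> bool:
    """Check the duration against the recommended 30 to 90 second range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

For example, duration_in_range(wav_duration_seconds("sample.wav")) returns True for a 45-second recording, where sample.wav is a placeholder file name.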

Ideal recording setup

For best results, record in a controlled environment:

A quiet room with soft surfaces such as curtains, sofas, or carpets
A good USB microphone instead of a laptop’s built-in mic

Modern smartphones also have surprisingly good microphones and can produce excellent results when used properly.

Mobile recording

Recording on a mobile device is fine if:

The room is quiet
There is no echo or background noise
The speaker talks clearly and naturally
The recording is made in a voice memo app at the highest available quality

A short, clean recording in a quiet space generally outperforms a longer recording in a noisy environment. Prioritize clarity over length, and natural delivery over performance.

Using custom voices across languages

Any custom voice can speak any of the 32 supported languages. Provide text in the target language and the voice will speak it.

Important: The accent comes from the voice itself, not the text.

Voice Source	Result When Speaking Other Languages
Created from English recording	Speaks German with an English accent
Created from German recording	Speaks English with a German accent
Generated from description (e.g., "Native Spanish, European")	Carries that accent into other languages

Native-sounding multilingual content

Choose an approach based on how native the audio needs to sound in each language:

Approach	When to Use
Create separate voices per language	Best quality — one native voice per market
Use stock voices	Pre-built voices already native per language
Accept non-native accent	When brand consistency matters more than native fluency
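When a project mixes these approaches, the choice can be expressed as a simple lookup: use a native voice where one exists for the language, and fall back to a single brand voice otherwise. The Python sketch below illustrates that logic only. The pick_voice function, the language codes, and the voice identifiers are all hypothetical placeholders, not real product values.

```python
def pick_voice(language: str, native_voices: dict[str, str], brand_voice: str) -> str:
    """Prefer a native voice for the language; otherwise use the brand voice."""
    return native_voices.get(language, brand_voice)


# Hypothetical identifiers for illustration only.
native_voices = {"en": "voice-en-native", "de": "voice-de-native"}

german_voice = pick_voice("de", native_voices, "voice-brand")
french_voice = pick_voice("fr", native_voices, "voice-brand")
```

Here German uses its own native voice, while French falls back to the brand voice and will carry that voice's accent.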