AI avatars used to take half a day of studio recording. With Lumen's Avatar V model (the same family as HeyGen's flagship), you need only one clear photo and a script. That's it.
Pick a reference
Two options work best:
- A single high-quality photo — well-lit, facing the camera, eyes visible. Avoid sunglasses, hats that shade the eyes, or strong side light. Phone selfies near a window are perfect.
- A 15-second clip of you talking — better for long-form, because Lumen captures your real gestures and energy as the foundation.
Write a script that sounds like you
Read it out loud first. If you wouldn't actually say it, the avatar will sound stilted. Use contractions, short sentences, and occasional CAPS to emphasise words. Avatar V is audio-driven, so emphasis in your typed script shapes the voice read, and the voice read in turn shapes the avatar's face.
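If you want a quick sanity check before pasting a script in, the advice above can be roughed out in a few lines of Python. This is only a sketch: `review_script` is a hypothetical helper, not part of Lumen, and the 20-word cutoff for a "short" sentence is an assumed threshold rather than any Lumen limit.

```python
import re

# Assumed cutoff for a "short" sentence; tune it to your own speaking style.
MAX_WORDS = 20


def review_script(script: str, max_words: int = MAX_WORDS) -> dict:
    """Flag overly long sentences and list the words typed in CAPS."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    # A sentence is "long" if it has more words than the cutoff.
    long_sentences = [s for s in sentences if len(s.split()) > max_words]
    # Runs of two or more capital letters are treated as emphasis words.
    emphasis_words = re.findall(r"\b[A-Z]{2,}\b", script)
    return {
        "sentences": len(sentences),
        "long_sentences": long_sentences,
        "emphasis_words": emphasis_words,
    }


if __name__ == "__main__":
    report = review_script("I'm SO glad you're here. Let's keep this short.")
    print(report)
```

A script with many flagged sentences, or with emphasis on nearly every line, is probably worth another read-aloud pass.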
Voice + language
Pick a voice that matches the energy of your script. Lumen covers 175+ languages and automatically cross-renders the voice into the language you select, so a friendly American voice can deliver the same script in Japanese with the same warmth.
Background and motion
- Studio — soft cream, neutral.
- Office — for B2B content.
- Cafe / Garden — warmer settings, great for personal content.
- Transparent — alpha-channel export, drop into any video editor.
Motion options are Still, Subtle, or Expressive. Subtle is the right default for almost everything. Use Expressive only for high-energy content like ads or sales pitches.
Generate and review
A 30-second avatar renders in roughly 90 seconds. Treat your first generation as a review pass and check three things: hand gestures (do they distract?), eye contact (does the avatar look at the right spot?), and mouth shapes (do the lips match the audio?).
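For planning longer videos, the render figure above works out to roughly 3 seconds of rendering per second of video. The sketch below simply scales that ratio. It assumes render time grows linearly with length, which the figure above does not confirm, and `estimate_render_seconds` is a hypothetical helper rather than anything Lumen provides.

```python
# Ratio taken from the figure above: about 90 s of rendering for a 30 s avatar.
RENDER_SECONDS_PER_VIDEO_SECOND = 90 / 30


def estimate_render_seconds(video_seconds: float) -> float:
    """Rough render-time estimate, assuming time scales linearly with length."""
    if video_seconds < 0:
        raise ValueError("video_seconds must be non-negative")
    return video_seconds * RENDER_SECONDS_PER_VIDEO_SECOND


if __name__ == "__main__":
    for length in (30, 60, 120):
        print(f"{length} s video: about {estimate_render_seconds(length):.0f} s to render")
```

Treat the result as a ballpark for scheduling review time, not a guarantee.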
Tip — drive realism with audio
The single biggest lever for avatar quality is the audio you feed it. Monotone audio produces a flat avatar; a warm, expressive read produces a warm, expressive avatar. If your generations feel lifeless, re-record the voiceover in Lumen with the same voice but more energy in the read.