(ComfyUI Workflow) Fantasy Talking + Dia TTS + Flux - Multimodal AI Talking Avatar - Version 20250430
Added 2025-04-30 13:01:11 +0000 UTC
Tutorial Video: https://youtu.be/bSssQdqXy9A
Related Post: https://www.patreon.com/posts/127808092
Alibaba's Fantasy Talking represents a significant leap forward in talking avatar technology. When combined with Dia TTS for voice generation and Flux for character creation, you unlock a powerful workflow for producing dynamic, lifelike AI influencers and animated characters. In this post, we'll explore how to integrate these cutting-edge tools into an advanced content creation pipeline.
The Power Trio: Fantasy Talking, Dia TTS, and Flux
Fantasy Talking: Next-Generation Talking Avatars
Alibaba's Fantasy Talking builds on the Wan 2.1 base model to deliver realistic talking avatars. It differs from earlier lip-sync tools in three ways:
Whole-image rendering: Unlike models that only animate a bounding box around the mouth, Fantasy Talking regenerates the entire image frame-by-frame, resulting in more natural facial expressions and body language
Precise lip-syncing: When properly configured (the FPS setting matters most), mouth movements track the audio closely
Body language control: Through text prompts, you can influence how much your avatar moves, from subtle head tilts to expressive hand gestures
Dia TTS: Expressive Voice Generation
Dia TTS brings your avatars to life with:
Life-like speech with natural pauses and intonation
Non-verbal sounds like laughter and breathing
Potential for voice cloning (though quality may vary in the public version)
Flux: Consistent Character Creation
Flux allows you to:
Generate multiple consistent images of your AI influencer
Train LoRAs for specific character styles
Create a library of poses and expressions for your avatar
Setting Up the Advanced Workflow
Prerequisites
ComfyUI installed and configured
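Fantasy Talking model downloaded (the FP16 file, fantasytalking_fp16.safetensors, about 1.6GB, is recommended)
Wan 2.1 image-to-video base model (480p or 720p resolution)
Access to Dia TTS (the Hugging Face Space demo or the ComfyUI custom node linked at the end of this post)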
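If you prefer to script the model download, note that the ComfyUI model link at the end of this post points at the file's web page. On Hugging Face, swapping the /blob/ path segment for /resolve/ gives the direct-download form of the same link. A minimal sketch of that rewrite follows; the helper name is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_direct_download(url: str) -> str:
    """Turn a Hugging Face file-page URL (/blob/) into a direct-download URL (/resolve/)."""
    parts = urlsplit(url)
    # Rewrite only the first "/blob/" segment of the path; keep everything else as-is.
    path = parts.path.replace("/blob/", "/resolve/", 1)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# File-page link taken from the resource list at the end of this post.
page = "https://huggingface.co/Kijai/WanVideo_comfy/blob/main/fantasytalking_fp16.safetensors"
print(to_direct_download(page))
```

The resulting URL can be handed to any downloader (browser, curl, wget).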
Step 1: Character Generation with Flux
Begin by creating your base character:
Set up your Flux workflow in ComfyUI
Generate multiple images of your character in different poses
Optionally train a LoRA for consistent character generation
Save your favorite images for use in the talking avatar pipeline
Step 2: Voice Generation with Dia TTS
Create your audio track:
Prepare your script with Dia's speaker tags ([S1], [S2]) and shape pauses and emphasis with punctuation (Dia reads plain text, not SSML)
Add notations for non-verbal sounds in parentheses (e.g., (laughs), (sighs))
Generate the audio file (WAV or MP3 format)
For best results, consider:
Keeping fast-paced segments under 5 seconds
Adding natural pauses between sentences
Using voice cloning if you need a specific vocal tone
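For longer scripts it can help to assemble the Dia input programmatically. The sketch below assumes the script format shown in the Dia README, where each turn starts with [S1] or [S2] and non-verbal cues are written in parentheses. The function name and the sample lines are illustrative only.

```python
def build_dia_script(turns):
    """Join (speaker_number, text) turns into one Dia-style script string."""
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError("Dia scripts use two speakers: 1 or 2")
        # Each turn is prefixed with its speaker tag, e.g. "[S1] Hello."
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


script = build_dia_script([
    (1, "Welcome back to the channel. (laughs)"),
    (2, "Today we are testing a talking avatar."),
])
print(script)
```

Paste the printed string into the Dia demo or the custom node's text input.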
Step 3: Animation with Fantasy Talking
Bring your character to life:
Load your Flux-generated image into the Fantasy Talking workflow
Input your Dia TTS audio file
Configure key settings:
FPS: Match to your audio's pace (start with 24, adjust up to 60 for fast speech)
CFG scale: Higher values (7-10) for more movement, lower (3-5) for subtlety
Steps: 30-50 for quality, adjust based on your hardware
Resolution: 512x512 works well, but you can go up to 1024x1024 with sufficient VRAM
Add movement prompts (e.g., "subtle head nods," "expressive hand gestures")
Generate and review the output
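Before a long render, it is worth working out how many frames the clip needs at your chosen FPS. The sketch below reads the audio length with Python's standard wave module and rounds the frame count up to the form 4n + 1. That rounding rule is an assumption based on how Wan 2.1 video models usually expect frame counts (the 81 frames in the log at the end of this post fit the pattern). The function names are illustrative.

```python
import io
import math
import wave


def wav_duration_seconds(source):
    """Duration of a PCM WAV; `source` is a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_for_audio(duration_s, fps, multiple=4):
    """Smallest frame count of the form multiple*n + 1 that covers the audio."""
    needed = math.ceil(duration_s * fps)
    n = max(0, math.ceil((needed - 1) / multiple))
    return multiple * n + 1


# Demo: build 3.5 seconds of silence in memory instead of reading a real file.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)       # mono
    out.setsampwidth(2)       # 16-bit samples
    out.setframerate(16000)   # 16 kHz
    out.writeframes(b"\x00\x00" * 56000)
buf.seek(0)

duration = wav_duration_seconds(buf)
print(duration, frames_for_audio(duration, fps=24))
```

In practice, pass the path of your Dia TTS WAV file to wav_duration_seconds and use the FPS you set in the workflow.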
Pro Tips for Optimal Results
FPS is crucial: Fast dialogue requires higher FPS (30-60) to maintain sync
Test incrementally: Generate short clips to verify sync before long renders
Layer your workflow:
First pass: Focus on lip sync
Second pass: Add body language
Third pass: Fine-tune expressions
Hardware considerations:
For lower VRAM systems, use 480p models
Consider cloud options for high-resolution renders
Post-processing:
Use video editing software to blend multiple takes
Add subtle background motion for more dynamism
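To test incrementally, you can cut the first few seconds of the audio track and render only that. A minimal sketch using the standard wave module follows; it handles uncompressed PCM WAV only, and the helper names are made up for illustration.

```python
import io
import wave


def trim_wav(source, dest, seconds):
    """Copy the first `seconds` of a PCM WAV from `source` to `dest`."""
    with wave.open(source, "rb") as src:
        params = src.getparams()
        frames = src.readframes(int(seconds * src.getframerate()))
    with wave.open(dest, "wb") as out:
        out.setparams(params)
        # The frame count in the header is corrected when the writer closes.
        out.writeframes(frames)


def make_silence(seconds, rate=8000):
    """Build an in-memory mono 16-bit WAV of silence, for the demo only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


full = make_silence(10)
clip = io.BytesIO()
trim_wav(full, clip, seconds=3)
clip.seek(0)
with wave.open(clip, "rb") as check:
    clip_seconds = check.getnframes() / check.getframerate()
print(clip_seconds)
```

With real files, call trim_wav with two paths, for example the full Dia output and a short test file, then feed the short file to the Fantasy Talking workflow.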
Overcoming Common Challenges
Out-of-sync audio:
Solution: Increase FPS and ensure consistent frame rates across all nodes
Robotic movements:
Solution: Adjust CFG scale and add more descriptive movement prompts
VRAM limitations:
Solution: Reduce resolution or use the Hugging Face Space demo
Unnatural facial expressions:
Solution: Experiment with different base images and adjust the CLIP vision settings
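Mismatched frame rates between nodes are easy to miss in a large graph. The sketch below scans a workflow exported in ComfyUI's API format, assumed here to be a JSON object that maps node ids to entries with "class_type" and "inputs", and lists every literal frame-rate input it finds. The input names "fps" and "frame_rate" and the node class names in the demo are assumptions, so adjust them to the nodes in your own workflow.

```python
def collect_frame_rates(workflow, keys=("fps", "frame_rate")):
    """Map 'node_id:class_type.input' to its value for every literal frame-rate input."""
    found = {}
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            # Linked inputs are lists such as ["12", 0]; only literal numbers count.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if name in keys and is_number:
                label = f"{node_id}:{node.get('class_type', '?')}.{name}"
                found[label] = value
    return found


def frame_rates_consistent(workflow):
    """True when all literal frame-rate inputs share one value (or none exist)."""
    return len(set(collect_frame_rates(workflow).values())) <= 1


# Hypothetical two-node workflow; the class names are placeholders, not real nodes.
demo = {
    "5": {"class_type": "ExampleAudioEmbeds", "inputs": {"fps": 24, "audio": ["3", 0]}},
    "9": {"class_type": "ExampleVideoCombine", "inputs": {"frame_rate": 30, "images": ["8", 0]}},
}
print(collect_frame_rates(demo))
print(frame_rates_consistent(demo))
```

To check a real workflow, load the exported file with json.load and pass the resulting dictionary to frame_rates_consistent.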
Pushing the Boundaries
For truly next-level results:
Emotion tagging: Steer emotion and tone in Dia TTS with a reference audio clip and non-verbal tags, then match the facial expressions through your movement prompts
Scene transitions: Generate multiple clips with different poses and edit together
Interactive avatars: Combine with chatbot technology for real-time responses
Multi-character scenes: Render characters separately and composite in editing

Conclusion
The combination of Fantasy Talking, Dia TTS, and Flux represents a significant advancement in AI-generated video content. The learning curve can be steep, particularly when balancing FPS settings against the pace of the audio, but the results justify the effort. This workflow is particularly well suited to:
Digital influencers and brand ambassadors
Educational content and tutorials
Interactive storytelling experiences
Personalized video messaging
As these tools continue to evolve, we're rapidly approaching a future where AI-generated characters are indistinguishable from human performers in many contexts. By mastering this advanced workflow now, you're positioning yourself at the forefront of this creative revolution.
FantasyTalking
https://huggingface.co/acvlab/FantasyTalking
Demo: https://huggingface.co/spaces/acvlab/FantasyTalking
ComfyUI Model Download: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/fantasytalking_fp16.safetensors
Dia TTS: https://github.com/nari-labs/dia
Custom node: https://github.com/nobrainX2/comfyUI-customDia
Demo: https://huggingface.co/spaces/nari-labs/Dia-1.6B
Attached 2 workflows:
1 - Dia TTS with Fantasy Talking
2 - Flux character LoRA generation, Dia TTS with Fantasy Talking
Have Fun :)
Comments
Block swap memory summary:
Transformer blocks on cpu: 6078.91MB
Transformer blocks on cuda:0: 10131.52MB
Total memory used by transformer blocks: 16210.43MB
Non-blocking memory transfer: True
----------------------
TeaCache: Using cache device: cpu
Sampling 81 frames at 480x832 with 30 steps
Hi, I'm stuck at this point. Can you help me? I'm using a 4080 with 12GB VRAM.
Thap Huy
2025-05-03 08:50:36 +0000 UTC