Innovate Futures @ Benji

For Patreon Supporters - Wan Fun Control - ImageRestyle V2V (Version 20250328) (Workflow)

Added 2025-03-28 17:05:16 +0000 UTC

Video: https://youtu.be/YiVkevHuXIU

Related Post: https://www.patreon.com/posts/125384420

In this post, I’ll walk you through my ComfyUI workflow for generating AI-enhanced videos using Alibaba’s WAN 2.1 Fun Control models. This powerful tool allows you to restyle existing videos while preserving motion consistency—perfect for creating influencer content, music videos, or AI-driven animations.

Workflow Overview

The WAN 2.1 Video-to-Video (V2V) workflow leverages ControlNet-guided diffusion to:
✅ Restyle videos (e.g., change outfits, backgrounds, or entire aesthetics).
✅ Preserve motion from reference videos (e.g., dance moves, gestures).
✅ Maintain consistency across frames (no flickering or distortions).

Unlike older U-Net-based approaches such as AnimateDiff, WAN 2.1 is built on a diffusion transformer architecture, which tends to give smoother motion and a more coherent style across frames.

Step-by-Step Workflow

1. Setup & Model Installation

Download the WAN 2.1 Fun Control 1.3B safetensors file from Hugging Face.

Place it in ComfyUI/models/diffusion_models/.

Update ComfyUI to the latest version (run git pull in the ComfyUI folder from a terminal).
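If you want to double-check that the model file ended up somewhere under ComfyUI's models folder, a small script like this can help. It is only a sketch: the function name and the "fun" name fragment are my own placeholders, and it simply searches every subfolder of models/ rather than assuming a particular one.

```python
from pathlib import Path


def find_model_files(comfy_root, name_fragment="fun"):
    """List .safetensors files under <comfy_root>/models whose file name
    contains name_fragment (case-insensitive)."""
    models_dir = Path(comfy_root) / "models"
    if not models_dir.is_dir():
        # Nothing to search: wrong root folder or ComfyUI not installed here.
        return []
    fragment = name_fragment.lower()
    return sorted(
        path
        for path in models_dir.rglob("*.safetensors")
        if fragment in path.name.lower()
    )


if __name__ == "__main__":
    # "ComfyUI" is an assumed install folder; point this at your own.
    for path in find_model_files("ComfyUI"):
        print(path)
```

An empty result usually means the file is still in your downloads folder or the root path is wrong.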

2. Key Components

A. Reference Video & ControlNet

Input: A reference video

ControlNet Preprocessors:

DW Pose (for skeletal motion tracking).

Line Art (for outline consistency).

Depth Maps (for spatial depth).

These guide the AI to mimic motions while allowing creative restyling.

B. Style Transfer with Flux

The first frame of the reference video is restyled using:

Text prompts (e.g., "hip-hop dancer in a blue jacket").

LoRA models (for character consistency, if needed).

This "style seed" ensures the entire video follows the new aesthetic.

C. WAN 2.1 Fun Control Processing

The restyled frame + ControlNet data are fed into WAN 2.1’s diffusion pipeline:

Clip Vision Encode: Embeds the style reference.

KSampler: Generates frames (steps: 20, CFG: 7.5).

Skip Layer Guidance: Enhances details (blocks 9-10).
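To keep these values in one place while experimenting, you could note them down in a small settings object. This is just a sketch for bookkeeping, not ComfyUI's API: the class, field names, and the "9-10" range syntax are my own assumptions. Only the numbers (20 steps, CFG 7.5, blocks 9 and 10) come from the workflow.

```python
from dataclasses import dataclass


def parse_block_range(spec):
    """Expand a spec such as "9-10" or "3-5, 9" into a sorted list of block indices."""
    blocks = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(value) for value in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid block range: {part!r}")
            blocks.update(range(start, end + 1))
        else:
            blocks.add(int(part))
    return sorted(blocks)


@dataclass(frozen=True)
class SamplerSettings:
    steps: int = 20                  # sampler steps used in this workflow
    cfg: float = 7.5                 # CFG scale used in this workflow
    skip_layer_blocks: str = "9-10"  # blocks targeted by Skip Layer Guidance

    def as_dict(self):
        return {
            "steps": self.steps,
            "cfg": self.cfg,
            "skip_layer_blocks": parse_block_range(self.skip_layer_blocks),
        }


if __name__ == "__main__":
    print(SamplerSettings().as_dict())
```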

D. Refinement (Optional)

Tile Control LoRA: Upscales and sharpens frames.

CFG-Zero Star: Improves contrast/coherence.

How It Works

Motion Extraction: The ControlNet preprocessors (DW Pose, Line Art) extract poses and outlines from the reference video.

Style Injection: The first frame is restyled via Flux + text prompts.

Diffusion: WAN 2.1 regenerates the video frame-by-frame, blending the new style with the original motion.

Output: A seamless, high-quality video with:

Consistent character designs (thanks to LoRAs).

Stable backgrounds (no flickering).
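The order of the stages above can be sketched in plain Python. This is a rough outline of the data flow only, assuming each stage is passed in as a callable. The function and parameter names are hypothetical stand-ins for the node groups in the workflow, not real ComfyUI calls.

```python
def restyle_video(frames, extract_control, restyle_first_frame, generate):
    """Run the three stages in order and return the generated frames.

    extract_control(frame)       -> control map (pose, line art, depth, ...)
    restyle_first_frame(frame)   -> restyled reference image
    generate(style_frame, maps)  -> list of output frames, one per control map
    """
    if not frames:
        raise ValueError("need at least one input frame")

    # 1. Motion extraction: one control map per reference frame.
    control_maps = [extract_control(frame) for frame in frames]

    # 2. Style injection: only the first frame is restyled.
    style_frame = restyle_first_frame(frames[0])

    # 3. Diffusion: the style frame and the control maps drive generation.
    output = generate(style_frame, control_maps)
    if len(output) != len(frames):
        raise ValueError("generator returned a different number of frames")
    return output


if __name__ == "__main__":
    # Toy stand-ins that work on strings instead of images.
    result = restyle_video(
        ["f0", "f1"],
        extract_control=lambda frame: f"pose({frame})",
        restyle_first_frame=lambda frame: f"styled({frame})",
        generate=lambda style, maps: [f"{style}+{m}" for m in maps],
    )
    print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular preprocessor or model.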

Example Use Cases

Dance Video Restyling

Input: A dancer in casual clothes.

Output: Same moves, but with a cyberpunk outfit and neon-lit backdrop.

AI Influencer Content

Input: A stock video of a person talking.

Output: The same speech delivered by a custom AI avatar.

Animation Enhancement

Input: A rough storyboard animation.

Output: A polished, stylized final render.

Optimization Tips

For Creativity: Use only DW Pose (less restrictive than Line Art).

For Accuracy: Combine Line Art + DW Pose for precise motion tracking.

For Speed: Use the 1.3B model (VRAM-friendly; ~5GB).

For Quality: Refine with Tile LoRA + Skip Layer Guidance.
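If you switch between these tips often, a lookup table makes the trade-offs explicit. This is a sketch under my own naming: the preset keys and option names are hypothetical labels for the nodes mentioned above, and the conflict check only exists because "creativity" and "accuracy" ask for different preprocessors.

```python
PRESETS = {
    "creativity": {"preprocessors": ("dw_pose",)},
    "accuracy": {"preprocessors": ("line_art", "dw_pose")},
    "speed": {"model_size": "1.3B"},
    "quality": {"refiners": ("tile_lora", "skip_layer_guidance")},
}


def build_config(*goals):
    """Merge the presets for the given goals, refusing contradictory choices."""
    config = {}
    for goal in goals:
        if goal not in PRESETS:
            raise KeyError(f"unknown goal: {goal!r}")
        for key, value in PRESETS[goal].items():
            if key in config and config[key] != value:
                raise ValueError(f"{goal!r} conflicts with an earlier goal on {key!r}")
            config[key] = value
    return config


if __name__ == "__main__":
    print(build_config("speed", "quality"))
```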

Conclusion

The WAN 2.1 Video-to-Video workflow in ComfyUI is a breakthrough for AI video generation. By combining ControlNet-guided motion with diffusion-based style transfer, it outperforms older tools like AnimateDiff in consistency and ease of use.

Ready to try it?

Grab the WAN 2.1 Fun Control model here: https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-Control
Follow the workflow above in ComfyUI.
Experiment with poses, styles, and refinements!

Workflow updated (2025-03-29):

In the Flux group, the Empty Latent now uses twice the input resolution, because Flux Union Pro ControlNet works better with images above 1024px.
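The resolution change amounts to simple arithmetic, sketched here. The 2x factor comes from the update above. Rounding down to a multiple of 16 is my own assumption to keep the size latent-friendly, and the function name is hypothetical.

```python
def scaled_latent_size(width, height, scale=2, multiple=16):
    """Scale the input size and round each side down to a multiple of `multiple`."""
    def snap(value):
        # Never return less than one full multiple.
        return max(multiple, (value * scale) // multiple * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    # A 640x360 input becomes a 1280x720 Empty Latent.
    print(scaled_latent_size(640, 360))
```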