Innovate Futures @ Benji

(For Patreon Supporters) Natively Running WAN 2.1 VACE + GGUF and Integrating with Flux for First Frame Ver.20250521

Added 2025-05-21 15:09:53 +0000 UTC

Tutorial Video : https://www.youtube.com/watch?v=UUCmCyABmSc

GGUF Repo https://huggingface.co/QuantStack/Wan2.1-VACE-14B-GGUF

ComfyUI Repackaged Repo https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main

ComfyUI Blog Post https://blog.comfy.org/p/wan21-vace-native-support-and-ace

Step 1: Setting Up WAN 2.1 Models in ComfyUI

To begin, ensure you’ve downloaded the necessary models for WAN 2.1:

Text-to-Video Models: These allow you to generate videos from textual prompts.
Additional Models for VACE: Specifically, the 14B FP8 and BF16 models are essential for achieving high-quality results.

The models are available from the ComfyUI Hugging Face repository or the AI Videos Repackaged Repo. There you can find pre-packaged models like the 14B single-file model, which includes both WAN 2.1 Base and VACE functionalities. Keep in mind that the 14B model is resource-intensive: the file is at least 34GB, and it needs well over 20GB of VRAM to run efficiently.

For users with lower VRAM, fear not. The GGUF quantized models offer a practical solution. They come in various sizes, ranging from 8-bit down to 4-bit quantized versions, allowing even mid-tier GPUs to handle video generation. While lower quantization levels (like 4-bit) may result in some quality loss, they remain viable for most use cases.
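To get a feel for how much a quantized file might shrink, here is a rough back-of-the-envelope sketch. It assumes the 34GB figure above refers to a 16-bit file and that size scales linearly with the nominal bit width. Real GGUF files include metadata and mixed-precision tensors, so the actual sizes in the GGUF repo will differ; treat the output as a ballpark only.

```python
# Rough file-size estimate for different quantization levels.
# Assumptions (not from the GGUF repo): the 34 GB figure is a 16-bit file,
# and size scales linearly with nominal bits per weight.

FULL_PRECISION_GB = 34.0    # single-file model size quoted in the text
FULL_PRECISION_BITS = 16    # assumed bit width of that file


def estimated_size_gb(bits_per_weight: float) -> float:
    """Scale the 16-bit file size down to the given nominal bit width."""
    return FULL_PRECISION_GB * bits_per_weight / FULL_PRECISION_BITS


if __name__ == "__main__":
    for label, bits in [("8-bit", 8), ("5-bit", 5), ("4-bit", 4)]:
        print(f"{label}: ~{estimated_size_gb(bits):.1f} GB")
```

Compare the estimate with your available VRAM, leaving headroom for the text encoder, VAE, and latents, before picking a quantization level.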

Step 2: Configuring the Native Node Workflow

With the models in place, let’s dive into configuring the native node workflow:

Load the GGUF Model: Start by selecting the appropriate GGUF format file based on your system’s capabilities. For example, if you’re working with limited VRAM, opt for the Q5 quantized model.
Connect Nodes:
- Use the Trim Video Latent node, specific to VACE workflows, to refine your video output.
- Connect the green dot (the trim latent output) to the Trim Video Latent node, ensuring seamless integration between nodes.
- Link the pink dot (the latent output) from the KSampler to the final VAE Decode step.

This setup ensures that all latent outputs are processed correctly, resulting in smooth and coherent video generation.
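The wiring described above can be written down as a small edge list to sanity-check the latent path. This is only a sketch: the node names (WanVaceToVideo, KSampler, TrimVideoLatent, VAEDecode) and socket names are assumptions based on my understanding of the stock native example workflow, so verify them against the nodes in your own ComfyUI install.

```python
# Each edge is (source node, output socket) -> (target node, input socket).
# Node and socket names are assumptions; check them in your ComfyUI install.
EDGES = [
    (("WanVaceToVideo", "latent"), ("KSampler", "latent_image")),
    (("KSampler", "LATENT"), ("TrimVideoLatent", "samples")),
    (("WanVaceToVideo", "trim_latent"), ("TrimVideoLatent", "trim_amount")),
    (("TrimVideoLatent", "LATENT"), ("VAEDecode", "samples")),
]


def downstream_path(start: str, end: str) -> list[str]:
    """Return one chain of node names from start to end, or [] if none."""
    graph: dict[str, list[str]] = {}
    for (src, _), (dst, _) in EDGES:
        graph.setdefault(src, []).append(dst)

    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == end:
            return path
        for nxt in graph.get(node, []):
            if nxt not in path:  # avoid revisiting nodes already on the path
                stack.append(path + [nxt])
    return []


if __name__ == "__main__":
    print(" -> ".join(downstream_path("KSampler", "VAEDecode")))
```

In this sketch the KSampler latent reaches VAE Decode through the trim node, which is where the trim latent value from the VACE node is consumed.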

Step 3: Optimizing Settings for Performance

Before diving into video generation, it’s crucial to optimize your settings:

Sampling Steps: When using the CausVid LoRA, reduce sampling steps to a range of 3 to 8. This speeds up the process while maintaining quality.
CFG Value: Set the CFG value to 1 when using the CausVid LoRA for optimal results.
Resolution Settings: Use the same width and height values (e.g., 808x480) in every node across your workflow. This avoids unnecessary resizing and ensures consistent output quality.

Additionally, consider leveraging low VRAM, high RAM solutions if you’re constrained by GPU memory. This approach allows you to allocate system RAM for processing, enabling smoother operation even on less powerful hardware.
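The sampling and resolution guidelines above can be captured in a small pre-flight check. This is a minimal sketch, not part of ComfyUI: `SamplerSettings`, `check_settings`, and `check_resolutions` are hypothetical helpers, and the node labels in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SamplerSettings:
    steps: int
    cfg: float
    use_causvid_lora: bool


def check_settings(s: SamplerSettings) -> list[str]:
    """Return warnings when settings stray from the CausVid guidelines."""
    warnings = []
    if s.use_causvid_lora:
        if not 3 <= s.steps <= 8:
            warnings.append(f"steps={s.steps}: use 3 to 8 with CausVid LoRA")
        if s.cfg != 1:
            warnings.append(f"cfg={s.cfg}: use 1 with CausVid LoRA")
    return warnings


def check_resolutions(sizes: dict[str, tuple[int, int]]) -> bool:
    """True when every listed node uses the same (width, height)."""
    return len(set(sizes.values())) <= 1


if __name__ == "__main__":
    print(check_settings(SamplerSettings(steps=8, cfg=1.0, use_causvid_lora=True)))
    # Placeholder node labels; substitute the nodes in your own workflow.
    print(check_resolutions({"vace_node": (808, 480), "resize_node": (808, 480)}))
```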

Step 4: Integrating Flux for Enhanced First Frames

One of the standout features of this workflow is the integration of Flux for refining first frames. Here’s how to incorporate it:

Reference Image Input: Begin by loading a reference image that captures the desired pose or style. This image will guide the initial frame generation.
ControlNet Pose: Use ControlNet to ensure accurate pose replication. If your reference image doesn’t match the starting frame of your video, the system will intelligently adapt styles and patterns from the reference to the generated video.
Flux Processing: After generating the first frame with Flux, resize it to match your video dimensions. Pass this refined image back into the workflow as the new starting point.
Final Sampling and Decoding: With the first frame set, proceed to sample and decode the remaining frames. The combination of Flux and ControlNet ensures that character styles and poses remain consistent throughout the video.
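For the resize in the Flux Processing step, one option is to scale the Flux image until it covers the video frame and then center-crop, which avoids stretching the character. The sketch below only computes the numbers; `cover_fit` is a hypothetical helper, and scale-then-center-crop is my choice of resize strategy rather than something the workflow prescribes.

```python
def cover_fit(src_w: int, src_h: int, dst_w: int, dst_h: int):
    """Return (scaled_size, crop_box) that fills dst while keeping aspect ratio.

    scaled_size is (width, height) after uniform scaling; crop_box is
    (left, top, right, bottom) inside the scaled image.
    """
    # Scale by the larger ratio so both dimensions cover the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the scaled image.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)


if __name__ == "__main__":
    # A square 1024x1024 Flux output fitted to the 808x480 example size.
    print(cover_fit(1024, 1024, 808, 480))
```

The resulting size and crop box can then be entered into whichever resize and crop nodes your workflow uses.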

Step 5: Testing and Iteration

Once your workflow is configured, it’s time to test and refine:

Example Scenario: Generate a two-minute video using the CausVid 14B LoRA model with sampling steps set to 8. Monitor VRAM and CPU usage to ensure your system can handle the workload.
Quality Check: Review the output, paying close attention to character tracking and style coherence. The 14B model excels in maintaining consistency, even when characters switch positions or camera angles change.
Processing Time: On a system with 24GB VRAM and 46GB RAM, generating an 81-frame video took approximately 4 minutes. Adjust your settings based on your hardware capabilities.
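To put the timing figure in perspective, the arithmetic below converts it to seconds per frame and to clip length. The 16 fps playback rate and the assumption that render time scales linearly with frame count are mine, not measurements from this workflow, so use the estimate only as a rough guide.

```python
FRAMES = 81
RENDER_SECONDS = 4 * 60   # "approximately 4 minutes" from the text
ASSUMED_FPS = 16          # assumption: playback rate of the output video

seconds_per_frame = RENDER_SECONDS / FRAMES
clip_seconds = FRAMES / ASSUMED_FPS


def estimated_render_seconds(frames: int) -> float:
    """Naive linear extrapolation from the single data point above."""
    return frames * seconds_per_frame


if __name__ == "__main__":
    print(f"{seconds_per_frame:.2f} s of rendering per frame")
    print(f"{clip_seconds:.2f} s of video at {ASSUMED_FPS} fps")
    print(f"~{estimated_render_seconds(161) / 60:.1f} min for 161 frames")
```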

I enhanced this workflow based on the example workflow for native nodes.

Several things this method does better:

Native Node Support: By leveraging ComfyUI’s native nodes, you eliminate the need for external wrappers or additional dependencies, streamlining the entire process.
Resource Efficiency: The inclusion of GGUF quantization models makes high-quality video generation accessible to users with varying hardware setups.
Enhanced Control: Integrating Flux and ControlNet provides unparalleled control over first frames, ensuring that your videos meet your creative vision.