Wan 2.2 Animate: AI Character Swap & Lip‑Sync in ComfyUI

You can now replace a person in a video with a new character, keep the original speech, sync lip movements, and even restyle the entire clip—all for free—with Wan 2.2 Animate inside ComfyUI. In this guide, I’ll walk you through the full workflow I use to get stable results, improve consistency, and reduce common artifacts.
I previously ran into quality issues with Wan 2.2 Animate, so I tested alternative settings and a different workflow structure. The workflow below produced better motion, stronger character consistency when the subject turns or moves, and reliable subject detection, even when the person interacts with objects.
By the end, you’ll know how to set up ComfyUI, load the workflow, swap characters, sync lips, drive animation from another video, and upscale for cleaner final output.
What is Wan 2.2 Animate?
Wan 2.2 Animate is a free video generation model that can perform character replacement, maintain or transfer motion, and support lip-sync based on the original audio or mouth movement in your footage. In ComfyUI, it runs as a node-based workflow, giving you control over input frames, masking, motion source, and output settings.
It supports two main approaches:
- Character swap with the original background kept intact.
- Motion transfer from your source video onto a reference image, keeping the reference image’s background.
Overview
| Area | What It Covers | Key Actions |
|---|---|---|
| Setup | Install ComfyUI and required models | Install ComfyUI, download Wan 2.2 Animate models, load the workflow |
| Hardware | VRAM and performance | Aim for 24 GB GPU VRAM; consider cloud options if local hardware is limited |
| Inputs | Video and reference image | Upload your source video, set frame count and dimensions, add a replacement or stylized character image |
| Controls | Masking, prompts, frame rate | Adjust expand value, write a concise prompt, set fps |
| Motion/Background | Character swap vs. motion transfer | Keep original background or switch to the reference image’s background |
| Optional Speed | Torch compile with Sage Attention | Enable if configured on your system |
| Output | Generation preview and export | Review results; save video from ComfyUI output folder |
| Post | Upscaling and interpolation | Enhance resolution and motion with an external tool |
Key Features of Wan 2.2 Animate
- Character replacement while preserving the original video’s camera motion and background.
- Motion transfer to a reference image, retaining the reference background.
- Lip-sync that matches mouth movement to speech in the source.
- Strong physics and motion fidelity with good subject tracking across turns and movement.
- Flexible prompt control and frame rate matching.
Setup and Requirements
Install ComfyUI
- Download ComfyUI from the official site and run the installer for your operating system.
- Launch ComfyUI and confirm it opens in your browser.
ComfyUI is the environment where you will run the Wan 2.2 Animate workflow.
Hardware and VRAM
- Wan 2.2 Animate typically requires at least 24 GB of GPU VRAM for stable operation (a quick way to check your card is sketched after this list).
- More VRAM speeds up generations and allows longer clips or higher resolutions.
- If your local machine cannot meet this requirement, you can run ComfyUI on cloud services with higher-end GPUs.
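If you are not sure how much VRAM your card has, a minimal check with PyTorch (installed alongside ComfyUI) looks like this; it only reports the hardware's capacity, not how much a given workflow will actually need:

```python
import torch

# Report total VRAM on the first CUDA device; this is the card's capacity,
# not a prediction of what a specific Wan 2.2 Animate run will consume.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, total VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected.")
```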
Get Wan 2.2 Animate Models
- Download the Wan 2.2 Animate model files from the official sources.
- Place them in the correct ComfyUI folders as instructed by each model’s documentation.
- Keep your folder structure organized so the workflow can find the models automatically; a quick placement check is sketched below.
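As a sanity check after downloading, a small script like the sketch below can confirm the files landed where you put them. The folder names follow common ComfyUI conventions and the file names here are placeholders, so adjust both to match the model documentation you followed:

```python
from pathlib import Path

# Placeholder paths and file names -- edit to match your actual downloads.
COMFY_DIR = Path("ComfyUI")
expected = {
    "models/diffusion_models": "wan2.2_animate_model.safetensors",  # placeholder
    "models/text_encoders": "umt5_xxl_text_encoder.safetensors",    # placeholder
    "models/vae": "wan_2.1_vae.safetensors",                        # placeholder
}

for folder, name in expected.items():
    path = COMFY_DIR / folder / name
    print(f"{'found' if path.exists() else 'MISSING'}: {path}")
```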
Load the Workflow and Install Missing Nodes
- Import the Wan 2.2 Animate workflow file by dragging it into ComfyUI.
- If a prompt appears listing missing custom nodes, open the Manager and install the required nodes one by one.
- Restart ComfyUI after installation. The workflow will then load properly.
Workflow Guide: Character Swap and Lip-sync
Prepare Your Source Video
- Click the video input to upload the clip containing the person you want to replace.
- Use 1080p or lower for faster processing while you test. Both vertical and horizontal formats work.
- Keep the subject clear and visible for better tracking.
Shorter clips are easier to iterate on while you dial in settings. Once you’re confident, you can increase length and resolution.
Select Frames to Process
- Set the number of frames you want to process.
- I’ve been able to process 300–400 frames reliably, but results vary by VRAM, resolution, and motion complexity.
- For a quick test, start with around 100–150 frames.
Longer clips require more VRAM and time. Consider cloud GPUs if you want to push beyond your local limits.
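For a rough sense of how frame count and resolution scale, a back-of-envelope estimate of raw pixel data can help. This is not actual VRAM usage (model weights, latents, and attention overhead dominate), but it is useful for comparing settings against each other:

```python
# Raw pixel data only -- NOT a VRAM prediction, just a relative comparison.
def raw_frame_mb(width: int, height: int, frames: int, bytes_per_channel: int = 2) -> float:
    channels = 3  # RGB
    return width * height * channels * bytes_per_channel * frames / (1024 ** 2)

print(f"{raw_frame_mb(576, 1024, 150):.0f} MB of frames at 576x1024, 150 frames")
print(f"{raw_frame_mb(576, 1024, 400):.0f} MB of frames at 576x1024, 400 frames")
```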
Set Output Dimensions
- Choose output dimensions that match your source aspect ratio.
- For vertical video, 576 × 1024 strikes a good balance between quality and speed on mid-range GPUs.
- Higher resolutions increase VRAM usage and processing time.
You can upscale later, so keep initial generation manageable for faster iterations.
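If you want to derive dimensions from your source instead of guessing, here is a small helper sketch. It assumes the model prefers dimensions divisible by 16, which is common for video diffusion models but worth confirming against your workflow's notes:

```python
# Hypothetical helper: keep the source aspect ratio and snap both sides
# to multiples of 16 (assumed constraint -- verify for your setup).
def fit_dimensions(src_w: int, src_h: int, target_long_side: int = 1024) -> tuple[int, int]:
    scale = target_long_side / max(src_w, src_h)
    snap = lambda v: max(16, int(round(v * scale / 16)) * 16)
    return snap(src_w), snap(src_h)

print(fit_dimensions(1080, 1920))  # vertical 1080p source -> (576, 1024)
```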
Choose Your Replacement Character
You have two paths: a completely new character or a stylized version of the same person.
Option A: New Character Image
- Upload a clear image of the character you want to swap in.
- A medium shot or closer works well; full body isn’t required.
- Avoid images where the subject is tiny in frame.
The workflow will replace the subject in your video while preserving the original background.
Option B: Stylized Version of the Same Character
- Export a frame from your source video where the subject is most visible.
- Use an image-to-image tool to transform that frame into your target style (for example, anime, claymation, or 3D).
- Keep the face and key features recognizable so the swap remains consistent.
Step-by-step for image-to-image stylization:
- Open your chosen image-to-image tool.
- Upload the exported frame.
- Enter a clear style prompt (for example, “anime style,” “stop-motion clay look,” or “3D toon shading”).
- Pick a suitable model within the tool and generate.
- Save the stylized image and upload it as the reference image in ComfyUI.
This approach lets you restyle the entire video while keeping motion and lip-sync aligned to the original performance.
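If you prefer to export the base frame programmatically rather than scrubbing through an editor, a minimal OpenCV sketch works; the file name and frame index are placeholders:

```python
import cv2

# Pull one frame from the source clip to use as the stylization base.
cap = cv2.VideoCapture("source_clip.mp4")   # placeholder path
cap.set(cv2.CAP_PROP_POS_FRAMES, 120)       # placeholder frame index
ok, frame = cap.read()
cap.release()
if ok:
    cv2.imwrite("reference_frame.png", frame)
```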
Optional Speed Setting: Torch Compile and Sage Attention
- Un-bypass the WanVideo Torch Compile Settings node and set its Attention Mode to Sage Attention.
- This may speed up generation if Torch compile and the Sage components are installed correctly on your system.
- If you run into instability or see no improvement, keep it bypassed.
This optimization is optional and hardware-dependent.
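For context, the node wraps PyTorch's torch.compile; the sketch below shows the general idea outside ComfyUI. You never need to write this yourself for this workflow:

```python
import torch

# Conceptual only: torch.compile traces the model and fuses kernels, so the
# first call is slower (compilation) and later calls can be faster.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())
compiled = torch.compile(model)

x = torch.randn(1, 64)
_ = compiled(x)   # slow first call: compilation happens here
_ = compiled(x)   # subsequent calls reuse the compiled graph
```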
Refine the Subject Mask
- Find the Grow Mask With Blur node and adjust the Expand value.
- Default is 10 pixels, which is fine if the replacement character is a similar size to the original.
- If your new character is larger, increase this value (for instance, 25) to give the model room to fit the subject without clipping.
Mask expansion helps reduce harsh edges and mismatches around the subject.
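To make the Expand value less abstract, here is a rough sketch of what a grow-plus-blur operation does to a mask. The actual node handles this internally; this is only to illustrate why a larger value gives the new character more room:

```python
import cv2
import numpy as np

def grow_mask_with_blur(mask: np.ndarray, expand: int = 10, blur: int = 15) -> np.ndarray:
    """Dilate the subject mask by `expand` pixels, then soften the edge."""
    kernel = np.ones((expand * 2 + 1, expand * 2 + 1), np.uint8)
    grown = cv2.dilate(mask, kernel)
    k = blur | 1  # GaussianBlur needs an odd kernel size
    return cv2.GaussianBlur(grown, (k, k), 0)
```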
Choose Background and Motion Source
By default, the workflow performs a character swap:
- It keeps your original video’s background and inserts the new character into that scene.
To drive motion from your video onto the reference image while keeping the reference background:
- Disconnect the Background Image and Mask nodes from the WanVideo Animate Embeds node.
- This tells the workflow to use the reference image’s background and apply the motion from your source.
Decide which approach fits your goal before generating.
Prompt and Frame Rate
- In the prompt box, write a short, literal description of the action (for example, “female clown talking”).
- Wan 2.2 Animate defaults to 16 fps. If you want to match your source, set the frame rate in both Video Combine nodes.
- Keep prompts concise; this model responds well to direct descriptions.
Matching fps can help you avoid timing mismatches during editing.
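To find your source clip's frame rate (and length) before setting the Video Combine nodes, a quick OpenCV probe works; the file name is a placeholder:

```python
import cv2

cap = cv2.VideoCapture("source_clip.mp4")   # placeholder path
fps = cap.get(cv2.CAP_PROP_FPS)
frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()
print(f"{fps:.2f} fps, {frames} frames, ~{frames / fps:.1f} s")
```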
Run the Workflow and Preview
- Click the run button to start generation.
- When processing completes, preview the output in the designated node.
- Check edges, lip movement, and overall motion fidelity.
Expect minor artifacts depending on the footage and reference. If needed, tweak mask expansion, frame count, or resolution, then regenerate.
Quality Optimization After Generation
Upscale and Enhance
If the generated video looks soft or low-res, upscale it:
- Open your ComfyUI output folder to find the rendered clip.
- Use a video enhancement app to upscale by 2x or 4x and apply detail enhancement.
- Choose a model tuned for AI-generated content for cleaner results without harsh artifacts.
Upscaling after generation lets you keep iterations fast while still delivering a sharp final.
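If you just want a fast preview before running a dedicated AI upscaler, a plain ffmpeg upscale is one option (assumes ffmpeg is installed; file names are placeholders). It will not recover detail the way an AI model can, but it is quick:

```python
import subprocess

# Simple 2x lanczos upscale -- a fast baseline, not an AI enhancement.
subprocess.run([
    "ffmpeg", "-i", "wan_animate_output.mp4",
    "-vf", "scale=iw*2:ih*2:flags=lanczos",
    "upscaled_2x.mp4",
], check=True)
```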
Frame Interpolation
If you generated at 16 fps:
- Use frame interpolation to convert to a higher frame rate (for example, 24, 30, or 60 fps).
- Interpolation fills in missing frames for smoother motion.
- Export once you are satisfied with sharpness and timing.
This step improves motion fluidity without re-running the full generation.
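One way to do this without extra software is ffmpeg's minterpolate filter (assumes ffmpeg is installed; file names are placeholders). Dedicated interpolation tools, such as RIFE-based apps, usually give cleaner results on AI-generated footage, so treat this as a baseline:

```python
import subprocess

# Motion-interpolate the 16 fps output up to 30 fps.
subprocess.run([
    "ffmpeg", "-i", "wan_animate_output.mp4",
    "-vf", "minterpolate=fps=30",
    "interpolated_30fps.mp4",
], check=True)
```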
Tips and Troubleshooting
- VRAM matters: If you hit errors or slowdowns, reduce resolution, shorten frame count, or move to a cloud GPU with more VRAM.
- Keep inputs clean: Clear, well-lit source footage and a sharp reference image produce better swaps and tracking.
- Mask expansion: Increase the Expand value if you see edge tearing, clipping, or outlines near the subject.
- Frame count: Start small to confirm settings, then scale up frames for longer shots.
- Background choice: For a straightforward replacement, keep the default connections. For motion transfer with the reference background, disconnect the Background Image and Mask nodes as described.
- Prompt simplicity: Stick to a short, direct description. Avoid overly detailed text that can introduce noise.
- Output fps: Match your source fps if you plan to intercut with the original footage in editing.
Conclusion
Wan 2.2 Animate inside ComfyUI makes free character replacement and lip-sync practical with solid motion fidelity and reliable subject tracking. The workflow above focuses on stable settings, mask control, and clear choices for background and motion source, so you can produce clean results and iterate efficiently.
Start with manageable frame counts and moderate resolution, confirm your mask expansion, and keep prompts short. Then upscale and interpolate to finish with a sharp, smooth final video. With careful inputs and a few key tweaks, you can swap characters, restyle performances, and maintain convincing lip-sync—all in a single, repeatable pipeline.