WAN2.2 S2V with ComfyUI 　「WAN2.2のS2Vでリップシンクさせる」

Date: 2025.9.4

in English

The WAN2.2 version of the AI model that generates video from audio and an image has been released. It is called WAN2.2 S2V, and the official ComfyUI documentation explains how to use it.

Wan2.2-S2V Audio-Driven Video Generation ComfyUI Native Workflow Example - ComfyUI

You can download the workflow from the Download JSON Workflow link on the page linked above.

1. How to use

Looking at the overall workflow, it is divided into two flows as shown below.

The upper flow is the high-speed version using the LightX2V LoRA, and the bypassed lower flow is the standard version without any acceleration.

Basic usage is described in the brown nodes, but simply loading an image and an audio file into the nodes below is enough to create a video.

The video length is set with the Chunk Length item, and leaving it fixed at 77 seems to be stable. Since the workflow is set to 16 FPS, 77 frames is just under 5 seconds (77 / 16 ≈ 4.8 s). To make a longer video, use the Video S2V Extend nodes below: each node adds 77 frames (at the default setting). If you do not need the extra length, just bypass those nodes.

Also note that if the image dimensions are not multiples of 16, the following error occurs.
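
The frame arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers in the text, not part of the ComfyUI workflow; the function names `video_seconds` and `extend_nodes_needed` are hypothetical, and the sketch assumes every Video S2V Extend node adds exactly one more chunk of the same length.

```python
import math

def video_seconds(chunk_length: int = 77, extend_nodes: int = 0, fps: int = 16) -> float:
    """Length in seconds of the base chunk plus `extend_nodes` extra chunks."""
    total_frames = chunk_length * (1 + extend_nodes)
    return total_frames / fps

def extend_nodes_needed(target_seconds: float, chunk_length: int = 77, fps: int = 16) -> int:
    """Number of Extend nodes needed to cover `target_seconds` of audio.

    Assumption: each Extend node contributes one full chunk of `chunk_length` frames.
    """
    frames_needed = math.ceil(target_seconds * fps)
    chunks = max(1, math.ceil(frames_needed / chunk_length))
    return chunks - 1  # the first chunk comes from the base sampler

if __name__ == "__main__":
    print(video_seconds())                  # base chunk only
    print(video_seconds(extend_nodes=2))    # base chunk + 2 Extend nodes
    print(extend_nodes_needed(10.0))        # Extend nodes for a 10-second clip
```

With the defaults, the base chunk alone is 4.8125 seconds, and a 10-second clip needs 160 frames, that is, three chunks, or two Extend nodes.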

RuntimeError: The size of tensor a (27) must match the size of tensor b (26) at non-singleton dimension 3
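
One way to avoid this error is to pick a target size that is already a multiple of 16 before resizing the image. The helper below is a minimal sketch, assuming that snapping each side to the nearest multiple of 16 is acceptable for your image; `snap_to_multiple` and `snap_size` are hypothetical names and are not ComfyUI functions.

```python
def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round `value` to the nearest multiple of `multiple` (never below one multiple)."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)

def snap_size(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Return a (width, height) pair where both sides are multiples of `multiple`."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)

if __name__ == "__main__":
    print(snap_size(480, 720))  # already valid, returned unchanged
    print(snap_size(427, 640))  # width is snapped to a multiple of 16
```

The resulting size can then be entered in whatever resize step you use before the image reaches the workflow. Snapping changes the aspect ratio slightly, so cropping instead of stretching may be preferable for some images.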

When we input an image of a man giving a speech and the audio of President Kennedy's speech into the above workflow and ran it, the following video was created.

For English audio, lip syncing seems to work almost perfectly.

2. Benchmark

Next, using a Japanese song and an image, we measure the generation time for both the high-speed version using LightX2V and the version without it, under the following conditions.

・Generate 77 frames plus 144 frames from Video S2V Extend.

・The settings for each follow the official workflow: Steps 4 and CFG 1.0 with the acceleration LoRA, Steps 20 and CFG 6.0 without it.

・Only the model is changed, to 14B_Q8_0.gguf.

・A BlockSwap node is added with the Swap value set to the maximum of 40 (this gives the slowest generation speed).

・The image is resized to 480x720.

・The audio is a portion of YOASOBI's IDOL.

・The PC environment used for the measurement is as follows:

　M/B: MPG B550 GAMING PLUS (note that the slot is PCIe 4.0)

　CPU : Ryzen7 5700X

　GPU : RTX 5060 Ti 16GB

　RAM : DDR4 3200 128GB (32GBx4)

The resulting videos are below:

■ With the acceleration LoRA (generation time: 366.33 seconds)

■ Without the acceleration LoRA (generation time: 52 minutes 44 seconds)

Lip sync accuracy seems to be lower than with English audio.

Without the acceleration LoRA, generation took an impractically long time, and the resulting motion was very erratic. With the acceleration LoRA, on the other hand, there was little movement, so I would like something in between the two.
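
For a rough sense of the gap between the two generation times reported above, the figures can be put on a common scale. This is a small sketch using only the numbers from this measurement; the variable names are illustrative, and the per-frame figures assume the 77 + 144 frame count given in the conditions.

```python
def to_seconds(minutes: int = 0, seconds: float = 0.0) -> float:
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds

# Generation times reported in this article.
with_lora = 366.33                  # seconds, Steps 4 / CFG 1.0
without_lora = to_seconds(52, 44)   # 52 min 44 s, Steps 20 / CFG 6.0

# Frame count taken from the measurement conditions (77 + 144).
total_frames = 77 + 144

speedup = without_lora / with_lora
per_frame_with = with_lora / total_frames
per_frame_without = without_lora / total_frames

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    print(f"with LoRA: {per_frame_with:.2f} s/frame")
    print(f"without LoRA: {per_frame_without:.2f} s/frame")
```

On this hardware the run with the LoRA is roughly 8.6 times faster, at about 1.7 seconds per frame versus about 14.3 seconds per frame without it.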

日本語解説（in Japanese）

音声と画像から動画を生成するAIモデルのWAN2.2版がリリースされています。WAN2.2 S2V というものがそれで、ComfyUI公式で使い方が紹介されています。

上記リンク先の Download JSON Workflow からワークフローをダウンロードできます。

１．使い方

ワークフローの全体像を確認すると、以下のように2つのフローに分かれています。

上のフローはLightX2V LoRAを使用した高速版、下のバイパスされているフローは高速化無しの通常版です。

基本的な使い方は茶色のノードに書かれていますが、画像と音声を以下のノードに入れるだけでも動画が作成できます。

動画の尺に関して、Chunk Lengthという項目で長さを指定するのですが、77固定が安定するようです。設定が16FPSになっていますから5秒弱です。それ以上に長い動画を作りたい場合は、以下のVideo S2V Extendで調整します。ノード1つ毎に77フレーム（初期設定の場合）追加されます。長くしたくなければノードをバイパスすれば良いです。

また、画像サイズに関してですが、16の倍数でないと以下のエラーが出ますので注意してください。

RuntimeError: The size of tensor a (27) must match the size of tensor b (26) at non-singleton dimension 3

上記のワークフローに演説している男性の画像とケネディ大統領の演説音声を入力して実行すると、以下のような動画が作成されました。

英語音声の場合、リップシンクはほぼ完ぺきに動作するようです。

２．実測

それでは日本語音声の歌と画像を使用し、LightX2Vを使用した高速版と使用しない場合それぞれについて以下のような条件で実測を行います。

・77フレーム＋Video S2V Extendの144フレームを生成。

・それぞれの設定は公式ワークフローのとおり、高速化LoRAを使用する場合はSteps 4、CFG 1.0。高速化無しの場合はSteps 20、CFG 6.0。

・モデルのみ14B_Q8_0.ggufに変更。

・BlockSwapノードを追加し、Swap設定値を最大の40にします（このため生成速度は最も遅くなります）。

・画像はこちら480x720にリサイズしたものを使います。

・音源はYOASOBIのIDOLの一部を使用します。

・計測時のPC環境は以下となります。

　M/B: MPG B550 GAMING PLUS (スロットが PCIE4.0 な点に注意)

　CPU : Ryzen7 5700X

　GPU : RTX5060ti 16GB

　RAM : DDR4 3200 128GB (32GBx4)

結果の動画は以下となりました。

■ 軽量化LoRAあり（生成時間：366.33 秒）

■ 軽量化LoRAなし（生成時間：52分44秒）

リップシンクの精度は英語に比べて良くないように思えます。

軽量化LoRAなしの場合、現実的ではないほどに時間がかかりました。そして結果も非常に荒ぶっています。とはいえ軽量化LoRAありでは動きが少ないため、この中間が欲しいなと感じました。
