The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.

## Inputs

| Parameter | Description | Data Type | Required | Range |
| --- | --- | --- | --- | --- |
| `mode` | The audio input mode. `"single_speaker"` uses one audio input. `"two_speakers"` enables inputs for a second speaker and corresponding masks. | COMBO | Yes | `"single_speaker"`<br>`"two_speakers"` |
| `model` | The base video diffusion model. | MODEL | Yes | - |
| `model_patch` | The model patch containing audio projection layers. | MODELPATCH | Yes | - |
| `positive` | The positive conditioning to guide the generation. | CONDITIONING | Yes | - |
| `negative` | The negative conditioning to guide the generation. | CONDITIONING | Yes | - |
| `vae` | The VAE used to encode images into the latent space. | VAE | Yes | - |
| `width` | The width of the output video in pixels. Must be divisible by 16. (default: 832) | INT | No | 16 - MAX_RESOLUTION |
| `height` | The height of the output video in pixels. Must be divisible by 16. (default: 480) | INT | No | 16 - MAX_RESOLUTION |
| `length` | The number of frames to generate. (default: 81) | INT | No | 1 - MAX_RESOLUTION |
| `clip_vision_output` | Optional CLIP vision output for additional conditioning. | CLIPVISIONOUTPUT | No | - |
| `start_image` | An optional starting image to initialize the video sequence. | IMAGE | No | - |
| `audio_encoder_output_1` | The primary audio encoder output containing features for the first speaker. | AUDIOENCODEROUTPUT | Yes | - |
| `motion_frame_count` | Number of previous frames to use as motion context when extending a sequence. (default: 9) | INT | No | 1 - 33 |
| `audio_scale` | A scaling factor applied to the audio conditioning. (default: 1.0) | FLOAT | No | -10.0 - 10.0 |
| `previous_frames` | Optional previous video frames to extend from. | IMAGE | No | - |
| `audio_encoder_output_2` | The second audio encoder output. Required when `mode` is set to `"two_speakers"`. | AUDIOENCODEROUTPUT | No | - |
| `mask_1` | Mask for the first speaker, required if using two audio inputs. | MASK | No | - |
| `mask_2` | Mask for the second speaker, required if using two audio inputs. | MASK | No | - |
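
The `width` and `height` rows above require values divisible by 16. As a minimal sketch, a workflow script could round an arbitrary size down to the nearest valid value before passing it to the node. The helper name `snap_to_multiple` is hypothetical and is not part of the node or of ComfyUI.

```python
def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round `value` down to the nearest multiple of `multiple`.

    Hypothetical helper for illustration only; the result is never
    smaller than `multiple`, matching the documented minimum of 16.
    """
    snapped = (value // multiple) * multiple
    return max(multiple, snapped)


# The documented defaults are already valid and pass through unchanged.
width = snap_to_multiple(832)
height = snap_to_multiple(480)
```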

**Parameter Constraints:**

* When `mode` is set to `"two_speakers"`, the parameters `audio_encoder_output_2`, `mask_1`, and `mask_2` become required.
* If `audio_encoder_output_2` is provided, both `mask_1` and `mask_2` must also be provided.
* If `mask_1` and `mask_2` are provided, `audio_encoder_output_2` must also be provided.
* If `previous_frames` is provided, it must contain at least as many frames as specified by `motion_frame_count`.
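
The constraints above can be sketched as a standalone pre-flight check. This is an illustration of the listed rules only, not the node's actual validation code: the function name `check_constraints` and its error messages are hypothetical, and `previous_frames` is assumed to be any object whose `len()` equals its frame count.

```python
from typing import Any, List, Optional, Sized


def check_constraints(
    mode: str,
    audio_encoder_output_2: Optional[Any] = None,
    mask_1: Optional[Any] = None,
    mask_2: Optional[Any] = None,
    previous_frames: Optional[Sized] = None,
    motion_frame_count: int = 9,
) -> List[str]:
    """Return a list of violated constraints (empty if all rules hold)."""
    errors: List[str] = []
    has_audio_2 = audio_encoder_output_2 is not None
    has_masks = mask_1 is not None and mask_2 is not None

    # Rule 1: two-speaker mode needs the second audio input and both masks.
    if mode == "two_speakers" and not (has_audio_2 and has_masks):
        errors.append("two_speakers mode requires audio_encoder_output_2, mask_1 and mask_2")

    # Rule 2: a second audio input needs both masks.
    if has_audio_2 and not has_masks:
        errors.append("audio_encoder_output_2 requires both mask_1 and mask_2")

    # Rule 3: both masks need a second audio input.
    if has_masks and not has_audio_2:
        errors.append("mask_1 and mask_2 require audio_encoder_output_2")

    # Rule 4: enough previous frames to cover the motion context.
    if previous_frames is not None and len(previous_frames) < motion_frame_count:
        errors.append("previous_frames has fewer frames than motion_frame_count")

    return errors
```

Calling `check_constraints("single_speaker")` with no optional inputs returns an empty list, while passing only masks or only a second audio input reports the matching rule.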

## Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| `model` | The patched model with audio conditioning applied. | MODEL |
| `positive` | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision). | CONDITIONING |
| `negative` | The negative conditioning, potentially modified with additional context. | CONDITIONING |
| `latent` | The generated video sequence in latent space. | LATENT |
| `trim_image` | The number of frames from the start of the motion context that should be trimmed when extending a sequence. | INT |

