The WanSCAILToVideo node prepares conditioning and an empty latent for video generation. It processes optional inputs (a reference image, a pose video, and a CLIP vision output) and embeds them in the positive and negative conditioning for a video model. The node outputs the modified conditioning and a blank latent tensor sized for the requested video dimensions.

## Inputs

| Parameter | Description | Data Type | Required | Range |
| --- | --- | --- | --- | --- |
| `positive` | The positive conditioning input. | CONDITIONING | Yes | - |
| `negative` | The negative conditioning input. | CONDITIONING | Yes | - |
| `vae` | The VAE model used for encoding images and video frames. | VAE | Yes | - |
| `width` | The width of the output video in pixels (default: 512). Must be divisible by 8. | INT | Yes | 32 to MAX_RESOLUTION |
| `height` | The height of the output video in pixels (default: 896). Must be divisible by 8. | INT | Yes | 32 to MAX_RESOLUTION |
| `length` | The number of frames in the video (default: 81). Values advance in steps of 4 from the minimum of 1, so valid lengths have the form 4n + 1 (for example 1, 5, 9, ..., 81). | INT | Yes | 1 to MAX_RESOLUTION |
| `batch_size` | The number of videos to generate in a batch (default: 1). | INT | Yes | 1 to 4096 |
| `clip_vision_output` | Optional CLIP vision output for conditioning. | CLIP_VISION_OUTPUT | No | - |
| `reference_image` | An optional reference image for conditioning. | IMAGE | No | - |
| `pose_video` | Video used for pose conditioning. Will be downscaled to half the resolution of the main video. | IMAGE | No | - |
| `pose_strength` | Strength of the pose latent (default: 1.0). | FLOAT | Yes | 0.0 to 10.0 |
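| `pose_start` | Point in the sampling process, as a fraction from 0.0 to 1.0, at which pose conditioning starts (default: 0.0). | FLOAT | Yes | 0.0 to 1.0 |
| `pose_end` | Point in the sampling process, as a fraction from 0.0 to 1.0, at which pose conditioning ends (default: 1.0). | FLOAT | Yes | 0.0 to 1.0 |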

**Note:** Only the first `length` frames of `pose_video` are used, and only the first image in the `reference_image` batch is used. When `reference_image` is provided, a zero-filled latent of the same size is used for the negative conditioning. When `clip_vision_output` is provided, it is applied to both the positive and negative conditioning.
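
As a rough illustration of the pose video handling described above, the following sketch computes how many pose frames would be used and the resolution they would be scaled to. The function name `pose_video_plan` is hypothetical, and the use of integer division for the half-resolution size is an assumption; the node's actual resizing and rounding may differ.

```python
def pose_video_plan(num_pose_frames, width, height, length):
    """Hypothetical helper: summarize how a pose video would be prepared.

    Assumes frames beyond `length` are dropped and that "half the
    resolution" means integer-halving each dimension (an assumption).
    """
    return {
        # Only the first `length` frames are used.
        "frames_used": min(num_pose_frames, length),
        # Pose frames are scaled to half the main video size.
        "pose_width": width // 2,
        "pose_height": height // 2,
    }


plan = pose_video_plan(num_pose_frames=120, width=512, height=896, length=81)
print(plan)
```

With the default `width`, `height`, and `length`, a 120-frame pose video would contribute its first 81 frames at 256 by 448 pixels under these assumptions.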

## Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| `positive` | The modified positive conditioning, potentially containing embedded reference image latents, CLIP vision output, or pose video latents. | CONDITIONING |
| `negative` | The modified negative conditioning, potentially containing a zero-filled reference latent, CLIP vision output, or pose video latents. | CONDITIONING |
| `latent` | An empty latent tensor of shape `[batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]`. | LATENT |
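
The latent shape formula in the table can be checked with a few lines of plain Python. This is a minimal sketch that only reproduces the arithmetic; the function name `wan_latent_shape` is hypothetical and is not part of the node's API.

```python
def wan_latent_shape(width, height, length, batch_size=1):
    """Return the latent shape given by the formula in the outputs table.

    Hypothetical helper for illustration only.
    """
    latent_frames = ((length - 1) // 4) + 1  # temporal compression by 4
    return [batch_size, 16, latent_frames, height // 8, width // 8]


# Default node settings: 512 x 896, 81 frames, batch size 1.
print(wan_latent_shape(512, 896, 81))  # [1, 16, 21, 112, 64]
```

Because of the integer division, any `length` from 81 through 84 maps to the same 21 latent frames, and 85 is the next value that adds a latent frame.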


