# WanSCAILToVideo

The WanSCAILToVideo node prepares conditioning and an empty latent space for video generation. It processes optional inputs like reference images, pose videos, CLIP vision outputs, and previous frame chunks, embedding them into the positive and negative conditioning for a video model. The node outputs the modified conditioning and a blank latent tensor of the specified video dimensions.

## Inputs

| Parameter | Description | Data Type | Required | Range |
|-----------|-------------|-----------|----------|-------|
| `positive` | The positive conditioning input. | CONDITIONING | Yes | - |
| `negative` | The negative conditioning input. | CONDITIONING | Yes | - |
| `vae` | The VAE model used for encoding images and video frames. | VAE | Yes | - |
| `width` | The width of the output video in pixels (default: 512). Must be divisible by 32. | INT | Yes | 32 to MAX_RESOLUTION |
| `height` | The height of the output video in pixels (default: 896). Must be divisible by 32. | INT | Yes | 32 to MAX_RESOLUTION |
| `length` | The number of frames in the video (default: 81). Adjusted in steps of 4 starting from 1, so valid values have the form 4n + 1 (1, 5, 9, ..., 81). | INT | Yes | 1 to MAX_RESOLUTION |
| `batch_size` | The number of videos to generate in a batch (default: 1). | INT | Yes | 1 to 4096 |
| `pose_video` | Video used for pose conditioning. Will be downscaled to half the resolution of the main video. | IMAGE | No | - |
| `pose_video_mask` | SCAIL-2 only. Colored per-identity SAM3 mask video at the same resolution as `pose_video`. | IMAGE | No | - |
| `replacement_mode` | SCAIL-2 only. False = Animation Mode (`pose_video_mask` should have a black background). True = Replacement Mode (`pose_video_mask` should have a white background). Default: False. | BOOLEAN | No | - |
| `pose_strength` | Strength of the pose latent (default: 1.0). | FLOAT | Yes | 0.0 to 10.0 |
| `pose_start` | Start step of the pose conditioning (default: 0.0). | FLOAT | Yes | 0.0 to 1.0 |
| `pose_end` | End step of the pose conditioning (default: 1.0). | FLOAT | Yes | 0.0 to 1.0 |
| `reference_image` | Reference image. For multiple references, composite them all onto a single image. | IMAGE | No | - |
| `reference_image_mask` | SCAIL-2 only. Colored reference mask at the same resolution as `reference_image`. | IMAGE | No | - |
| `clip_vision_output` | CLIP vision features for conditioning. The model was trained with stretch resizing to the aspect ratio. | CLIP_VISION_OUTPUT | No | - |
| `video_frame_offset` | Cumulative output frame index at which this chunk begins. Wire from the previous chunk's `video_frame_offset` output (default: 0). | INT | Yes | 0 to MAX_RESOLUTION |
| `previous_frame_count` | Number of tail frames of `previous_frames` to use as the anchor. SCAIL-2 was trained with 5 (81-frame chunks advancing in 76-frame steps). Default: 5. | INT | Yes | 1 to MAX_RESOLUTION |
| `previous_frames` | SCAIL-2 only. Full decoded output of the previous chunk. Only the last `previous_frame_count` frames are used as the extension anchor. | IMAGE | No | - |

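As a rough illustration of the size constraints in the table above, the following sketch snaps an arbitrary resolution to the nearest multiple of 32 and derives the half-resolution size that `pose_video` is downscaled to. The helper names `snap_to_multiple` and `pose_resolution` are hypothetical and not part of the node, and the resampling method the node uses for downscaling is not specified here.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value to the nearest multiple (halves round up), with a floor of one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def pose_resolution(width: int, height: int) -> tuple[int, int]:
    """Half of the main video resolution, as described for pose_video."""
    return width // 2, height // 2


if __name__ == "__main__":
    width = snap_to_multiple(500)   # nearest multiple of 32
    height = snap_to_multiple(900)  # nearest multiple of 32
    print(width, height, pose_resolution(width, height))
```
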
**Note:**

- Only the first `length` frames of `pose_video` and `pose_video_mask` are processed, and `pose_video` is downscaled to half the resolution of the main video before encoding.
- Only the first image in the `reference_image` batch is used. When provided, it is encoded into a latent and embedded into both the positive and negative conditioning.
- When `clip_vision_output` is provided, it is applied to both the positive and negative conditioning.
- When `previous_frames` is provided, only its last `previous_frame_count` frames are used as the extension anchor, and `video_frame_offset` is adjusted accordingly.
- In Replacement Mode (`replacement_mode=True`), the reference image is composited on a black background using the reference image mask as an alpha matte.

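The chunk chaining described in the note can be sketched as follows. This is a minimal simulation under an assumption the documentation does not spell out: when previous frames are anchored, the chunk is taken to start `previous_frame_count` frames before the incoming offset, which reproduces the 76-frame step quoted for 81-frame chunks. `chain_offsets` is a hypothetical helper, not a ComfyUI function.

```python
def chain_offsets(num_chunks: int, length: int = 81, previous_frame_count: int = 5) -> list[int]:
    """Return the assumed start frame of each chunk when offsets are wired chunk to chunk."""
    offset_in = 0  # video_frame_offset input of the first chunk
    starts = []
    for index in range(num_chunks):
        has_previous = index > 0  # the first chunk has no previous_frames
        # Assumption: the anchored tail frames overlap the previous chunk.
        start = offset_in - previous_frame_count if has_previous else offset_in
        starts.append(start)
        # Assumed output: adjusted offset + length, wired into the next chunk.
        offset_in = start + length
    return starts


if __name__ == "__main__":
    print(chain_offsets(3))
```

With the defaults, three chained chunks would start at frames 0, 76, and 152 under this assumption.
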
## Outputs

| Output Name | Description | Data Type |
|-------------|-------------|-----------|
| `positive` | The modified positive conditioning, potentially containing embedded reference image latents, CLIP vision output, pose video latents, driving masks, reference masks, or previous frame latents. | CONDITIONING |
| `negative` | The modified negative conditioning, potentially containing embedded reference image latents, CLIP vision output, pose video latents, driving masks, reference masks, or previous frame latents. | CONDITIONING |
| `latent` | An empty latent tensor of shape `[batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]`. When `previous_frames` is provided, the latent is partially filled with encoded previous frames and a noise mask is included. | LATENT |
| `video_frame_offset` | The adjusted `video_frame_offset` plus `length`. Wire it into the next chunk's `video_frame_offset` input for sequential video generation. | INT |

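To make the `latent` shape formula concrete, here is a small sketch that evaluates it for given settings. `latent_shape` is a hypothetical helper that only computes the shape from the formula in the table; it does not allocate a tensor or call any ComfyUI code.

```python
def latent_shape(width: int, height: int, length: int, batch_size: int = 1) -> tuple[int, int, int, int, int]:
    """Evaluate [batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]."""
    latent_frames = (length - 1) // 4 + 1  # temporal size of the latent
    return (batch_size, 16, latent_frames, height // 8, width // 8)


if __name__ == "__main__":
    # Default node settings: 512 x 896, 81 frames, batch size 1.
    print(latent_shape(512, 896, 81))
```

For the default settings (512 x 896, 81 frames, batch size 1) the formula evaluates to `(1, 16, 21, 112, 64)`.
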
