The HunyuanVideo15ImageToVideo node prepares conditioning and latent space data for video generation based on the HunyuanVideo 1.5 model. It creates an initial latent representation for a video sequence and can optionally integrate a starting image or a CLIP vision output to guide the generation process.

## Inputs

| Parameter | Description | Data Type | Required | Range |
| --- | --- | --- | --- | --- |
| `positive` | The positive conditioning prompts that describe what the video should contain. | CONDITIONING | Yes | - |
| `negative` | The negative conditioning prompts that describe what the video should avoid. | CONDITIONING | Yes | - |
| `vae` | The VAE (Variational Autoencoder) model used to encode the starting image into the latent space. | VAE | Yes | - |
| `width` | The width of the output video frames in pixels. Must be divisible by 16. (default: 848) | INT | No | 16 to MAX_RESOLUTION, step: 16 |
| `height` | The height of the output video frames in pixels. Must be divisible by 16. (default: 480) | INT | No | 16 to MAX_RESOLUTION, step: 16 |
| `length` | The total number of frames in the video sequence. With a minimum of 1 and a step of 4, valid values are 1, 5, 9, and so on (one more than a multiple of 4). (default: 33) | INT | No | 1 to MAX_RESOLUTION, step: 4 |
| `batch_size` | The number of video sequences to generate in a single batch. (default: 1) | INT | No | 1 to 4096 |
| `start_image` | An optional starting image used to condition the first frames of the video. If provided, it is encoded with the `vae`. If a batch of images is supplied, only the first `length` images are used. | IMAGE | No | - |
| `clip_vision_output` | Optional CLIP vision embeddings to provide additional visual conditioning for the generation. | CLIP_VISION_OUTPUT | No | - |
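
The Range column above can be read as a set of snapping rules: `width` and `height` move in steps of 16 starting at 16, and `length` moves in steps of 4 starting at 1. The following is a minimal sketch of those rules in plain Python. The helper name `snap_dimensions` is hypothetical, it is not part of the node, and the upper bound `MAX_RESOLUTION` is deliberately not checked because its value is not stated here.

```python
def snap_dimensions(width: int, height: int, length: int) -> tuple[int, int, int]:
    """Round requested values down to the nearest values the Range column allows.

    Hypothetical helper for illustration only; the node itself may reject or
    round values differently.
    """
    if width < 16 or height < 16 or length < 1:
        raise ValueError("width and height must be >= 16, length must be >= 1")
    snapped_width = (width // 16) * 16            # 16, 32, 48, ...
    snapped_height = (height // 16) * 16          # 16, 32, 48, ...
    snapped_length = ((length - 1) // 4) * 4 + 1  # 1, 5, 9, ...
    return snapped_width, snapped_height, snapped_length


print(snap_dimensions(848, 480, 33))  # the defaults are already valid
print(snap_dimensions(850, 490, 32))  # rounded down to (848, 480, 29)
```

Rounding down is an arbitrary choice made for this sketch; rounding to the nearest valid value would be equally consistent with the table.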

**Note:** When a `start_image` is provided, it is automatically resized to match the specified `width` and `height` using bilinear interpolation. The first `length` frames of the image batch are used. The encoded image is then added to both the `positive` and `negative` conditioning as a `concat_latent_image` with a corresponding `concat_mask`. The mask is set to 0.0 for the frames covered by the starting image and 1.0 for the remaining frames.
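
The frame selection and mask layout described in this note can be sketched in plain Python. This is an illustration under assumptions, not the node's implementation: the function names are hypothetical, lists stand in for tensors, and the sketch does not say whether the real mask is indexed in pixel frames or latent frames.

```python
def select_start_frames(frames: list, length: int) -> list:
    """Keep only the first `length` frames of the supplied image batch."""
    return frames[:length]


def build_concat_mask(total_frames: int, image_frames: int) -> list[float]:
    """Return one mask value per frame.

    0.0 marks frames covered by the starting image,
    1.0 marks the remaining frames.
    """
    covered = min(max(image_frames, 0), total_frames)
    return [0.0] * covered + [1.0] * (total_frames - covered)


# A single starting image followed by eight frames left for the model.
start_frames = select_start_frames(["image_0"], length=33)
mask = build_concat_mask(total_frames=9, image_frames=len(start_frames))
print(mask)  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```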

## Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| `positive` | The modified positive conditioning, which includes the encoded starting image, the CLIP vision output, or both when they are provided. | CONDITIONING |
| `negative` | The modified negative conditioning, which includes the encoded starting image, the CLIP vision output, or both when they are provided. | CONDITIONING |
| `latent` | An empty latent tensor with dimensions configured for the specified batch size, video length, width, and height. | LATENT |
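
To make the relationship between the inputs and the `latent` size concrete, here is a rough sketch of how such a shape could be derived. Everything specific in it is an assumption: the step sizes in the input table (4 for `length`, 16 for `width` and `height`) suggest a temporal compression of 4 and a spatial compression of 16, but this page does not confirm them, nor does it state the channel count or the axis order. The function name and the `channels=32` value in the usage line are placeholders.

```python
def empty_latent_shape(
    batch_size: int,
    length: int,
    height: int,
    width: int,
    *,
    channels: int,
    temporal_factor: int = 4,   # assumption inferred from the `length` step
    spatial_factor: int = 16,   # assumption inferred from the width/height step
) -> tuple[int, int, int, int, int]:
    """Return an assumed (batch, channels, frames, height, width) latent shape."""
    latent_frames = (length - 1) // temporal_factor + 1
    return (
        batch_size,
        channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


# Default inputs from the table; channels=32 is a placeholder, not a documented value.
print(empty_latent_shape(1, 33, 480, 848, channels=32))  # (1, 32, 9, 30, 53)
```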


