# ElevenLabs Speech to Text

The ElevenLabs Speech to Text node transcribes audio into text through the ElevenLabs API. It supports automatic language detection, speaker identification (diarization), and tagging of non-speech sounds such as music or laughter.

## Inputs

| Parameter | Description | Data Type | Required | Range |
| --- | --- | --- | --- | --- |
| `audio` | Audio to transcribe. | AUDIO | Yes | - |
| `model` | Model to use for transcription. Selecting this model reveals additional parameters. | COMBO | Yes | `"scribe_v2"` |
| `tag_audio_events` | Annotate non-speech sounds such as (laughter) or (music) in the transcript. This parameter is revealed when the `"scribe_v2"` model is selected. (default: False) | BOOLEAN | No | - |
| `diarize` | Annotate which speaker is talking. This parameter is revealed when the `"scribe_v2"` model is selected. (default: False) | BOOLEAN | No | - |
| `diarization_threshold` | Speaker separation sensitivity. Lower values are more sensitive to speaker changes. This parameter is revealed when the `"scribe_v2"` model is selected and `diarize` is enabled. (default: 0.22) | FLOAT | No | 0.1 - 0.4 |
| `temperature` | Randomness control. 0.0 uses model default. Higher values increase randomness. This parameter is revealed when the `"scribe_v2"` model is selected. (default: 0.0) | FLOAT | No | 0.0 - 2.0 |
| `timestamps_granularity` | Timing precision for transcript words. This parameter is revealed when the `"scribe_v2"` model is selected. (default: "word") | COMBO | No | `"word"`<br>`"character"`<br>`"none"` |
| `language_code` | ISO-639-1 or ISO-639-3 language code (e.g., 'en', 'es', 'fra'). Leave empty for automatic detection. (default: "") | STRING | No | - |
| `num_speakers` | Maximum number of speakers to predict. Set to 0 for automatic detection. (default: 0) | INT | No | 0 - 32 |
| `seed` | Seed for reproducibility (determinism not guaranteed). (default: 1) | INT | No | 0 - 2147483647 |

**Note:** `num_speakers` must be 0 whenever `diarize` is enabled. To set an explicit speaker count, disable `diarize`; otherwise leave `num_speakers` at 0.
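
The constraint in the note, together with the ranges from the inputs table, can be checked before a workflow is queued. The following is a minimal sketch, not the node's actual validation code: `check_stt_settings` is a hypothetical helper, and only the parameter names, defaults, and ranges are taken from the table above.

```python
def check_stt_settings(
    diarize: bool,
    num_speakers: int = 0,
    diarization_threshold: float = 0.22,
    temperature: float = 0.0,
) -> list[str]:
    """Return a list of problems found in the settings (empty if none).

    Hypothetical helper: it mirrors the documented ranges and the
    diarize / num_speakers rule, but is not part of the node itself.
    """
    problems = []
    # Documented rule: an explicit speaker count cannot be combined with diarize.
    if diarize and num_speakers > 0:
        problems.append("num_speakers must be 0 when diarize is enabled")
    # Documented range for num_speakers: 0 - 32.
    if not 0 <= num_speakers <= 32:
        problems.append("num_speakers must be between 0 and 32")
    # diarization_threshold is only shown when diarize is enabled (0.1 - 0.4).
    if diarize and not 0.1 <= diarization_threshold <= 0.4:
        problems.append("diarization_threshold must be between 0.1 and 0.4")
    # Documented range for temperature: 0.0 - 2.0.
    if not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")
    return problems
```

Calling `check_stt_settings(diarize=True, num_speakers=2)` reports the conflict described in the note, while the same speaker count with `diarize=False` passes.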

## Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| `text` | The transcribed text from the audio. | STRING |
| `language_code` | The detected language code of the audio. | STRING |
| `words_json` | A JSON-formatted string containing detailed word-level information, including timestamps and speaker labels if enabled. | STRING |
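
Because `words_json` is a string, downstream code has to parse it before using the timestamps or speaker labels. The sketch below groups consecutive entries by speaker. It assumes the string decodes to a list of objects with `text` and `speaker_id` keys; this page does not specify the schema, so those key names, the sample data, and the `group_by_speaker` helper are all illustrative assumptions that should be checked against real node output.

```python
import json


def group_by_speaker(words_json: str) -> list[tuple[str, str]]:
    """Merge consecutive entries from the same speaker into one turn.

    Assumes each entry is a dict with "text" and (optionally) "speaker_id";
    these key names are an assumption, not confirmed by the node docs.
    """
    turns: list[list[str]] = []
    for entry in json.loads(words_json):
        speaker = entry.get("speaker_id") or "unknown"
        text = entry.get("text", "")
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += text  # same speaker: extend the current turn
        else:
            turns.append([speaker, text])  # speaker changed: start a new turn
    return [(speaker, text.strip()) for speaker, text in turns]


# Hypothetical sample shaped like the assumed schema.
sample_words_json = json.dumps([
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.4, "end": 0.5, "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.5, "end": 0.9, "speaker_id": "speaker_0"},
    {"text": "Hi", "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
])

for speaker, text in group_by_speaker(sample_words_json):
    print(f"{speaker}: {text}")
```

When `diarize` is disabled, entries may carry no speaker label at all; in that case this sketch places everything under a single `"unknown"` turn.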

