---
name: local-llm-stack
description: "Discover, download, run, serve, and evaluate local LLMs across the full open-model stack."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
  hermes:
    tags: [local-llm, inference, serving, evaluation, HuggingFace, llama.cpp, vLLM, Hermes]
---

# Local LLM Stack

Use this skill when the user wants to work with local or self-hosted models instead of a hosted API.

This covers the full lifecycle:

1. find the model
2. download or mirror it
3. choose the runtime
4. serve or run it locally
5. benchmark it
6. wire it into Hermes if needed

## Model discovery and download

Use Hugging Face when you need to search the ecosystem, inspect repo contents, or pull weights and datasets.

Typical tasks:

- search for candidate models or datasets
- inspect repo trees for exact filenames
- download checkpoints or GGUF files
- mirror files into a local cache or app directory
- upload artifacts back to the Hub

Keep filenames exact when a runtime needs them.

## Choose the runtime

### `llama.cpp`

Best for:

- CPU inference
- Apple Silicon
- small-footprint local serving
- GGUF workflows
- edge and single-user use cases

Use it when you need a compact, local, URL-first workflow and the model is already available in GGUF form.

### `vLLM`

Best for:

- high-throughput serving
- OpenAI-compatible endpoints
- batched inference
- multi-user or production-style deployments
- quantized server deployments

Use it when serving matters more than local simplicity.

### Hermes-local OpenAI-compatible providers

Best for:

- pointing Hermes at a local model endpoint
- keeping a separate local profile
- verifying the endpoint at the CLI, API, and Hermes layers

Use a dedicated profile rather than mutating your default profile in place.

## Evaluation and comparison

Use lm-eval-harness when you need to compare models or track a model over time.

Good fits:

- benchmark a candidate before deployment
- compare quantization variants
- track training checkpoints
- report standard scores

## Practical workflow

A strong local-LLM workflow usually looks like this:

1. choose the model family and size
2. decide whether the file format is raw weights, quantized weights, or GGUF
3. pick the runtime that matches the hardware
4. verify the endpoint with a tiny prompt
5. benchmark if the model will be reused
6. integrate it into Hermes only after the endpoint is stable

## Pitfalls

- Do not assume the same model name implies the same file format.
- Distinguish serving performance from model quality.
- Check RAM or VRAM fit before downloading a large artifact.
- Keep local model config isolated from your default Hermes profile.
- For large downloads, prefer resumable transfers or background execution.

## Good output

When using this skill, report:

- model name and format
- chosen runtime
- endpoint or command used
- verification result
- benchmark result if run
