API Reference

The host backend is a FastAPI application that auto-generates interactive API documentation from its route definitions and Pydantic schemas.

Interactive docs

When the backend is running, the following endpoints serve live, browsable documentation:

URL	Format
`http://<host>:8000/docs`	Swagger UI — interactive request builder with try-it-out.
`http://<host>:8000/redoc`	ReDoc — clean read-only reference.
`http://<host>:8000/openapi.json`	Raw OpenAPI 3.x spec (JSON).

Behind a reverse proxy, the docs are served at `<base-path>/api/docs`, `<base-path>/api/redoc`, and `<base-path>/api/openapi.json`.
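
The URL layout above can be sketched as a small helper, together with a function that lists the operations found in a downloaded spec. The names `docs_urls` and `list_operations` are hypothetical, and the proxied form assumes plain HTTP on the proxy's default port.

```python
from __future__ import annotations

# HTTP verbs that can appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def docs_urls(host: str, port: int = 8000, base_path: str | None = None) -> dict[str, str]:
    """Return the Swagger UI, ReDoc, and raw-spec URLs for a backend."""
    if base_path is None:
        root = f"http://{host}:{port}"
    else:
        # Assumption: the reverse proxy serves plain HTTP on its default port.
        root = f"http://{host}/{base_path.strip('/')}/api"
    return {name: f"{root}/{name}" for name in ("docs", "redoc", "openapi.json")}


def list_operations(spec: dict) -> list[str]:
    """List 'METHOD path' strings from a parsed OpenAPI document."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items may also carry non-method keys such as "parameters".
            if key in HTTP_METHODS:
                operations.append(f"{key.upper()} {path}")
    return sorted(operations)
```

Fetching the spec itself is one `urllib.request.urlopen(docs_urls(host)["openapi.json"])` call followed by `json.load`; it is left out here so the sketch runs without a live backend.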

Authentication

Gateway endpoints (`/v1/*`) require a valid API key when permanent keys exist; see Gateway → Authentication. All other endpoints (`/api/*`) are unauthenticated, because the tool targets trusted LAN environments.

Health

Method	Path	Description
`GET`	`/health`	Backend health check. Returns `{"status": "ok"}`.
`GET`	`/vllm-version`	Return the latest stable vLLM release tag from GitHub (cached for 1 hour).

Deployments

Prefix: `/api/deployments`

List & inspect

Method	Path	Description
`GET`	`/`	Return all deployments, newest first, with live metrics and pull progress attached.
`GET`	`/served-name/check`	Check whether a served model name is free, with a suggested alternative. Query params: `name`, `exclude_id`.
`GET`	`/{deployment_id}/manifest`	Self-contained reproducibility manifest. Env var values are omitted (keys only).
`GET`	`/{deployment_id}/logs`	Return the last N lines of the deployment's container log. Query param: `tail` (default 200).
`GET`	`/{deployment_id}/logs/download`	Stream the full persisted log file as a downloadable attachment.

Lifecycle

Method	Path	Description
`POST`	`/`	Create a deployment record without launching a container.
`POST`	`/start`	Create a deployment and launch its vLLM container on the target node.
`POST`	`/from-manifest`	Launch a deployment from an exported manifest. The manifest pins model identity; placement comes from the request.
`POST`	`/plan`	Preview which warm models a deploy would auto-offload, before committing. Read-only.
`POST`	`/stop/{deployment_id}`	Stop the deployment's container on its node and mark it as stopped.
`POST`	`/{deployment_id}/restart`	Re-launch a stopped, expired, or errored deployment on its original node.
`POST`	`/{deployment_id}/extend`	Push the serve deadline forward without restarting the model.
`DELETE`	`/{deployment_id}`	Delete the deployment record and remove it from scoped API keys. Does not stop a running container — call stop first.

Warm cache

Method	Path	Description
`POST`	`/{deployment_id}/pin`	Toggle the pin flag, protecting the deployment from warm-cache auto-eviction.
`POST`	`/{deployment_id}/pause`	Pause a running model by offloading it from GPU to RAM via vLLM sleep mode.
`POST`	`/{deployment_id}/resume`	Wake a paused model back to GPU and resume serving.

Request / response schemas

DeploymentStart (POST /start):

Field	Type	Description
`node_id`	`int`	ID of the node to deploy on.
`model_name`	`string`	HuggingFace model identifier (e.g. `meta-llama/Meta-Llama-3-8B`).
`port`	`int`	Port the vLLM server listens on.
`gpu_memory_fraction`	`float`	Fraction of each GPU's memory to allocate (0.0–1.0).
`gpu_ids`	`int[]`	Specific GPU indices to use; omit for automatic selection.
`tensor_parallel_size`	`int`	Number of GPUs for tensor parallelism.
`extra_args`	`string[]`	Additional CLI arguments passed to the vLLM server.
`env_vars`	`object[]`	Environment variables injected into the container (`{key, value}`).
`vllm_version`	`string`	vLLM Docker image tag to use; omit for latest stable.
`engine_args`	`object`	Structured vLLM engine flags (`max_model_len`, `dtype`, `quantization`, etc.).
`lora_modules`	`object[]`	LoRA adapters served alongside the base model (`{name, path}`).
`max_failed_restarts`	`int`	Crash-loop restart threshold; omit to use the cluster default.
`owner`	`string`	User or team launching this deployment (required).
`duration_seconds`	`int`	Serve duration in seconds; omit for indefinite serving.
`pinned`	`bool`	Pin this deployment to prevent warm-cache auto-eviction. Default `false`.
`skip_resource_check`	`bool`	Bypass the client's GPU memory pre-check. Default `false`.
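
As an illustration of the schema above, the following sketch builds a `POST /api/deployments/start` request without sending it. The helper name `build_start_request` is hypothetical, and the choice of which fields to pass is an assumption: the table only marks `owner` as required.

```python
from __future__ import annotations

import json
import urllib.request


def build_start_request(
    base_url: str,
    *,
    node_id: int,
    model_name: str,
    port: int,
    owner: str,
    gpu_memory_fraction: float,
    duration_seconds: int | None = None,
    env_vars: dict[str, str] | None = None,
) -> urllib.request.Request:
    """Build (but do not send) a DeploymentStart request."""
    if not 0.0 <= gpu_memory_fraction <= 1.0:
        raise ValueError("gpu_memory_fraction must be between 0.0 and 1.0")
    body: dict = {
        "node_id": node_id,
        "model_name": model_name,
        "port": port,
        "owner": owner,
        "gpu_memory_fraction": gpu_memory_fraction,
    }
    if duration_seconds is not None:
        # Omitting duration_seconds means indefinite serving.
        body["duration_seconds"] = duration_seconds
    if env_vars:
        # env_vars travel as a list of {key, value} objects.
        body["env_vars"] = [{"key": k, "value": v} for k, v in env_vars.items()]
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/deployments/start",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response is expected to follow the DeploymentRead shape.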

DeploymentRead (response for most deployment endpoints) extends the start fields with:

Field	Type	Description
`id`	`int`	Deployment ID.
`status`	`string`	Lifecycle status: `stopped`, `starting`, `loading`, `running`, `paused_ram`, `stopping`, `error`, `expired`.
`expires_at`	`datetime`	When this deployment will auto-stop; null for indefinite.
`image_digest`	`string`	Docker image digest reported by the client.
`last_error`	`string`	Last failure reason.
`detail`	`string`	Load phase while status is `loading` (downloading / loading_weights / compiling).
`total_prompt_tokens`	`int`	Cumulative prompt tokens processed.
`total_completion_tokens`	`int`	Cumulative completion tokens generated.
`total_requests`	`int`	Cumulative requests served.
`prompt_tps`	`float`	Live prompt tokens/s (per-request, idle-free).
`generation_tps`	`float`	Live generation tokens/s (per-request, idle-free).
`prompt_throughput`	`float`	Engine-wide prompt throughput.
`generation_throughput`	`float`	Engine-wide generation throughput.
`requests_running`	`int`	Requests currently being processed.
`requests_waiting`	`int`	Requests queued.
`pull_percent`	`float`	Image-pull progress percentage (while starting).

DeploymentRestart (POST /{id}/restart):

Field	Type	Description
`owner`	`string`	User or team restarting this deployment.
`duration_seconds`	`int`	New serve duration in seconds; omit for indefinite.

DeploymentExtend (POST /{id}/extend):

Field	Type	Description
`hours`	`float`	Hours to add to the deadline (max 336). Mutually exclusive with `infinite`.
`infinite`	`bool`	Drop the expiry entirely. Mutually exclusive with `hours`.
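
The mutual exclusivity of `hours` and `infinite` can be checked on the caller's side before the request is sent. This is a minimal sketch with a hypothetical helper name; the lower bound of `hours > 0` is an assumption, since the table only states the maximum of 336.

```python
from __future__ import annotations


def build_extend_body(hours: float | None = None, infinite: bool = False) -> dict:
    """Build a DeploymentExtend body, enforcing that exactly one option is used."""
    if infinite and hours is not None:
        raise ValueError("hours and infinite are mutually exclusive")
    if infinite:
        return {"infinite": True}
    if hours is None:
        raise ValueError("provide either hours or infinite=True")
    # Assumption: hours must be positive; the documented maximum is 336.
    if not 0 < hours <= 336:
        raise ValueError("hours must be in (0, 336]")
    return {"hours": float(hours)}
```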

DeploymentPin (POST /{id}/pin):

Field	Type	Description
`pinned`	`bool`	Whether the deployment is protected from warm-cache auto-eviction.

DeploymentPause (POST /{id}/pause):

Field	Type	Description
`tier`	`string`	Target tier: `ram`. Omit for automatic selection.

DeploymentPlanRequest (POST /plan):

Field	Type	Description
`node_id`	`int`	Target node.
`model_name`	`string`	Model to deploy.
`port`	`int`	Target port.
`gpu_memory_fraction`	`float`	GPU memory fraction.
`gpu_ids`	`int[]`	Specific GPU indices.

DeploymentPlanRead (response):

Field	Type	Description
`fits`	`bool`	Whether the deploy fits (possibly after offloads).
`warm_enabled`	`bool`	`false` when warm-cache auto-offload is disabled on the node.
`would_offload`	`OffloadItem[]`	Models that would be evicted.
`blocked_reason`	`string`	Why no plan can make room (when `fits` is false).
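
A caller might turn a DeploymentPlanRead response into a short decision summary before committing to a deploy. The sketch below uses a hypothetical `describe_plan` helper and only counts the entries of `would_offload`, because the fields of `OffloadItem` are not listed here.

```python
def describe_plan(plan: dict) -> str:
    """Summarize a DeploymentPlanRead response in one line."""
    if not plan.get("fits", False):
        reason = plan.get("blocked_reason") or "no reason given"
        return f"blocked: {reason}"
    offloads = plan.get("would_offload") or []
    if offloads:
        return f"fits after offloading {len(offloads)} warm model(s)"
    return "fits without offloading"
```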

DeploymentFromManifest (POST /from-manifest):

Field	Type	Description
`manifest`	`object`	Exported manifest (from `GET /{id}/manifest`).
`node_id`	`int`	Target node.
`port`	`int`	Target port.
`owner`	`string`	User or team.
`duration_seconds`	`int`	Serve duration; omit for indefinite.
`env_vars`	`object[]`	Env var values (never travel in manifests; re-supply here).
`skip_resource_check`	`bool`	Bypass GPU memory pre-check.

Nodes

Prefix: `/api/nodes`

List & manage

Method	Path	Description
`GET`	`/`	List all registered nodes with GPU metrics, disk usage, and status.
`POST`	`/`	Register a new node manually (nodes normally register via Consul).
`DELETE`	`/{node_id}`	Remove a node and its deployment/metric records. Live nodes re-register via Consul within seconds.
`GET`	`/discovered`	Return nodes discovered via Consul that haven't been registered yet.

Configuration

Method	Path	Description
`POST`	`/{node_id}/runtime`	Set the node's container runtime override (`docker` / `podman` / `null` for auto).
`POST`	`/{node_id}/warm-cache`	Enable/disable warm-cache auto-offload and set the RAM-cache budget.
`POST`	`/{node_id}/maintenance`	Cordon/uncordon specific GPUs (or all); optionally drain affected deployments.

Monitoring

Method	Path	Description
`GET`	`/{node_id}/metrics/history`	Metric samples for the last N minutes, downsampled by keeping every `step`-th row. Query params: `minutes` (default 60), `step` (default 1).
`GET`	`/{node_id}/ports/check`	Check whether a port is available on the node. Query param: `port`.
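
To make the `minutes` and `step` parameters of the metrics history endpoint concrete, here is a sketch that builds the query URL and mimics the downsampling locally. Both helper names are hypothetical, and whether the server keeps the first row of each group of `step` rows (as the slice below does) is an assumption.

```python
from urllib.parse import urlencode


def metrics_history_url(base_url: str, node_id: int, minutes: int = 60, step: int = 1) -> str:
    """Build the metrics-history URL for one node."""
    if minutes < 1 or step < 1:
        raise ValueError("minutes and step must be positive")
    query = urlencode({"minutes": minutes, "step": step})
    return f"{base_url.rstrip('/')}/api/nodes/{node_id}/metrics/history?{query}"


def downsample(rows: list, step: int) -> list:
    """Keep every step-th row, starting with the first one (assumed offset)."""
    if step < 1:
        raise ValueError("step must be positive")
    return rows[::step]
```

With `step=4`, ten samples shrink to three, which keeps long time windows cheap to plot.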

Containers & processes

Method	Path	Description
`GET`	`/{node_id}/containers`	List vLLM containers running on the node.
`POST`	`/{node_id}/containers/{container_id}/stop`	Stop and remove a container on the node.
`GET`	`/{node_id}/gpu-processes`	List GPU processes on the node.
`POST`	`/{node_id}/gpu-processes/{pid}/kill`	Kill a GPU process on the node.

Container images

Method	Path	Description
`GET`	`/{node_id}/images`	List container images on the node.
`DELETE`	`/{node_id}/images/{image_id}`	Delete a container image from the node.
`POST`	`/{node_id}/images/prune`	Prune unused container images on the node.

Warm-cache artifacts

Method	Path	Description
`GET`	`/{node_id}/warm-artifacts`	List orphaned warm-cache artifacts (RAM sleepers, disk caches).
`POST`	`/{node_id}/warm-artifacts/sleepers/{pid}/kill`	Kill an orphaned RAM sleeper process.
`DELETE`	`/{node_id}/warm-artifacts/caches/{name}`	Delete an orphaned warm-cache compile artifact.

Model cache (HuggingFace)

Method	Path	Description
`GET`	`/{node_id}/models/cache`	List HuggingFace model cache entries on the node.
`DELETE`	`/{node_id}/models/cache/{name}`	Delete a cached model from the node's HuggingFace cache.

Local models

Method	Path	Description
`GET`	`/{node_id}/local-models`	List models in the node's local model directory.
`DELETE`	`/{node_id}/local-models/{name}`	Delete a model from local storage.
`POST`	`/{node_id}/local-models/pull`	Pull a model from a URL to the node's local storage.
`GET`	`/{node_id}/local-models/transfers`	List active download/upload transfers on the node.

Streamed upload (multi-file):

Method	Path	Description
`POST`	`/{node_id}/local-models/upload/begin`	Start a streamed multi-file model upload session.
`PUT`	`/{node_id}/local-models/upload/{session_id}/file`	Upload a single file within an active upload session. Query param: `path`.
`POST`	`/{node_id}/local-models/upload/{session_id}/finish`	Finalize a model upload session.
`POST`	`/{node_id}/local-models/upload/{session_id}/abort`	Abort and clean up an upload session.
`POST`	`/{node_id}/local-models/archive`	Upload a model as a single archive (tar/zip). Query params: `name`, `filename`.
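
The streamed upload is a begin, file, finish sequence with abort as the cleanup path. The sketch below shows that control flow against an injected `send(method, path, params, body)` callable, which is hypothetical. The `{"name": ...}` body for `begin` and the `session_id` field in its response are also assumptions, so check the live schema in `/docs` before relying on them.

```python
from typing import Callable, Mapping, Optional

# send(method, path, query_params, body) -> parsed JSON response (hypothetical transport).
Send = Callable[[str, str, Optional[dict], object], dict]


def upload_model(send: Send, node_id: int, name: str, files: Mapping[str, bytes]) -> dict:
    """Upload several files in one session; abort the session if any step fails."""
    prefix = f"/api/nodes/{node_id}/local-models/upload"
    # Assumption: begin accepts the model name and returns a session_id.
    session = send("POST", f"{prefix}/begin", None, {"name": name})
    session_id = session["session_id"]
    try:
        for relative_path, data in files.items():
            # Each file is sent separately; its destination goes in the `path` query param.
            send("PUT", f"{prefix}/{session_id}/file", {"path": relative_path}, data)
        return send("POST", f"{prefix}/{session_id}/finish", None, None)
    except Exception:
        # Clean up the partial upload, then surface the original error.
        send("POST", f"{prefix}/{session_id}/abort", None, None)
        raise
```

Injecting the transport keeps the sequence testable without a node; a real `send` would wrap an HTTP client of your choice.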

Packages

Method	Path	Description
`POST`	`/{node_id}/packages/upload`	Upload a pip package wheel to the node's package cache.
`GET`	`/{node_id}/packages`	List pip packages in the node's package cache.

Request / response schemas

NodeRead (response):

Field	Type	Description
`id`	`int`	Node ID.
`hostname`	`string`	Machine hostname.
`ip_address`	`string`	Reachable IP address of the node agent.
`port`	`int`	Agent HTTP port.
`status`	`string`	Health status: `online`, `degraded`, `critical`, `maintenance`, `unknown`.
`gpu_usage`	`object[]`	Per-GPU utilization and memory stats from the latest scrape.
`disk_usage`	`object`	Disk usage breakdown (`total_gb`, `free_gb`, `hf_cache_gb`).
`maintenance`	`bool`	True when all GPUs are under maintenance.
`partial_maintenance`	`bool`	True when some (but not all) GPUs are under maintenance.
`maintenance_gpus`	`int[]`	GPU indices currently under maintenance.
`warm_offload_enabled`	`bool`	Whether warm-cache auto-offload is enabled.
`ram_cache_limit_mb`	`int`	CPU RAM budget in MB for warm-cached models; null = unlimited.
`ram_cache_used_mb`	`float`	CPU RAM currently held by paused models.
`available_runtimes`	`string[]`	Container runtimes detected (`docker` / `podman`).
`container_runtime`	`string`	Per-node runtime override; null = auto.
`rogue_container_count`	`int`	Untracked vLLM containers detected.
`rogue_process_count`	`int`	Orphaned GPU processes with no live container.
`rogue_artifact_count`	`int`	Orphaned warm-cache artifacts (RAM + disk).

NodeMaintenanceRequest (POST /{id}/maintenance):

Field	Type	Description
`gpu_ids`	`int[]`	GPU indices to cordon/uncordon; empty = all GPUs.
`enabled`	`bool`	`true` to cordon, `false` to uncordon.
`drain`	`bool`	Stop deployments on the cordoned GPUs. Default `false`.

NodeWarmCacheRequest (POST /{id}/warm-cache):

Field	Type	Description
`enabled`	`bool`	Enable or disable warm-cache auto-offload.
`ram_cache_limit_mb`	`int`	RAM budget in MB; null = unlimited.

Configs

Prefix: `/api/configs`

Method	Path	Description
`GET`	`/`	List all saved deployment configurations.
`POST`	`/`	Save a deployment configuration for later reuse.
`DELETE`	`/{config_id}`	Delete a saved configuration.

DeploymentConfigCreate (POST /):

Field	Type	Description
`name`	`string`	Unique name for this saved configuration.
`payload`	`object`	Deployment parameters snapshot (model_name, port, gpu_memory_fraction, etc.).

Settings

Prefix: `/api/settings`

Method	Path	Description
`GET`	`/`	Return all current runtime settings with their effective values.
`PUT`	`/`	Apply a partial update to runtime settings; only provided fields are changed.

RuntimeSettingsUpdate (PUT /):

All fields are optional — only provided fields are applied.

Field	Type	Default	Description
`gateway_enabled`	`bool`	`true`	Enable or disable the OpenAI-compatible /v1 gateway.
`gateway_timeout_seconds`	`int`	`600`	Non-streaming request timeout in seconds (10–86400).
`start_timeout_seconds`	`int`	`1800`	Mark deployment errored if stuck starting this long (60–86400).
`preferred_container_runtime`	`string`	`docker`	Preferred runtime when a node has both Docker and Podman.
`default_port`	`int`	`8001`	Default port pre-filled in the deploy form (1024–65535).
`default_gpu_fraction`	`float`	`0.5`	Default GPU memory fraction (0.05–1.0).
`default_duration_choice`	`string`	`43200`	Default serve-duration choice for the deploy form (seconds).
`default_vllm_version`	`string`	`""`	Default vLLM version pre-filled in the deploy form.
`default_max_failed_restarts`	`int`	—	Crash-loop threshold for new deployments (1–20); null = client default.
`webhook_url`	`string`	`""`	Webhook URL for deployment lifecycle notifications.
`expiry_warning_minutes`	`int`	`30`	Warn this many minutes before a deployment expires (1–1440).
`node_metrics_retention_hours`	`int`	`48`	Hours of node metric history to retain (1–8760).
`temp_api_key_ttl_seconds`	`int`	`300`	Temporary API key lifespan in seconds (0–3600).
`default_warm_offload_enabled`	`bool`	`true`	Enable warm cache by default on newly discovered nodes.
`busy_guard_seconds`	`int`	`0`	Seconds after last request before a model can be auto-evicted (0–300).
`nodes_sync_interval_seconds`	`int`	`10`	Node sync loop interval (1–300).
`deployments_sync_interval_seconds`	`int`	`5`	Deployment sync loop interval (1–300).
`expiry_check_interval_seconds`	`int`	`30`	Expiry check loop interval (5–600).
`node_failure_threshold`	`int`	`3`	Consecutive failures before a node turns critical (1–20).
`deployment_failure_threshold`	`int`	`3`	Unreachable polls before a deployment degrades (1–20).
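
Because `PUT /api/settings/` applies only the fields it receives, a client can send a small dictionary of changes. The sketch below pre-validates a few of the documented ranges before building that body. The `RANGES` table is a hand-copied subset of the table above, and `build_settings_update` is a hypothetical helper, not part of the backend.

```python
# Inclusive (min, max) ranges copied from the settings table; a subset for illustration.
RANGES = {
    "gateway_timeout_seconds": (10, 86400),
    "start_timeout_seconds": (60, 86400),
    "default_port": (1024, 65535),
    "default_gpu_fraction": (0.05, 1.0),
    "expiry_warning_minutes": (1, 1440),
    "busy_guard_seconds": (0, 300),
}


def build_settings_update(**changes) -> dict:
    """Return a partial-update body containing only the provided fields."""
    for key, value in changes.items():
        if key in RANGES:
            low, high = RANGES[key]
            if not low <= value <= high:
                raise ValueError(f"{key} must be between {low} and {high}")
    # Fields that are not passed are simply absent, so the server leaves them unchanged.
    return dict(changes)
```

The server remains the authority on validation; the local check only avoids a round trip for obviously out-of-range values.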

API Keys

Prefix: `/api/api-keys`

Method	Path	Description
`POST`	`/`	Create a new API key (permanent or temporary).
`GET`	`/`	List all active (non-expired) API keys.
`PATCH`	`/{key_id}`	Update a permanent key's label or deployment scope.
`DELETE`	`/{key_id}`	Revoke and delete an API key.

CreateApiKeyRequest (POST /):

Field	Type	Description
`label`	`string`	Human-readable name for this key (1–128 chars).
`ttl_seconds`	`int`	Key lifetime in seconds (1–86400); omit for a permanent key.
`deployment_ids`	`int[]`	Restrict this key to specific deployment IDs. Null = all deployments. Required for temporary keys.
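
The difference between permanent and temporary keys comes down to `ttl_seconds` and the scope rule. The following sketch, using a hypothetical `build_create_key_body` helper, encodes the constraints from the table above on the client side.

```python
from __future__ import annotations


def build_create_key_body(
    label: str,
    ttl_seconds: int | None = None,
    deployment_ids: list[int] | None = None,
) -> dict:
    """Build a CreateApiKeyRequest body for a permanent or temporary key."""
    if not 1 <= len(label) <= 128:
        raise ValueError("label must be 1-128 characters")
    # None serializes to JSON null, which means "all deployments".
    body: dict = {"label": label, "deployment_ids": deployment_ids}
    if ttl_seconds is not None:
        if not 1 <= ttl_seconds <= 86400:
            raise ValueError("ttl_seconds must be between 1 and 86400")
        if deployment_ids is None:
            raise ValueError("temporary keys must be scoped to deployment_ids")
        body["ttl_seconds"] = ttl_seconds
    return body
```

Leaving out `ttl_seconds` yields a permanent key; supplying it yields a temporary key that must carry an explicit scope.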

UpdateApiKeyRequest (PATCH /{key_id}):

Field	Type	Description
`label`	`string`	Updated label.
`deployment_ids`	`int[]`	Updated deployment scope. Null = all deployments.

Gateway

Prefix: `/v1`

All gateway endpoints require a Bearer API key when permanent keys exist. See Gateway → Authentication.

Method	Path	Description
`POST`	`/chat/completions`	Proxy an OpenAI-compatible chat completion request to the appropriate vLLM instance.
`POST`	`/completions`	Proxy an OpenAI-compatible text completion request.
`POST`	`/embeddings`	Proxy an embeddings request.
`GET`	`/models`	Return available models in OpenAI `/v1/models` format, filtered by API key scope.

The gateway routes by the `model` field in the request body. Streaming is supported via `"stream": true`. Error responses follow the OpenAI error format:

{"error": {"message": "...", "type": "...", "param": null, "code": "..."}}

Code	Meaning
`model_not_found`	No running deployment serves this model name.
`model_ambiguous`	Multiple deployments match — use a specific `served_model_name`.
`model_loading`	Model is still starting up; retry shortly.
`deployment_unreachable`	The node hosting this model is not responding.
`missing_model`	Request body is missing the required `model` field.
`invalid_json`	Request body is not valid JSON.
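
A client of the gateway typically needs two things: a request with a Bearer key, and a way to read the error envelope. The sketch below builds a chat completion request without sending it and parses an error body. Both helper names are hypothetical, and treating only `model_loading` as retryable is an assumption based on the "retry shortly" note in the table.

```python
from __future__ import annotations

import json
import urllib.request

# Assumption: only model_loading is worth an automatic retry.
RETRYABLE_CODES = {"model_loading"}


def build_chat_request(base_url: str, api_key: str, model: str, messages: list[dict],
                       stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def parse_gateway_error(raw: str) -> dict | None:
    """Extract code, message, and a retry hint from an error body; None if not an error."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return None
    code = error.get("code")
    return {"code": code, "message": error.get("message", ""), "retry": code in RETRYABLE_CODES}
```

Since the gateway routes on `model`, the value passed here has to match a served model name; `model_ambiguous` signals that a more specific name is needed.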

Admin

Prefix: `/api/admin`

Method	Path	Description
`POST`	`/purge`	Delete database records by category; running containers are untouched.

PurgeRequest (POST /purge):

Field	Type	Description
`targets`	`string[]`	Categories to purge: `deployments`, `nodes`, `metrics`, `configs`. Omit for all. Purging `nodes` automatically includes `deployments` and `metrics`.
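
The implicit expansion of `targets` is easy to overlook, so it can help to preview what a purge will actually remove. The sketch below mirrors the rule described above on the client side. The helper name is hypothetical, and the server performs the real expansion.

```python
from __future__ import annotations

ALL_TARGETS = ("deployments", "nodes", "metrics", "configs")


def effective_purge_targets(targets: list[str] | None = None) -> set[str]:
    """Return the categories a PurgeRequest would affect."""
    if targets is None:
        # Omitting targets purges every category.
        return set(ALL_TARGETS)
    unknown = set(targets) - set(ALL_TARGETS)
    if unknown:
        raise ValueError(f"unknown purge targets: {sorted(unknown)}")
    result = set(targets)
    if "nodes" in result:
        # Purging nodes automatically includes deployments and metrics.
        result |= {"deployments", "metrics"}
    return result
```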

WebSocket

Method	Path	Description
`WebSocket`	`/ws`	Accept a WebSocket connection for live state-change notifications.

The backend pushes JSON messages to notify clients of state changes. See Architecture → WebSocket events for the message types and payload format.