API Reference
The host backend is a FastAPI application that auto-generates interactive API documentation from its route definitions and Pydantic schemas.
Interactive docs
When the backend is running, the following endpoints serve live, browsable documentation:
| URL | Format |
| --- | --- |
| http://<host>:8000/docs | Swagger UI: interactive request builder with try-it-out. |
| http://<host>:8000/redoc | ReDoc: clean read-only reference. |
| http://<host>:8000/openapi.json | Raw OpenAPI 3.x spec (JSON). |
Behind a reverse proxy, the docs are at <base-path>/api/docs, <base-path>/api/redoc, and <base-path>/api/openapi.json.
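To make the two URL layouts concrete, here is a minimal sketch that derives the three documentation URLs for a direct or a proxied backend. The helper name `docs_urls` and its parameters are hypothetical; only the paths themselves come from this page.

```python
def docs_urls(base: str, proxied: bool = False) -> dict[str, str]:
    """Build the three documentation URLs for a backend.

    base    -- e.g. "http://gpu-host:8000" (direct) or "https://example.org/llm" (proxy)
    proxied -- True when the backend is reached through a reverse proxy
    """
    root = base.rstrip("/")
    if proxied:
        # Behind a proxy the docs live under <base-path>/api/...
        root += "/api"
    return {
        "swagger": f"{root}/docs",
        "redoc": f"{root}/redoc",
        "openapi": f"{root}/openapi.json",
    }
```

For example, `docs_urls("http://gpu-host:8000")["swagger"]` yields the Swagger UI address for a direct connection.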
Authentication
Gateway endpoints (/v1/*) require a valid API key once at least one permanent key exists; see Gateway → Authentication. All other endpoints (/api/*) are unauthenticated, because the tool targets trusted LAN environments.
Health
| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Backend health check. Returns {"status": "ok"}. |
| GET | /vllm-version | Return the latest stable vLLM release tag from GitHub (cached for 1 hour). |
Deployments
Prefix: /api/deployments
List & inspect
| Method | Path | Description |
| --- | --- | --- |
| GET | / | Return all deployments, newest first, with live metrics and pull progress attached. |
| GET | /served-name/check | Check whether a served model name is free, with a suggested alternative. Query params: name, exclude_id. |
| GET | /{deployment_id}/manifest | Self-contained reproducibility manifest. Env var values are omitted (keys only). |
| GET | /{deployment_id}/logs | Return the last N lines of the deployment's container log. Query param: tail (default 200). |
| GET | /{deployment_id}/logs/download | Stream the full persisted log file as a downloadable attachment. |
Lifecycle
| Method | Path | Description |
| --- | --- | --- |
| POST | / | Create a deployment record without launching a container. |
| POST | /start | Create a deployment and launch its vLLM container on the target node. |
| POST | /from-manifest | Launch a deployment from an exported manifest. The manifest pins model identity; placement comes from the request. |
| POST | /plan | Preview which warm models a deploy would auto-offload, before committing. Read-only. |
| POST | /stop/{deployment_id} | Stop the deployment's container on its node and mark it as stopped. |
| POST | /{deployment_id}/restart | Re-launch a stopped, expired, or errored deployment on its original node. |
| POST | /{deployment_id}/extend | Push the serve deadline forward without restarting the model. |
| DELETE | /{deployment_id} | Delete the deployment record and remove it from scoped API keys. Does not stop a running container; call stop first. |
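The lifecycle paths are not uniform: stop puts the ID last, while restart and extend put it first. The sketch below maps action names to method and path so callers do not have to remember the difference, and encodes the stop-before-delete order noted above. The helper names are hypothetical; the routes are copied from the table.

```python
PREFIX = "/api/deployments"

# Method and path template per lifecycle action, copied from the table above.
LIFECYCLE_ROUTES = {
    "stop": ("POST", "/stop/{id}"),
    "restart": ("POST", "/{id}/restart"),
    "extend": ("POST", "/{id}/extend"),
    "delete": ("DELETE", "/{id}"),
}


def lifecycle_request(action: str, deployment_id: int) -> tuple[str, str]:
    """Return (HTTP method, full path) for a lifecycle action."""
    if action not in LIFECYCLE_ROUTES:
        raise ValueError(f"unknown lifecycle action: {action!r}")
    method, template = LIFECYCLE_ROUTES[action]
    return method, PREFIX + template.format(id=deployment_id)


def teardown_sequence(deployment_id: int) -> list[tuple[str, str]]:
    """DELETE does not stop a running container, so stop first, then delete."""
    return [
        lifecycle_request("stop", deployment_id),
        lifecycle_request("delete", deployment_id),
    ]
```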
Warm cache
| Method | Path | Description |
| --- | --- | --- |
| POST | /{deployment_id}/pin | Toggle the pin flag, protecting the deployment from warm-cache auto-eviction. |
| POST | /{deployment_id}/pause | Pause a running model by offloading it from GPU to RAM via vLLM sleep mode. |
| POST | /{deployment_id}/resume | Wake a paused model back to GPU and resume serving. |
Request / response schemas
DeploymentStart (POST /start):
| Field | Type | Description |
| --- | --- | --- |
| node_id | int | ID of the node to deploy on. |
| model_name | string | HuggingFace model identifier (e.g. meta-llama/Meta-Llama-3-8B). |
| port | int | Port the vLLM server listens on. |
| gpu_memory_fraction | float | Fraction of each GPU's memory to allocate (0.0–1.0). |
| gpu_ids | int[] | Specific GPU indices to use; omit for automatic selection. |
| tensor_parallel_size | int | Number of GPUs for tensor parallelism. |
| extra_args | string[] | Additional CLI arguments passed to the vLLM server. |
| env_vars | object[] | Environment variables injected into the container ({key, value}). |
| vllm_version | string | vLLM Docker image tag to use; omit for latest stable. |
| engine_args | object | Structured vLLM engine flags (max_model_len, dtype, quantization, etc.). |
| lora_modules | object[] | LoRA adapters served alongside the base model ({name, path}). |
| max_failed_restarts | int | Crash-loop restart threshold; omit to use the cluster default. |
| owner | string | User or team launching this deployment (required). |
| duration_seconds | int | Serve duration in seconds; omit for indefinite serving. |
| pinned | bool | Pin this deployment to prevent warm-cache auto-eviction. Default false. |
| skip_resource_check | bool | Bypass the client's GPU memory pre-check. Default false. |
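As an illustration of how a DeploymentStart body might be assembled, the sketch below builds the JSON-ready dict and leaves out any optional field set to None, since several fields are documented as "omit for" a default. The helper `build_start_payload` is hypothetical, and treating node_id, model_name, and port as mandatory is an assumption; only owner is explicitly marked required above.

```python
def build_start_payload(node_id, model_name, port, owner, **optional):
    """Assemble a DeploymentStart body, dropping optional fields left as None."""
    if not owner:
        raise ValueError("owner is required")
    fraction = optional.get("gpu_memory_fraction")
    if fraction is not None and not 0.0 <= fraction <= 1.0:
        raise ValueError("gpu_memory_fraction must be within 0.0-1.0")
    payload = {
        "node_id": node_id,
        "model_name": model_name,
        "port": port,
        "owner": owner,
    }
    # Omitted fields fall back to the documented defaults (automatic GPU
    # selection, latest stable vLLM, indefinite serving), so None is not sent.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

The resulting dict can be serialized with `json.dumps` and sent as the body of POST /api/deployments/start.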
DeploymentRead (response for most deployment endpoints) extends the start fields with:
| Field | Type | Description |
| --- | --- | --- |
| id | int | Deployment ID. |
| status | string | Lifecycle status: stopped, starting, loading, running, paused_ram, stopping, error, expired. |
| expires_at | datetime | When this deployment will auto-stop; null for indefinite. |
| image_digest | string | Docker image digest reported by the client. |
| last_error | string | Last failure reason. |
| detail | string | Load phase while status is loading (downloading / loading_weights / compiling). |
| total_prompt_tokens | int | Cumulative prompt tokens processed. |
| total_completion_tokens | int | Cumulative completion tokens generated. |
| total_requests | int | Cumulative requests served. |
| prompt_tps | float | Live prompt tokens/s (per-request, idle-free). |
| generation_tps | float | Live generation tokens/s (per-request, idle-free). |
| prompt_throughput | float | Engine-wide prompt throughput. |
| generation_throughput | float | Engine-wide generation throughput. |
| requests_running | int | Requests currently being processed. |
| requests_waiting | int | Requests queued. |
| pull_percent | float | Image-pull progress percentage (while starting). |
DeploymentRestart (POST /{id}/restart):
| Field | Type | Description |
| --- | --- | --- |
| owner | string | User or team restarting this deployment. |
| duration_seconds | int | New serve duration in seconds; omit for indefinite. |
DeploymentExtend (POST /{id}/extend):
| Field | Type | Description |
| --- | --- | --- |
| hours | float | Hours to add to the deadline (max 336). Mutually exclusive with infinite. |
| infinite | bool | Drop the expiry entirely. Mutually exclusive with hours. |
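The mutual-exclusion rule and the 336-hour cap can be checked client-side before sending the request. The following is a sketch of such a check, not the server's validation logic: the helper name is hypothetical, and both the rejection of an empty request and the lower bound of zero are assumptions.

```python
MAX_EXTEND_HOURS = 336  # upper bound stated in the table above


def build_extend_payload(hours=None, infinite=False):
    """Build a DeploymentExtend body with exactly one of hours / infinite."""
    if infinite and hours is not None:
        raise ValueError("hours and infinite are mutually exclusive")
    if infinite:
        return {"infinite": True}
    if hours is None:
        # Assumption: a request with neither field is treated as invalid here.
        raise ValueError("provide either hours or infinite")
    if not 0 < hours <= MAX_EXTEND_HOURS:
        # Only the maximum is documented; the lower bound is an assumption.
        raise ValueError(f"hours must be greater than 0 and at most {MAX_EXTEND_HOURS}")
    return {"hours": float(hours)}
```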
DeploymentPin (POST /{id}/pin):
| Field | Type | Description |
| --- | --- | --- |
| pinned | bool | Whether the deployment is protected from warm-cache auto-eviction. |
DeploymentPause (POST /{id}/pause):
| Field | Type | Description |
| --- | --- | --- |
| tier | string | Target tier: ram. Omit for automatic selection. |
DeploymentPlanRequest (POST /plan):
| Field | Type | Description |
| --- | --- | --- |
| node_id | int | Target node. |
| model_name | string | Model to deploy. |
| port | int | Target port. |
| gpu_memory_fraction | float | GPU memory fraction. |
| gpu_ids | int[] | Specific GPU indices. |
DeploymentPlanRead (response):
| Field | Type | Description |
| --- | --- | --- |
| fits | bool | Whether the deploy fits (possibly after offloads). |
| warm_enabled | bool | False when the node has warm-offload off. |
| would_offload | OffloadItem[] | Models that would be evicted. |
| blocked_reason | string | Why no plan can make room (when fits is false). |
DeploymentFromManifest (POST /from-manifest):
| Field | Type | Description |
| --- | --- | --- |
| manifest | object | Exported manifest (from GET /{id}/manifest). |
| node_id | int | Target node. |
| port | int | Target port. |
| owner | string | User or team. |
| duration_seconds | int | Serve duration; omit for indefinite. |
| env_vars | object[] | Env var values (never travel in manifests; re-supply here). |
| skip_resource_check | bool | Bypass GPU memory pre-check. |
Nodes
Prefix: /api/nodes
List & manage
| Method | Path | Description |
| --- | --- | --- |
| GET | / | List all registered nodes with GPU metrics, disk usage, and status. |
| POST | / | Register a new node manually (nodes normally register via Consul). |
| DELETE | /{node_id} | Remove a node and its deployment/metric records. Live nodes re-register via Consul within seconds. |
| GET | /discovered | Return nodes discovered via Consul that haven't been registered yet. |
Configuration
| Method | Path | Description |
| --- | --- | --- |
| POST | /{node_id}/runtime | Set the node's container runtime override (docker / podman / null for auto). |
| POST | /{node_id}/warm-cache | Enable/disable warm-cache auto-offload and set the RAM-cache budget. |
| POST | /{node_id}/maintenance | Cordon/uncordon specific GPUs (or all); optionally drain affected deployments. |
Monitoring
| Method | Path | Description |
| --- | --- | --- |
| GET | /{node_id}/metrics/history | Metric samples for the last N minutes, downsampled to every step-th row. Query params: minutes (default 60), step (default 1). |
| GET | /{node_id}/ports/check | Check whether a port is available on the node. Query param: port. |
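To show what the minutes and step parameters of the history endpoint mean in practice, the sketch below builds the query URL and mimics "every step-th row" downsampling on a local list. Both helper names are hypothetical, and which row the server keeps first is an assumption (the first one, here).

```python
from urllib.parse import urlencode


def history_url(node_id: int, minutes: int = 60, step: int = 1) -> str:
    """Build the metrics-history path with its documented query parameters."""
    query = urlencode({"minutes": minutes, "step": step})
    return f"/api/nodes/{node_id}/metrics/history?{query}"


def downsample(rows: list, step: int) -> list:
    """Keep every step-th row, starting from the first (assumed offset)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return rows[::step]
```

With step=3, ten samples reduce to four, which is useful when charting long windows.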
Containers & processes
| Method | Path | Description |
| --- | --- | --- |
| GET | /{node_id}/containers | List vLLM containers running on the node. |
| POST | /{node_id}/containers/{container_id}/stop | Stop and remove a container on the node. |
| GET | /{node_id}/gpu-processes | List GPU processes on the node. |
| POST | /{node_id}/gpu-processes/{pid}/kill | Kill a GPU process on the node. |
Container images
| Method | Path | Description |
| --- | --- | --- |
| GET | /{node_id}/images | List container images on the node. |
| DELETE | /{node_id}/images/{image_id} | Delete a container image from the node. |
| POST | /{node_id}/images/prune | Prune unused container images on the node. |
Warm-cache artifacts
| Method | Path | Description |
| --- | --- | --- |
| GET | /{node_id}/warm-artifacts | List orphaned warm-cache artifacts (RAM sleepers, disk caches). |
| POST | /{node_id}/warm-artifacts/sleepers/{pid}/kill | Kill an orphaned RAM sleeper process. |
| DELETE | /{node_id}/warm-artifacts/caches/{name} | Delete an orphaned warm-cache compile artifact. |
Model cache (HuggingFace)
| Method | Path | Description |
| --- | --- | --- |
| GET | /{node_id}/models/cache | List HuggingFace model cache entries on the node. |
| DELETE | /{node_id}/models/cache/{name} | Delete a cached model from the node's HuggingFace cache. |
Local models
| Method | Path | Description |
| --- | --- | --- |
| GET | /{node_id}/local-models | List models in the node's local model directory. |
| DELETE | /{node_id}/local-models/{name} | Delete a model from local storage. |
| POST | /{node_id}/local-models/pull | Pull a model from a URL to the node's local storage. |
| GET | /{node_id}/local-models/transfers | List active download/upload transfers on the node. |
Streamed upload (multi-file):
| Method | Path | Description |
| --- | --- | --- |
| POST | /{node_id}/local-models/upload/begin | Start a streamed multi-file model upload session. |
| PUT | /{node_id}/local-models/upload/{session_id}/file | Upload a single file within an active upload session. Query param: path. |
| POST | /{node_id}/local-models/upload/{session_id}/finish | Finalize a model upload session. |
| POST | /{node_id}/local-models/upload/{session_id}/abort | Abort and clean up an upload session. |
| POST | /{node_id}/local-models/archive | Upload a model as a single archive (tar/zip). Query params: name, filename. |
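The streamed upload endpoints are meant to be called in order: begin, one PUT per file, then finish (or abort on failure). The sketch below lists that request sequence for a set of files without performing any I/O. The helper names are hypothetical, and the session ID is a placeholder: how the begin call returns it is not described on this page.

```python
from urllib.parse import urlencode


def upload_plan(node_id: int, session_id: str, files: list[str]) -> list[tuple[str, str]]:
    """Return the ordered (method, path) calls for a multi-file upload."""
    base = f"/api/nodes/{node_id}/local-models/upload"
    plan = [("POST", f"{base}/begin")]
    for rel_path in files:
        # Each file goes in its own PUT; its relative path travels as ?path=...
        query = urlencode({"path": rel_path})
        plan.append(("PUT", f"{base}/{session_id}/file?{query}"))
    plan.append(("POST", f"{base}/{session_id}/finish"))
    return plan


def abort_request(node_id: int, session_id: str) -> tuple[str, str]:
    """Call to issue instead of finish when any file upload fails."""
    return "POST", f"/api/nodes/{node_id}/local-models/upload/{session_id}/abort"
```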
Packages
| Method | Path | Description |
| --- | --- | --- |
| POST | /{node_id}/packages/upload | Upload a pip package wheel to the node's package cache. |
| GET | /{node_id}/packages | List pip packages in the node's package cache. |
Request / response schemas
NodeRead (response):
| Field | Type | Description |
| --- | --- | --- |
| id | int | Node ID. |
| hostname | string | Machine hostname. |
| ip_address | string | Reachable IP address of the node agent. |
| port | int | Agent HTTP port. |
| status | string | Health status: online, degraded, critical, maintenance, unknown. |
| gpu_usage | object[] | Per-GPU utilization and memory stats from the latest scrape. |
| disk_usage | object | Disk usage breakdown (total_gb, free_gb, hf_cache_gb). |
| maintenance | bool | True when all GPUs are under maintenance. |
| partial_maintenance | bool | True when some (but not all) GPUs are under maintenance. |
| maintenance_gpus | int[] | GPU indices currently under maintenance. |
| warm_offload_enabled | bool | Whether warm-cache auto-offload is enabled. |
| ram_cache_limit_mb | int | CPU RAM budget in MB for warm-cached models; null = unlimited. |
| ram_cache_used_mb | float | CPU RAM currently held by paused models. |
| available_runtimes | string[] | Container runtimes detected (docker / podman). |
| container_runtime | string | Per-node runtime override; null = auto. |
| rogue_container_count | int | Untracked vLLM containers detected. |
| rogue_process_count | int | Orphaned GPU processes with no live container. |
| rogue_artifact_count | int | Orphaned warm-cache artifacts (RAM + disk). |
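The relationship between maintenance, partial_maintenance, and maintenance_gpus can be expressed in a few lines. This is a sketch of the documented semantics, not the backend's code; the helper name and the `gpu_count` parameter are hypothetical, and the behaviour for a node with no GPUs is an assumption.

```python
def maintenance_flags(maintenance_gpus: list[int], gpu_count: int) -> tuple[bool, bool]:
    """Return (maintenance, partial_maintenance) for a node.

    maintenance         -- every GPU on the node is cordoned
    partial_maintenance -- at least one, but not all, GPUs are cordoned
    """
    cordoned = len(set(maintenance_gpus))
    # Assumption: a node reporting zero GPUs is never considered in maintenance.
    full = gpu_count > 0 and cordoned >= gpu_count
    partial = 0 < cordoned < gpu_count
    return full, partial
```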
NodeMaintenanceRequest (POST /{id}/maintenance):
| Field | Type | Description |
| --- | --- | --- |
| gpu_ids | int[] | GPU indices to cordon/uncordon; empty = all GPUs. |
| enabled | bool | true to cordon, false to uncordon. |
| drain | bool | Stop deployments on the cordoned GPUs. Default false. |
NodeWarmCacheRequest (POST /{id}/warm-cache):
| Field | Type | Description |
| --- | --- | --- |
| enabled | bool | Enable or disable warm-cache auto-offload. |
| ram_cache_limit_mb | int | RAM budget in MB; null = unlimited. |
Configs
Prefix: /api/configs
| Method | Path | Description |
| --- | --- | --- |
| GET | / | List all saved deployment configurations. |
| POST | / | Save a deployment configuration for later reuse. |
| DELETE | /{config_id} | Delete a saved configuration. |
DeploymentConfigCreate (POST /):
| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique name for this saved configuration. |
| payload | object | Deployment parameters snapshot (model_name, port, gpu_memory_fraction, etc.). |
Settings
Prefix: /api/settings
| Method | Path | Description |
| --- | --- | --- |
| GET | / | Return all current runtime settings with their effective values. |
| PUT | / | Apply a partial update to runtime settings; only provided fields are changed. |
RuntimeSettingsUpdate (PUT /):
All fields are optional; only the fields you provide are applied.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| gateway_enabled | bool | true | Enable or disable the OpenAI-compatible /v1 gateway. |
| gateway_timeout_seconds | int | 600 | Non-streaming request timeout in seconds (10–86400). |
| start_timeout_seconds | int | 1800 | Mark deployment errored if stuck starting this long (60–86400). |
| preferred_container_runtime | string | docker | Preferred runtime when a node has both Docker and Podman. |
| default_port | int | 8001 | Default port pre-filled in the deploy form (1024–65535). |
| default_gpu_fraction | float | 0.5 | Default GPU memory fraction (0.05–1.0). |
| default_duration_choice | string | 43200 | Default serve-duration choice for the deploy form (seconds). |
| default_vllm_version | string | "" | Default vLLM version pre-filled in the deploy form. |
| default_max_failed_restarts | int | (none) | Crash-loop threshold for new deployments (1–20); null = client default. |
| webhook_url | string | "" | Webhook URL for deployment lifecycle notifications. |
| expiry_warning_minutes | int | 30 | Warn this many minutes before a deployment expires (1–1440). |
| node_metrics_retention_hours | int | 48 | Hours of node metric history to retain (1–8760). |
| temp_api_key_ttl_seconds | int | 300 | Temporary API key lifespan in seconds (0–3600). |
| default_warm_offload_enabled | bool | true | Enable warm cache by default on newly discovered nodes. |
| busy_guard_seconds | int | 0 | Seconds after last request before a model can be auto-evicted (0–300). |
| nodes_sync_interval_seconds | int | 10 | Node sync loop interval (1–300). |
| deployments_sync_interval_seconds | int | 5 | Deployment sync loop interval (1–300). |
| expiry_check_interval_seconds | int | 30 | Expiry check loop interval (5–600). |
| node_failure_threshold | int | 3 | Consecutive failures before a node turns critical (1–20). |
| deployment_failure_threshold | int | 3 | Unreachable polls before a deployment degrades (1–20). |
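Because the update is partial, a client only needs to send the fields it wants to change. The sketch below builds such a body and pre-checks a handful of the documented ranges before the request goes out. The helper name is hypothetical, the range table covers only some fields, and the server remains the authority on validation.

```python
# Inclusive (min, max) bounds for a few fields, copied from the table above.
SETTING_RANGES = {
    "gateway_timeout_seconds": (10, 86400),
    "start_timeout_seconds": (60, 86400),
    "default_port": (1024, 65535),
    "default_gpu_fraction": (0.05, 1.0),
    "expiry_warning_minutes": (1, 1440),
    "busy_guard_seconds": (0, 300),
}


def build_settings_update(**changes):
    """Return a partial RuntimeSettingsUpdate body containing only `changes`."""
    for name, value in changes.items():
        bounds = SETTING_RANGES.get(name)
        if bounds is None or value is None:
            continue  # fields without a listed range are passed through unchecked
        low, high = bounds
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    return dict(changes)
```

For example, `build_settings_update(default_port=8080)` produces a one-field body that leaves every other setting untouched.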
API Keys
Prefix: /api/api-keys
| Method | Path | Description |
| --- | --- | --- |
| POST | / | Create a new API key (permanent or temporary). |
| GET | / | List all active (non-expired) API keys. |
| PATCH | /{key_id} | Update a permanent key's label or deployment scope. |
| DELETE | /{key_id} | Revoke and delete an API key. |
CreateApiKeyRequest (POST /):
| Field | Type | Description |
| --- | --- | --- |
| label | string | Human-readable name for this key (1–128 chars). |
| ttl_seconds | int | Key lifetime in seconds (1–86400); omit for a permanent key. |
| deployment_ids | int[] | Restrict this key to specific deployment IDs. Null = all deployments. Required for temporary keys. |
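The three fields interact: omitting ttl_seconds makes the key permanent, while a temporary key must name its deployments. A client-side sketch of those rules follows; the helper name is hypothetical and the checks mirror only what the table states.

```python
def build_api_key_request(label, ttl_seconds=None, deployment_ids=None):
    """Build a CreateApiKeyRequest body for a permanent or temporary key."""
    if not 1 <= len(label) <= 128:
        raise ValueError("label must be 1-128 characters")
    body = {"label": label, "deployment_ids": deployment_ids}  # None = all deployments
    if ttl_seconds is not None:
        # A TTL makes the key temporary, and temporary keys must be scoped.
        if not 1 <= ttl_seconds <= 86400:
            raise ValueError("ttl_seconds must be within 1-86400")
        if deployment_ids is None:
            raise ValueError("temporary keys require deployment_ids")
        body["ttl_seconds"] = ttl_seconds
    return body
```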
UpdateApiKeyRequest (PATCH /{key_id}):
| Field | Type | Description |
| --- | --- | --- |
| label | string | Updated label. |
| deployment_ids | int[] | Updated deployment scope. Null = all deployments. |
Gateway
Prefix: /v1
All gateway endpoints require a Bearer API key when permanent keys exist. See Gateway → Authentication.
| Method | Path | Description |
| --- | --- | --- |
| POST | /chat/completions | Proxy an OpenAI-compatible chat completion request to the appropriate vLLM instance. |
| POST | /completions | Proxy an OpenAI-compatible text completion request. |
| POST | /embeddings | Proxy an embeddings request. |
| GET | /models | Return available models in OpenAI /v1/models format, filtered by API key scope. |
The gateway routes by the model field in the request body. Streaming is supported via "stream": true. Error responses follow the OpenAI error format:
{"error": {"message": "...", "type": "...", "param": null, "code": "..."}}
| Code | Meaning |
| --- | --- |
| model_not_found | No running deployment serves this model name. |
| model_ambiguous | Multiple deployments match; use a specific served_model_name. |
| model_loading | Model is still starting up; retry shortly. |
| deployment_unreachable | The node hosting this model is not responding. |
| missing_model | Request body is missing the required model field. |
| invalid_json | Request body is not valid JSON. |
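A client can branch on the code field of the error envelope instead of parsing the message text. The sketch below extracts the code and flags whether a retry is worthwhile. The helper name is hypothetical, and treating only model_loading as retryable is a conservative assumption based on the "retry shortly" note above.

```python
import json

# Only model_loading is documented as worth retrying.
RETRYABLE_CODES = {"model_loading"}


def parse_gateway_error(body: str) -> tuple[str, str, bool]:
    """Return (code, message, retryable) from an OpenAI-style error body."""
    error = json.loads(body)["error"]
    code = error.get("code") or ""
    return code, error.get("message", ""), code in RETRYABLE_CODES
```

A caller might sleep briefly and resend the request when the third element is true, and surface the message to the user otherwise.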
Admin
Prefix: /api/admin
| Method | Path | Description |
| --- | --- | --- |
| POST | /purge | Delete database records by category; running containers are untouched. |
PurgeRequest (POST /purge):
| Field | Type | Description |
| --- | --- | --- |
| targets | string[] | Categories to purge: deployments, nodes, metrics, configs. Omit for all. Purging nodes automatically includes deployments and metrics. |
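Since purging nodes pulls in deployments and metrics, it can help to preview what a given targets list will actually remove. The following sketch expands a request into its effective set of categories; the helper name is hypothetical and the expansion rule is taken from the table above.

```python
ALL_TARGETS = {"deployments", "nodes", "metrics", "configs"}


def effective_targets(targets=None) -> set[str]:
    """Expand a PurgeRequest targets list into the categories actually purged."""
    if targets is None:
        return set(ALL_TARGETS)  # omitting targets purges every category
    unknown = set(targets) - ALL_TARGETS
    if unknown:
        raise ValueError(f"unknown purge targets: {sorted(unknown)}")
    result = set(targets)
    if "nodes" in result:
        # Purging nodes automatically includes their deployments and metrics.
        result |= {"deployments", "metrics"}
    return result
```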
WebSocket
| Method | Path | Description |
| --- | --- | --- |
| WebSocket | /ws | Accept a WebSocket connection for live state-change notifications. |
The backend pushes JSON messages to notify clients of state changes. See Architecture → WebSocket events for the message types and payload format.