CoderAI operations guide

Frontend / engine split

CoderAI can boot as a thin public frontend plus one or more internal engine processes. The frontend imports none of the heavy torch/transformers/diffusers stacks, so the web UI stays responsive. It relays HTTP/SSE traffic to the engines and serves an aggregated status/tasks view.

client ─HTTP/SSE─▶ front (public) ─┬─ engine#0 (CUDA_VISIBLE_DEVICES=0, :8780)
                                  ├─ engine#1 (CUDA_VISIBLE_DEVICES=1, :8781)
                                  └─ ...

Launch	Result
`coderai`	Front on public port; auto-spawns one engine per GPU by default.
`coderai --single-process`	Legacy one-process mode.
`coderai --engine-only --internal-port N`	Internal engine mode normally managed by the front supervisor.
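
As a rough illustration of the default launch mode, the sketch below builds the command line and environment for one engine per GPU without starting anything. The `--engine-only` and `--internal-port` flags, `CUDA_VISIBLE_DEVICES`, and port 8780 come from this guide; the function name `engine_launch_plan` and the rule "port = base port + GPU index" are assumptions made for the example, not a description of the actual supervisor.

```python
import os


def engine_launch_plan(gpu_count, base_port=8780, executable="coderai"):
    """Return one (argv, env) pair per GPU; nothing is spawned here.

    Hypothetical helper: the real front supervisor may assign ports
    and environment variables differently.
    """
    plan = []
    for gpu_index in range(gpu_count):
        env = dict(os.environ)
        # Restrict each engine to a single GPU, as in the diagram.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
        argv = [
            executable,
            "--engine-only",
            "--internal-port",
            str(base_port + gpu_index),  # assumed port scheme
        ]
        plan.append((argv, env))
    return plan
```

Each pair could be handed to `subprocess.Popen(argv, env=env)` by a supervisor; keeping the plan separate from the spawn makes it easy to inspect or log before anything starts.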

Engine routing and heterogeneous GPUs

Auto-detection favours NVIDIA GPUs when spawning one engine per GPU. Mixed systems can instead declare server.engine_specs with backend, env, and capability settings for each engine. A transformers/safetensors model is routed to a transformers-capable engine; a GGUF model can be routed to a compatible NVIDIA or Vulkan engine.

Among compatible engines, the routing preferences are:

Per-model engine pin if compatible.
Already-resident model to avoid reloads.
Configured default engine.
Least-loaded compatible engine.
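
The four preferences above can be read as a fallback chain. The sketch below assumes they are applied in the order listed and that "load" means the number of active requests; both are assumptions, as are the `Engine` class and the `choose_engine` function, which are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    name: str
    backends: set                      # e.g. {"transformers", "gguf"}
    resident_models: set = field(default_factory=set)
    active_requests: int = 0           # assumed load metric


def choose_engine(model_id, backend, engines, pinned=None, default=None):
    """Pick an engine for model_id, or None if no engine is compatible."""
    compatible = [e for e in engines if backend in e.backends]
    if not compatible:
        return None
    by_name = {e.name: e for e in compatible}

    # 1. Per-model pin, honoured only if the pinned engine is compatible.
    if pinned in by_name:
        return by_name[pinned]
    # 2. An engine that already holds the model, to avoid a reload.
    resident = [e for e in compatible if model_id in e.resident_models]
    if resident:
        return min(resident, key=lambda e: e.active_requests)
    # 3. The configured default engine, if compatible.
    if default in by_name:
        return by_name[default]
    # 4. Otherwise the least-loaded compatible engine.
    return min(compatible, key=lambda e: e.active_requests)
```

Filtering for compatibility first means an incompatible pin or default simply falls through to the next rule instead of producing an error.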

Resource management

VRAM / RAM / disk offload

Model weights can be offloaded from VRAM to host RAM or disk, depending on per-model settings, GPU limits, and server-wide host RAM caps.
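
To make the tiering concrete, here is a minimal sketch of one possible placement policy: fill VRAM up to the GPU limit, then host RAM up to the server-wide cap, then spill the rest to disk. The function `place_layers`, the per-layer granularity, and the megabyte units are all assumptions for illustration; the actual offload logic may work differently.

```python
def place_layers(layer_sizes_mb, vram_limit_mb, host_ram_cap_mb):
    """Assign each layer to "vram", "ram", or "disk", in layer order.

    Hypothetical policy: once a tier is full, all later layers go to
    the next tier, so the placement stays contiguous.
    """
    tiers = [
        ("vram", vram_limit_mb),
        ("ram", host_ram_cap_mb),
        ("disk", float("inf")),   # disk is treated as unbounded here
    ]
    placement = []
    tier = 0
    used = 0
    for size in layer_sizes_mb:
        # Move to the next tier while the layer does not fit.
        while used + size > tiers[tier][1]:
            tier += 1
            used = 0
        placement.append(tiers[tier][0])
        used += size
    return placement
```

For example, five 400 MB layers with a 1000 MB VRAM limit and a 900 MB host RAM cap would place two layers in VRAM, two in RAM, and one on disk.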

Queues

Requests are queued and processed per model/engine. A default concurrency limit applies to every engine and can be overridden per engine.
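
A default limit with per-engine overrides can be sketched with one `asyncio.Semaphore` per engine. The class `EngineQueue` and its parameters are hypothetical names chosen for this example; it shows the general pattern and makes no claim about how CoderAI implements its queues.

```python
import asyncio


class EngineQueue:
    """Bound concurrent jobs per engine (illustrative sketch only)."""

    def __init__(self, default_limit=1, overrides=None):
        self.default_limit = default_limit
        self.overrides = overrides or {}   # engine name -> limit
        self._semaphores = {}

    def limit_for(self, engine):
        return self.overrides.get(engine, self.default_limit)

    def _semaphore(self, engine):
        # Created lazily, inside the running event loop.
        if engine not in self._semaphores:
            self._semaphores[engine] = asyncio.Semaphore(self.limit_for(engine))
        return self._semaphores[engine]

    async def run(self, engine, job):
        # Callers beyond the limit wait here until a slot frees up.
        async with self._semaphore(engine):
            return await job()
```

With `EngineQueue(default_limit=1, overrides={"engine1": 2})`, jobs for `engine0` would run one at a time while `engine1` would run up to two at once.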

Prompt cache and aggregation

Prompt caching reuses the KV cache across requests; prompt aggregation can batch concurrent requests into one inference pass.
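
The two ideas can be sketched independently of any inference backend. The example assumes that cache reuse is based on a shared token prefix (a common approach, but not confirmed by this guide) and that aggregation groups pending requests by model up to a maximum batch size. `reusable_prefix_len` and `aggregate` are hypothetical helper names.

```python
def reusable_prefix_len(cached_tokens, prompt_tokens):
    """Count leading tokens shared by a cached sequence and a new prompt.

    Under the prefix assumption, only the tokens after this point
    would need a fresh prefill.
    """
    n = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        n += 1
    return n


def aggregate(pending, max_batch):
    """Group (model_id, prompt) pairs into per-model batches.

    Each batch holds at most max_batch prompts and could be run as a
    single inference pass.
    """
    by_model = {}
    for model_id, prompt in pending:
        by_model.setdefault(model_id, []).append(prompt)
    batches = []
    for model_id, prompts in by_model.items():
        for start in range(0, len(prompts), max_batch):
            batches.append((model_id, prompts[start:start + max_batch]))
    return batches
```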

Thermal protection

When the CPU overheats, all engines pause. When a GPU overheats, only the engine that owns it pauses. Temperature thresholds can be overridden per vendor.
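
The pause rule can be expressed as a small decision function. Everything concrete in this sketch is an assumption: the function name `engines_to_pause`, the dictionary shapes, and the default limits of 90 °C (CPU) and 85 °C (GPU) are placeholders, not CoderAI defaults.

```python
def engines_to_pause(cpu_temp_c, gpu_temps_c, gpu_owner, gpu_vendor,
                     cpu_limit_c=90.0, gpu_limit_c=85.0,
                     vendor_limits_c=None):
    """Return the set of engine names that should pause.

    gpu_temps_c: GPU id -> temperature in Celsius
    gpu_owner:   GPU id -> name of the engine that owns it
    gpu_vendor:  GPU id -> vendor string, e.g. "nvidia"
    vendor_limits_c: optional vendor -> threshold override
    """
    vendor_limits_c = vendor_limits_c or {}
    # A hot CPU affects every engine on the host.
    if cpu_temp_c >= cpu_limit_c:
        return set(gpu_owner.values())
    paused = set()
    for gpu_id, temp in gpu_temps_c.items():
        limit = vendor_limits_c.get(gpu_vendor[gpu_id], gpu_limit_c)
        # A hot GPU pauses only the engine that owns it.
        if temp >= limit:
            paused.add(gpu_owner[gpu_id])
    return paused
```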

Admin, auth, archive

Web sessions use signed cookies; API clients use bearer tokens.
Generated files can be auto-saved to an archive with retention such as 1h, 1d, 1w, 1m, 1y, or never.
The archive can be browsed and deleted via Web Studio and API.
Default admin credentials are for first boot only and should be changed immediately.
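
The retention values listed above are compact duration strings. The sketch below shows one way to interpret them, under three loudly stated assumptions: `m` means 30 days, `y` means 365 days, and `never` means the file never expires. The helper names are hypothetical, and the real parser may define these units differently.

```python
from datetime import datetime, timedelta

# Assumed unit lengths; "m" and "y" are approximations.
_UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
    "m": timedelta(days=30),
    "y": timedelta(days=365),
}


def retention_to_timedelta(value):
    """Convert "1h", "1d", "1w", "1m", "1y" to a timedelta; "never" -> None."""
    if value == "never":
        return None  # assumed meaning: keep indefinitely
    count, unit = value[:-1], value[-1:]
    if not count.isdigit() or unit not in _UNITS:
        raise ValueError(f"unrecognised retention value: {value!r}")
    return int(count) * _UNITS[unit]


def is_expired(created_at, retention, now):
    """True if a file created at created_at has outlived its retention."""
    ttl = retention_to_timedelta(retention)
    return ttl is not None and now - created_at > ttl
```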

Troubleshooting checklist

Model returns 503: check model id, backend compatibility, VRAM/RAM limits, and whether the engine is cooling or wedged.
Vulkan unavailable: verify drivers, ICD files, and GGML_VK_VISIBLE_DEVICES.
stable-diffusion.cpp uses CPU: verify CUDA/Vulkan build flags and runtime libraries.
Broker connected but no routed work: check provider_id, client_id, owner scope, and registration token.
UI hangs during generation: prefer front/engine split rather than single-process mode.
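
For the Vulkan item in the checklist, a quick preflight script can gather the relevant facts in one place. This sketch is not part of CoderAI. It assumes a Linux host and looks for ICD manifests in the Vulkan loader's usual search directories; `vulkan_preflight` is a hypothetical helper name. It only reports what it finds and does not validate the drivers themselves.

```python
import os
from pathlib import Path

# Common Linux locations searched by the Vulkan loader (assumed host layout).
DEFAULT_ICD_DIRS = (
    "/etc/vulkan/icd.d",
    "/usr/share/vulkan/icd.d",
    "/usr/local/share/vulkan/icd.d",
)


def vulkan_preflight(env=None, icd_dirs=DEFAULT_ICD_DIRS):
    """Collect ICD manifest paths and Vulkan-related environment settings."""
    env = os.environ if env is None else env
    icd_files = []
    for directory in icd_dirs:
        path = Path(directory)
        if path.is_dir():  # skip directories that do not exist
            icd_files.extend(str(p) for p in path.glob("*.json"))
    return {
        "icd_files": sorted(icd_files),
        # Device selection variable named in the checklist.
        "GGML_VK_VISIBLE_DEVICES": env.get("GGML_VK_VISIBLE_DEVICES"),
        # Loader override that can hide otherwise installed ICDs.
        "VK_ICD_FILENAMES": env.get("VK_ICD_FILENAMES"),
    }
```

An empty `icd_files` list suggests that no driver manifest is installed in the searched locations, which would be a likely reason for Vulkan being reported as unavailable.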