Frontend / engine split
CoderAI can boot as a thin public frontend plus one or more internal engine processes. The front imports none of the heavy torch/transformers/diffusers stacks, which keeps the web UI responsive. It proxies HTTP/SSE traffic to the engines and serves an aggregated status/tasks view.
client ─HTTP/SSE─▶ front (public) ─┬─ engine#0 (CUDA_VISIBLE_DEVICES=0, :8780)
                                   ├─ engine#1 (CUDA_VISIBLE_DEVICES=1, :8781)
                                   └─ ...

| Launch | Result |
|---|---|
| coderai | Front on public port; auto-spawns one engine per GPU by default. |
| coderai --single-process | Legacy one-process mode. |
| coderai --engine-only --internal-port N | Internal engine mode normally managed by the front supervisor. |
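
As a rough illustration of what "one engine per GPU" could look like from the supervisor's side, the sketch below only builds the command line and environment for each engine and starts nothing. The flags and the 8780 base port come from the diagram and table above; the function name `plan_engine_launches` and the idea that ports simply count up from the base are assumptions, not the actual supervisor code.

```python
import os

BASE_PORT = 8780  # taken from the diagram; the real default may differ


def plan_engine_launches(gpu_count, base_port=BASE_PORT, executable="coderai"):
    """Return one (argv, env) pair per GPU without starting any process."""
    launches = []
    for index in range(gpu_count):
        port = base_port + index  # assumption: consecutive internal ports
        argv = [executable, "--engine-only", "--internal-port", str(port)]
        env = dict(os.environ)
        # Each engine sees exactly one GPU, as in the diagram.
        env["CUDA_VISIBLE_DEVICES"] = str(index)
        launches.append((argv, env))
    return launches


if __name__ == "__main__":
    for argv, env in plan_engine_launches(2):
        print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Each pair could then be handed to `subprocess.Popen(argv, env=env)` by whatever process plays the supervisor role.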
Engine routing and heterogeneous GPUs
Auto-detection favours NVIDIA for one-engine-per-GPU, while mixed systems can declare server.engine_specs with backend, env, and capability settings. A transformers/safetensors model goes to a transformers-capable engine; GGUF can route to a compatible NVIDIA or Vulkan engine.
- Per-model engine pin if compatible.
- Already-resident model to avoid reloads.
- Configured default engine.
- Least-loaded compatible engine.
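
The selection rules above can be sketched as a small function, assuming the list is an order of preference (pin first, least-loaded last) and that compatibility is decided by model format. The `Engine` class, its fields, and `pick_engine` are hypothetical names for illustration, not the project's API.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    name: str
    formats: frozenset                          # formats this engine can serve
    resident: set = field(default_factory=set)  # model ids already loaded
    active_requests: int = 0


def pick_engine(model_id, model_format, engines, pinned=None, default=None):
    """Return the chosen engine name, or None if no engine is compatible."""
    compatible = [e for e in engines if model_format in e.formats]
    if not compatible:
        return None
    by_name = {e.name: e for e in compatible}
    if pinned in by_name:              # 1. per-model pin, only if compatible
        return pinned
    for engine in compatible:          # 2. already resident, avoids a reload
        if model_id in engine.resident:
            return engine.name
    if default in by_name:             # 3. configured default engine
        return default
    # 4. otherwise the least-loaded compatible engine
    return min(compatible, key=lambda e: e.active_requests).name
```

An incompatible pin is skipped rather than treated as an error here; the real router may behave differently.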
Resource management
VRAM / RAM / disk offload
Models can offload based on per-model settings, GPU limits, and server-wide host RAM caps.
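
To make the interaction of the limits concrete, here is a minimal sketch that assumes the tightest applicable limit wins and that layers fill VRAM first, then host RAM, then disk. Both functions and the greedy policy are illustrative assumptions; the actual offload logic is not described in this section.

```python
def effective_budget(*limits):
    """Tightest of the configured limits; None means 'not set'."""
    set_limits = [limit for limit in limits if limit is not None]
    return min(set_limits) if set_limits else float("inf")


def place_layers(layer_sizes, vram_budget, ram_budget):
    """Assign each layer to 'vram', 'ram', or 'disk', fastest tier first."""
    placement = []
    vram_left, ram_left = vram_budget, ram_budget
    for size in layer_sizes:
        if size <= vram_left:
            placement.append("vram")
            vram_left -= size
        elif size <= ram_left:
            placement.append("ram")
            ram_left -= size
        else:
            placement.append("disk")  # nothing faster has room
    return placement
```

For example, `place_layers(sizes, effective_budget(per_model_vram, gpu_limit), effective_budget(host_ram_cap))` would combine a per-model setting, a GPU limit, and the server-wide RAM cap, with sizes and budgets in the same unit.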
Queues
Requests are queued and processed per model/engine. Concurrency limits can be defaulted and overridden per engine.
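
One common way to get this behaviour is a semaphore per engine, with the limit looked up from a default plus per-engine overrides. The sketch below is written under that assumption; `DEFAULT_CONCURRENCY`, `ENGINE_OVERRIDES`, and `EngineQueue` are hypothetical names, not CoderAI settings.

```python
import asyncio

DEFAULT_CONCURRENCY = 1              # hypothetical server-wide default
ENGINE_OVERRIDES = {"engine#1": 2}   # hypothetical per-engine override


def concurrency_for(engine):
    return ENGINE_OVERRIDES.get(engine, DEFAULT_CONCURRENCY)


class EngineQueue:
    def __init__(self, engine):
        self.engine = engine
        self._slots = asyncio.Semaphore(concurrency_for(engine))
        self.running = 0
        self.peak = 0  # highest number of jobs observed running at once

    async def submit(self, job):
        async with self._slots:  # waits here while the engine is at its limit
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await job()
            finally:
                self.running -= 1


async def demo():
    queue = EngineQueue("engine#1")

    async def job():
        await asyncio.sleep(0.01)
        return "done"

    results = await asyncio.gather(*(queue.submit(job) for _ in range(5)))
    return results, queue.peak
```

Running `asyncio.run(demo())` submits five jobs at once, but no more than two are ever in flight on `engine#1`.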
Prompt cache and aggregation
Prompt caching reuses KV cache; prompt aggregation can batch concurrent requests into one inference pass.
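
The two ideas can be illustrated separately. The sketch assumes that KV reuse is limited to the shared leading tokens of a cached prompt and a new one, and that aggregation groups pending requests per model up to a batch size. Function names and the batching policy are assumptions for illustration only.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens whose KV entries could be reused."""
    count = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        count += 1
    return count


def aggregate(pending, max_batch):
    """Group pending (model_id, prompt) pairs into per-model batches."""
    batches = []
    open_batch = {}  # model_id -> batch currently being filled
    for model_id, prompt in pending:
        batch = open_batch.get(model_id)
        if batch is None or len(batch) >= max_batch:
            batch = []
            open_batch[model_id] = batch
            batches.append((model_id, batch))
        batch.append(prompt)
    return batches
```

Each returned batch would correspond to one inference pass. Only the tokens after the reusable prefix would need fresh computation.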
Thermal protection
CPU heat pauses all engines; GPU heat pauses only the owning engine, with per-vendor threshold overrides.
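
The pause rule can be sketched as a pure function. The threshold values, the vendor names, and the `engines_to_pause` signature below are placeholders chosen for the example, not documented defaults.

```python
def engines_to_pause(engines, cpu_temp, gpu_temps, cpu_limit=90.0,
                     default_gpu_limit=85.0, vendor_limits=None):
    """engines maps engine name -> (gpu id, vendor). Returns names to pause."""
    vendor_limits = vendor_limits or {}
    if cpu_temp >= cpu_limit:
        return set(engines)  # CPU heat affects every engine
    paused = set()
    for name, (gpu_id, vendor) in engines.items():
        # A per-vendor override replaces the default GPU threshold.
        limit = vendor_limits.get(vendor, default_gpu_limit)
        if gpu_temps.get(gpu_id, 0.0) >= limit:
            paused.add(name)  # only the engine that owns the hot GPU
    return paused
```

With `vendor_limits={"amd": 95}`, an AMD GPU at 88 °C keeps running while an NVIDIA GPU at the same temperature pauses its engine under the placeholder default of 85 °C.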
Admin, auth, archive
- Web sessions use signed cookies; API clients use bearer tokens.
- Generated files can be auto-saved to an archive with retention such as 1h, 1d, 1w, 1m, 1y, or never.
- The archive can be browsed and deleted via Web Studio and API.
- Default admin credentials are for first boot only and should be changed immediately.
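
The retention values listed above could be interpreted along these lines. This sketch makes several assumptions that the section does not confirm: `m` means a month of 30 days, `y` means 365 days, and `never` means archived files are never expired. `parse_retention` and `is_expired` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 'm' is a 30-day month and 'y' a 365-day year.
_UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
    "m": timedelta(days=30),
    "y": timedelta(days=365),
}


def parse_retention(value):
    """Return a timedelta, or None for 'never' (assumed: keep forever)."""
    value = value.strip().lower()
    if value == "never":
        return None
    count, unit = value[:-1], value[-1:]
    if unit not in _UNITS or not count.isdigit():
        raise ValueError(f"unsupported retention: {value!r}")
    return int(count) * _UNITS[unit]


def is_expired(saved_at, retention, now=None):
    keep_for = parse_retention(retention)
    if keep_for is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - saved_at > keep_for
```

A cleanup job could call `is_expired(saved_at, "1w")` for each archived file, passing timezone-aware timestamps.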
Troubleshooting checklist
- Model returns 503: check model id, backend compatibility, VRAM/RAM limits, and whether the engine is cooling or wedged.
- Vulkan unavailable: verify drivers, ICD files, and GGML_VK_VISIBLE_DEVICES.
- stable-diffusion.cpp uses CPU: verify CUDA/Vulkan build flags and runtime libraries.
- Broker connected but no routed work: check provider_id, client_id, owner scope, and registration token.
- UI hangs during generation: prefer front/engine split rather than single-process mode.
AISBF