AISBF

AI Service Broker Framework — AI Should Be Free

CoderAI documentation · source-backed from Nexlab/coderai

CoderAI operations guide

Run CoderAI as local infrastructure: frontend/engine split, multi-GPU routing, queues, offload, thermal controls, archives, auth, and troubleshooting.

Frontend / engine split

CoderAI can boot as a thin public frontend plus one or more internal engine processes. The frontend imports none of the heavy torch/transformers/diffusers stacks, which keeps the web UI responsive. It forwards HTTP/SSE traffic to the engines and serves an aggregated status and tasks view.

client ─HTTP/SSE─▶ front (public) ─┬─ engine#0 (CUDA_VISIBLE_DEVICES=0, :8780)
                                  ├─ engine#1 (CUDA_VISIBLE_DEVICES=1, :8781)
                                  └─ ...
Launch                                     Result
coderai                                    Front on the public port; auto-spawns one engine per GPU by default.
coderai --single-process                   Legacy one-process mode.
coderai --engine-only --internal-port N    Internal engine mode, normally managed by the front supervisor.
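The launch modes above suggest how a supervisor could start one engine per GPU. The Python sketch below is a hypothetical illustration, not CoderAI's supervisor code. `plan_engines` and `EngineLaunch` are invented names. Only the `--engine-only --internal-port` flags, the `CUDA_VISIBLE_DEVICES` pinning, and the port numbering starting at 8780 come from this guide.

```python
from dataclasses import dataclass

# Hypothetical sketch: CoderAI's real front supervisor may build engine
# processes differently. Only the flags, the CUDA_VISIBLE_DEVICES pinning
# and the 8780+ port pattern come from the guide.
BASE_PORT = 8780  # first internal engine port, as in the diagram


@dataclass
class EngineLaunch:
    index: int
    env: dict[str, str]  # extra environment for this engine process
    argv: list[str]      # command line for this engine process


def plan_engines(gpu_count: int, base_port: int = BASE_PORT) -> list[EngineLaunch]:
    """Build one engine launch per GPU, each pinned to a single device."""
    plans = []
    for i in range(gpu_count):
        port = base_port + i
        plans.append(
            EngineLaunch(
                index=i,
                # Each engine sees only its own GPU (as device 0 inside the process).
                env={"CUDA_VISIBLE_DEVICES": str(i)},
                argv=["coderai", "--engine-only", "--internal-port", str(port)],
            )
        )
    return plans


if __name__ == "__main__":
    for plan in plan_engines(2):
        print(plan.index, plan.env, " ".join(plan.argv))
```

To actually start an engine, each plan could be passed to `subprocess.Popen(plan.argv, env={**os.environ, **plan.env})`. Merging with `os.environ` keeps `PATH` and driver variables available to the child process.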

Engine routing and heterogeneous GPUs

Auto-detection favours NVIDIA GPUs when spawning one engine per GPU, while mixed systems can declare server.engine_specs with backend, env, and capability settings for each engine. A transformers/safetensors model goes to a transformers-capable engine; a GGUF model can route to any compatible NVIDIA or Vulkan engine. The engine for a request is chosen in this order:

  1. Per-model engine pin if compatible.
  2. Already-resident model to avoid reloads.
  3. Configured default engine.
  4. Least-loaded compatible engine.
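The routing order can be sketched as a single selection function. This is a hypothetical Python model, not CoderAI's router. `Engine`, its fields, and `pick_engine` are illustrative names. Picking the least-loaded engine when several already hold the model is an assumed tie-break.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical model of an engine; field names are illustrative, not CoderAI's.
@dataclass
class Engine:
    name: str
    backend: str                 # e.g. "nvidia" or "vulkan"
    capabilities: set[str]       # model formats it can run, e.g. {"transformers", "gguf"}
    loaded_models: set[str] = field(default_factory=set)
    queue_depth: int = 0         # pending + running requests


def pick_engine(model_id: str, model_format: str, engines: list[Engine],
                pins: Optional[dict[str, str]] = None,
                default: Optional[str] = None) -> Optional[Engine]:
    """Return the engine for a request following the documented order, or None."""
    compatible = [e for e in engines if model_format in e.capabilities]
    if not compatible:
        return None  # no engine can run this format; a server might answer 503
    by_name = {e.name: e for e in compatible}
    # 1. Per-model pin, honoured only if the pinned engine is compatible.
    pinned = (pins or {}).get(model_id)
    if pinned in by_name:
        return by_name[pinned]
    # 2. An engine that already has the model resident avoids a reload.
    resident = [e for e in compatible if model_id in e.loaded_models]
    if resident:
        return min(resident, key=lambda e: e.queue_depth)  # assumed tie-break
    # 3. Configured default engine, if it is compatible.
    if default in by_name:
        return by_name[default]
    # 4. Least-loaded compatible engine.
    return min(compatible, key=lambda e: e.queue_depth)
```

Filtering by compatibility first means an invalid pin or default simply falls through to the next rule instead of sending work to an engine that cannot load the model.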

Resource management

VRAM / RAM / disk offload

Model weights can be offloaded from VRAM to host RAM or disk based on per-model settings, per-GPU limits, and a server-wide host RAM cap.
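A minimal sketch of such a cascade, assuming weights fill the VRAM budget first, then spill into host RAM up to the server-wide cap, then go to disk. The function name, its parameters, and the GiB units are hypothetical. CoderAI's actual setting names and placement granularity (per layer or per tensor) are not described in this guide.

```python
from typing import Optional


# Hypothetical placement policy; sizes are in GiB and names are illustrative.
def plan_offload(model_gb: float, gpu_limit_gb: float, host_ram_cap_gb: float,
                 host_ram_used_gb: float = 0.0,
                 model_vram_gb: Optional[float] = None) -> dict[str, float]:
    """Split a model's weights across VRAM, host RAM and disk, in that order."""
    # A per-model VRAM setting can only tighten the GPU limit, never raise it.
    vram_budget = gpu_limit_gb if model_vram_gb is None else min(gpu_limit_gb, model_vram_gb)
    on_gpu = min(model_gb, vram_budget)
    remaining = model_gb - on_gpu
    # Host RAM is shared server-wide, so subtract what other models already use.
    ram_free = max(0.0, host_ram_cap_gb - host_ram_used_gb)
    in_ram = min(remaining, ram_free)
    on_disk = remaining - in_ram
    return {"vram": on_gpu, "ram": in_ram, "disk": on_disk}
```

For example, a 20 GiB model on a 16 GiB GPU with an 8 GiB host RAM cap would place 16 GiB in VRAM and 4 GiB in RAM, with nothing on disk.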

Queues

Requests are queued and processed per model and engine. Concurrency limits have a server-wide default that can be overridden per engine.
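A default limit with per-engine overrides maps naturally onto one semaphore per engine. The asyncio sketch below is hypothetical. `EngineQueues`, `DEFAULT_CONCURRENCY`, and the engine names are illustrative and are not CoderAI configuration keys.

```python
import asyncio

# Hypothetical sketch: names and the default value are illustrative,
# not CoderAI configuration keys.
DEFAULT_CONCURRENCY = 1


class EngineQueues:
    """Per-engine request gates: a shared default limit plus per-engine overrides."""

    def __init__(self, default=DEFAULT_CONCURRENCY, overrides=None):
        self.default = default
        self.overrides = dict(overrides or {})
        self._gates = {}

    def limit(self, engine):
        return self.overrides.get(engine, self.default)

    def _gate(self, engine):
        # Created lazily, from inside the running event loop.
        if engine not in self._gates:
            self._gates[engine] = asyncio.Semaphore(self.limit(engine))
        return self._gates[engine]

    async def run(self, engine, job):
        # Requests beyond the engine's limit wait here until a slot frees.
        async with self._gate(engine):
            return await job()


async def demo():
    """Run 3 jobs on engine0 (default limit 1) and 4 on engine1 (override 2)."""
    queues = EngineQueues(default=1, overrides={"engine1": 2})
    active = {"engine0": 0, "engine1": 0}
    peak = {"engine0": 0, "engine1": 0}

    def make_job(engine):
        async def job():
            active[engine] += 1
            peak[engine] = max(peak[engine], active[engine])
            await asyncio.sleep(0.01)  # stand-in for inference work
            active[engine] -= 1
        return job

    engines = ["engine0"] * 3 + ["engine1"] * 4
    await asyncio.gather(*(queues.run(e, make_job(e)) for e in engines))
    return peak  # highest number of jobs seen running at once, per engine


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping a separate gate per engine means a slow engine only delays its own queue; other engines keep accepting work up to their own limits.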

Prompt cache and aggregation

Prompt caching reuses previously computed KV cache entries; prompt aggregation can batch concurrent requests into a single inference pass.
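Prompt aggregation can be pictured as a short collection window in front of a batched call. This asyncio sketch is a hypothetical illustration of that idea, not CoderAI's implementation. `Aggregator` is an invented name, and `infer_batch` stands in for whatever batched generate function a backend provides. The window length and maximum batch size are assumed parameters.

```python
import asyncio


class Aggregator:
    """Group prompts that arrive within a short window into one batched call.

    Hypothetical sketch: `infer_batch` stands in for a backend's batched
    generate function and is not a CoderAI API.
    """

    def __init__(self, infer_batch, window_s=0.01, max_batch=8):
        self.infer_batch = infer_batch   # list[str] -> list[str], same order
        self.window_s = window_s         # how long to wait for more requests
        self.max_batch = max_batch       # flush immediately at this size
        self.pending = []                # (prompt, future) pairs not yet run
        self._timer = None

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, fut))
        if len(self.pending) >= self.max_batch:
            self._flush()                # batch is full: run it now
        elif self._timer is None:
            self._timer = asyncio.create_task(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window_s)
        self._timer = None
        self._flush()

    def _flush(self):
        batch, self.pending = self.pending, []
        if not batch:
            return
        # One inference pass for every queued prompt. It runs on the event loop
        # here for simplicity; a real server would hand it to a worker.
        outputs = self.infer_batch([prompt for prompt, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```

The window trades a few milliseconds of latency for fewer, larger inference passes. A full batch skips the wait entirely.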

Thermal protection

CPU overheating pauses all engines; GPU overheating pauses only the engine that owns that GPU. Temperature thresholds can be overridden per vendor.
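The pause rule can be expressed as a small pure function. This is a hypothetical sketch. The threshold values in `DEFAULT_LIMITS`, the vendor strings, and the function name are assumptions, because this guide does not give CoderAI's real defaults or config keys. Temperatures are in degrees Celsius.

```python
from typing import Optional

# Hypothetical default thresholds in degrees Celsius; not CoderAI's real values.
DEFAULT_LIMITS = {"cpu": 90.0, "gpu": 85.0}


def engines_to_pause(cpu_temp: float, gpu_temps: dict, engine_gpu: dict,
                     gpu_vendor: dict, vendor_limits: Optional[dict] = None) -> set:
    """Return the set of engine names that should pause.

    gpu_temps:     GPU id -> current temperature
    engine_gpu:    engine name -> GPU id it owns
    gpu_vendor:    GPU id -> vendor string, used to look up an override
    vendor_limits: vendor string -> GPU threshold overriding the default
    """
    vendor_limits = vendor_limits or {}
    if cpu_temp >= DEFAULT_LIMITS["cpu"]:
        return set(engine_gpu)  # CPU heat pauses every engine
    paused = set()
    for engine, gpu in engine_gpu.items():
        limit = vendor_limits.get(gpu_vendor.get(gpu), DEFAULT_LIMITS["gpu"])
        if gpu_temps.get(gpu, 0.0) >= limit:
            paused.add(engine)  # GPU heat pauses only the owning engine
    return paused
```

Because each engine is tied to one GPU, a hot card stalls only that engine's queue while the others keep serving.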

Admin, auth, archive

  • Web sessions use signed cookies; API clients use bearer tokens.
  • Generated files can be auto-saved to an archive with retention such as 1h, 1d, 1w, 1m, 1y, or never.
  • The archive can be browsed and deleted via Web Studio and API.
  • Default admin credentials are for first boot only and should be changed immediately.
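The retention values read as a count followed by a unit suffix. The Python sketch below shows an expiry check under that reading. `parse_retention` and `is_expired` are hypothetical helpers. `m` is assumed to mean a 30-day month (not minutes), `y` a 365-day year, and `never` is assumed to mean the file is kept indefinitely.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed unit lengths: "m" is read as 30 days and "y" as 365 days.
# CoderAI's real parser may treat months and years differently.
UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
    "m": timedelta(days=30),
    "y": timedelta(days=365),
}


def parse_retention(value: str) -> Optional[timedelta]:
    """Turn '1h', '2w', ... into a timedelta; 'never' means keep forever (None)."""
    value = value.strip().lower()
    if value == "never":
        return None
    count, unit = value[:-1], value[-1:]
    if unit not in UNITS or not count.isdigit():
        raise ValueError(f"unsupported retention value: {value!r}")
    return int(count) * UNITS[unit]


def is_expired(saved_at: datetime, retention: str, now: datetime) -> bool:
    """True once an archived file has outlived its retention period."""
    ttl = parse_retention(retention)
    return ttl is not None and now - saved_at >= ttl
```

A periodic cleanup job could call `is_expired` for each archived file and delete the ones that return `True`.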

Troubleshooting checklist

  • Model returns 503: check model id, backend compatibility, VRAM/RAM limits, and whether the engine is cooling or wedged.
  • Vulkan unavailable: verify drivers, ICD files, and GGML_VK_VISIBLE_DEVICES.
  • stable-diffusion.cpp runs on CPU: verify CUDA/Vulkan build flags and runtime libraries.
  • Broker connected but no routed work: check provider_id, client_id, owner scope, and registration token.
  • UI hangs during generation: use the front/engine split instead of single-process mode.