AI Service Broker Framework — AI Should Be Free

AISBF Cluster Troubleshooting Playbook

Diagnose common multi-node AISBF failures: inconsistent provider state, Redis prefix mistakes, cache confusion, token mismatches, and load balancer drift.

Start with the failure shape

Cluster bugs often look random because different nodes see different state. Before changing config, identify whether the problem follows a node, a user token, a route name, or a backend provider.

Symptom | Likely layer | First check
Every other request fails | Load balancer / one AISBF node | Pin requests to each node and compare health.
Dashboard change not visible to API | Database config | Confirm all nodes use the same MySQL database.
Cache hits return stale or cross-environment data | Redis key prefix | Check production and staging prefixes are different.
Token works on one endpoint but not another | User scope / auth config | Confirm username, token scope, and route prefix.
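For the "every other request fails" symptom, it can help to measure the failure pattern through the load balancer before pinning individual nodes. The sketch below assumes the load balancer is reachable at $AISBF_BASE and forwards /health; probe_lb and tally_codes are hypothetical helper names, not part of AISBF.

```shell
# Sketch: send repeated requests through the load balancer and count status codes.
# probe_lb and tally_codes are illustrative helper names (assumptions).

# Send N requests to a URL and print one HTTP status code per line.
# curl prints 000 when no HTTP response was received at all.
probe_lb() {
  local url="$1" n="${2:-20}" i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' "$url"
    i=$((i + 1))
  done
}

# Read status codes on stdin and print "count code" lines, sorted by code.
tally_codes() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage (hits the network, so it is left commented out here):
#   probe_lb "$AISBF_BASE/health" 20 | tally_codes
```

With two nodes behind a round-robin balancer, a roughly even split between 200 and an error code suggests one unhealthy node rather than a provider problem.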

Node-by-node health check

for node in aisbf-1.internal aisbf-2.internal aisbf-3.internal; do
  echo "== $node =="
  curl -fsS "http://$node:17765/health" || echo "health failed"
  curl -fsS -H "Authorization: Bearer $AISBF_TOKEN" \
    "http://$node:17765/api/u/$AISBF_USER/models" | head -c 300
  echo
done

If one node differs, fix that node before touching route policy. Policy changes can hide infrastructure drift without solving it.
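Comparing truncated output by eye gets harder as the node count grows. One possible approach, sketched here, is to fingerprint each node's model list and report the nodes that disagree with the majority. fingerprint_models and find_outliers are hypothetical helper names; the sketch also assumes sha256sum is available (GNU coreutils) and that healthy nodes return byte-identical responses, which may not hold if the ordering of the list varies.

```shell
# Sketch: find nodes whose model list differs from the majority.
# fingerprint_models and find_outliers are illustrative names (assumptions).

# Print a SHA-256 fingerprint of one node's model list, or "unreachable".
fingerprint_models() {
  local node="$1" body
  if body=$(curl -fsS --max-time 5 \
      -H "Authorization: Bearer $AISBF_TOKEN" \
      "http://$node:17765/api/u/$AISBF_USER/models"); then
    printf '%s' "$body" | sha256sum | cut -d' ' -f1
  else
    echo "unreachable"
  fi
}

# Read "node fingerprint" lines on stdin and print every node whose
# fingerprint differs from the most common one.
find_outliers() {
  awk '
    { fp[$1] = $2; count[$2]++ }
    END {
      for (f in count) if (count[f] > best) { best = count[f]; majority = f }
      for (n in fp) if (fp[n] != majority) print n
    }
  ' | sort
}

# Usage (hits the network, so it is left commented out here):
#   for node in aisbf-1.internal aisbf-2.internal aisbf-3.internal; do
#     echo "$node $(fingerprint_models "$node")"
#   done | find_outliers
```

An empty result means every node returned the same response. If the fingerprints split evenly, the choice of majority is arbitrary, so treat the output as a pointer to investigate rather than a verdict.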

Shared MySQL checks

mysql -h mysql.internal -u aisbf -p aisbf -e '
  SELECT id, username, tier FROM users LIMIT 10;
  SELECT name, type, enabled FROM providers ORDER BY name LIMIT 20;
'

All AISBF nodes should use the same database host, database name, and migration level. If the dashboard writes to SQLite while API nodes read MySQL, dashboard changes never reach the API, so routes will appear to vanish.

Redis and response-cache checks

redis-cli -h redis.internal -a "$AISBF_REDIS_PASSWORD" --scan --pattern 'aisbf:prod:*' | head
redis-cli -h redis.internal -a "$AISBF_REDIS_PASSWORD" --scan --pattern 'aisbf:response:*' | head

Production safety: never reuse the same Redis prefix for staging and production. If you must flush cache, flush only the AISBF prefix, not the entire Redis database.
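A prefix-scoped flush could look like the sketch below. It assumes a single Redis instance (not Redis Cluster), Redis 4.0 or later for UNLINK, GNU xargs for the -r flag, and keys without whitespace or quotes. require_safe_prefix and flush_prefix are hypothetical helper names; the guard refuses an empty prefix or one containing glob characters so a typo cannot match every key.

```shell
# Sketch: delete only the keys under one prefix, never the whole database.
# require_safe_prefix and flush_prefix are illustrative names (assumptions).

# Accept only a non-empty prefix that ends with ":" and has no glob characters.
require_safe_prefix() {
  local prefix="$1"
  case "$prefix" in
    ''|*'*'*|*'?'*|*'['*)
      echo "refusing unsafe prefix: '$prefix'" >&2; return 1 ;;
    *:)
      return 0 ;;
    *)
      echo "prefix must end with ':': '$prefix'" >&2; return 1 ;;
  esac
}

# Scan for keys under the prefix and unlink them in batches of 100.
flush_prefix() {
  local prefix="$1"
  require_safe_prefix "$prefix" || return 1
  (
    # REDISCLI_AUTH keeps the password off the command line.
    export REDISCLI_AUTH="$AISBF_REDIS_PASSWORD"
    redis-cli -h redis.internal --scan --pattern "${prefix}*" |
      xargs -r -n 100 redis-cli -h redis.internal UNLINK
  )
}

# Usage (modifies Redis, so it is left commented out here):
#   flush_prefix 'aisbf:staging:'
```

Running the --scan half on its own first, piped to wc -l, shows how many keys would be removed before anything is deleted.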

Route smoke-test matrix

Keep a tiny matrix for the routes your apps actually call. It catches drift faster than browsing every dashboard page.

routes=(
  "autoselect:chat-default"
  "rotation:support-private-rotation"
  "autoselect:coding-default"
)
for model in "${routes[@]}"; do
  echo "Testing $model"
  curl -fsS -H "Authorization: Bearer $AISBF_TOKEN" \
    -H "Content-Type: application/json" \
    "$AISBF_BASE/api/u/$AISBF_USER/chat/completions" \
    -d '{"model":"'"$model"'","messages":[{"role":"user","content":"smoke"}]}' >/dev/null \
    && echo ok || echo failed
done
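If the matrix runs from cron or CI, it is convenient for the script to exit non-zero when any route fails. The following is a minimal sketch of that idea; smoke_route and run_matrix are hypothetical helper names, and the request itself mirrors the loop above.

```shell
# Sketch: run the smoke matrix and return a failing status if any route fails.
# smoke_route and run_matrix are illustrative names (assumptions).

# Send one minimal chat completion for the given model name.
smoke_route() {
  local model="$1"
  curl -fsS --max-time 30 \
    -H "Authorization: Bearer $AISBF_TOKEN" \
    -H "Content-Type: application/json" \
    "$AISBF_BASE/api/u/$AISBF_USER/chat/completions" \
    -d '{"model":"'"$model"'","messages":[{"role":"user","content":"smoke"}]}' \
    >/dev/null
}

# Test every model name passed as an argument; return 1 if any of them failed.
run_matrix() {
  local failed=0 model
  for model in "$@"; do
    if smoke_route "$model"; then
      echo "ok $model"
    else
      echo "FAILED $model"
      failed=$((failed + 1))
    fi
  done
  return $(( failed > 0 ))
}

# Usage (hits the network, so it is left commented out here):
#   run_matrix "autoselect:chat-default" "rotation:support-private-rotation" || exit 1
```

Keeping the request in its own function also makes it possible to swap in a stub when checking the matrix logic without a live cluster.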

When to edit routes vs infrastructure

Edit routes

A provider is down, too slow, too expensive, or no longer appropriate for a workload.

Edit infrastructure

Nodes disagree, cache prefixes collide, auth differs per node, or health checks fail before reaching providers.

Do not mask drift

Routing around a broken node is useful during an incident, but leave a follow-up to repair the cluster.
