AISBF Cluster Troubleshooting Playbook

Diagnose common multi-node AISBF failures: inconsistent provider state, Redis prefix mistakes, cache confusion, token mismatches, and load balancer drift.

Start with the failure shape

Cluster bugs often look random because different nodes see different state. Before changing config, identify whether the problem follows a node, a user token, a route name, or a backend provider.

Symptom	Likely layer	First check
Every other request fails	Load balancer / one AISBF node	Pin requests to each node and compare health.
Dashboard change not visible to API	Database config	Confirm all nodes use the same MySQL database.
Cache hits return stale or cross-environment data	Redis key prefix	Check that production and staging use different prefixes.
Token works on one endpoint but not another	User scope / auth config	Confirm username, token scope, and route prefix.
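One way to see which dimension a failure follows is to tally failure rates per node, user, route, and provider from a request log. The sketch below assumes a hypothetical whitespace-separated log with five fields (`node user route provider status`). AISBF's real log format may differ, so adjust the field numbers. `failure_share` is an illustrative helper, not part of AISBF.

```shell
# Sketch only. ASSUMPTION: a hypothetical log with one request per line:
#   node user route provider http_status
# e.g. "aisbf-2.internal alice chat-default openai 502"

# failure_share FIELD < log
# For each distinct value of field FIELD (1=node, 2=user, 3=route, 4=provider),
# print "value failed/total", counting HTTP status >= 400 as a failure.
failure_share() {
  awk -v f="$1" '
    NF >= 5 { total[$f]++; if ($5 + 0 >= 400) failed[$f]++ }
    END { for (k in total) printf "%s %d/%d\n", k, failed[k], total[k] }
  ' | sort
}
```

For example, piping the log into `failure_share 1` shows whether failures cluster on one node. If every user in `failure_share 2` shows a similar rate, the problem probably does not follow a token.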

Node-by-node health check

for node in aisbf-1.internal aisbf-2.internal aisbf-3.internal; do
  echo "== $node =="
  curl -fsS "http://$node:17765/health" || echo "health failed"
  curl -fsS -H "Authorization: Bearer $AISBF_TOKEN" \
    "http://$node:17765/api/u/$AISBF_USER/models" | head -c 300
  echo
done

If one node differs, fix that node before touching route policy. Policy changes can hide infrastructure drift without solving it.
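Eyeballing three truncated responses works for a quick check, but a fingerprint comparison makes a drifting node stand out. The sketch below hashes each node's models response with `cksum` and prints the nodes that disagree with the most common answer. `fetch_models`, `fingerprint_nodes`, and `odd_nodes` are hypothetical helpers. They reuse port 17765, `$AISBF_TOKEN`, and `$AISBF_USER` from the loop above, and they assume nodes with matching state return byte-identical responses. If the response includes timestamps or unordered lists, normalize it before hashing.

```shell
# Hypothetical helper: fetch one node's model list (same request as the loop above).
fetch_models() {
  curl -fsS -H "Authorization: Bearer $AISBF_TOKEN" \
    "http://$1:17765/api/u/$AISBF_USER/models"
}

# Print "<checksum> <node>" per node; a failed request prints "FAILED <node>".
fingerprint_nodes() {
  local node body
  for node in "$@"; do
    if body="$(fetch_models "$node")"; then
      printf '%s %s\n' "$(printf '%s' "$body" | cksum | cut -d' ' -f1)" "$node"
    else
      printf 'FAILED %s\n' "$node"
    fi
  done
}

# Print nodes whose fingerprint differs from the most common one.
# Nodes whose request failed are always printed.
odd_nodes() {
  fingerprint_nodes "$@" | awk '
    { count[$1]++; sum[NR] = $1; node[NR] = $2 }
    END {
      best = ""
      for (s in count) if (best == "" || count[s] > count[best]) best = s
      for (i = 1; i <= NR; i++)
        if (sum[i] != best || sum[i] == "FAILED") print node[i]
    }'
}
```

`odd_nodes aisbf-1.internal aisbf-2.internal aisbf-3.internal` prints nothing when all nodes agree. With only two nodes a disagreement has no majority, and the node reported is arbitrary, so compare both by hand.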

Shared MySQL checks

mysql -h mysql.internal -u aisbf -p aisbf -e '
  SELECT id, username, tier FROM users LIMIT 10;
  SELECT name, type, enabled FROM providers ORDER BY name LIMIT 20;
'

All AISBF nodes should use the same database host, database name, and migration level. If the dashboard writes to SQLite while API nodes read MySQL, routes will appear to vanish.
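A quick way to catch the SQLite-versus-MySQL split is to collect the database connection string that each node and the dashboard are configured with, then compare the strings after stripping credentials. The sketch assumes URL-style connection strings such as `mysql://aisbf:secret@mysql.internal/aisbf`. How AISBF actually stores its database setting is an assumption here, and `normalize_db_url` and `same_database` are hypothetical helpers. The sketch does not compare migration level, which still needs a separate check.

```shell
# Strip "user:password@" so differing credentials do not cause false mismatches.
# ASSUMPTION: credentials contain no '/' or '@'.
normalize_db_url() {
  printf '%s\n' "$1" | sed -E 's#^([A-Za-z0-9+]+)://[^@/]*@#\1://#'
}

# same_database URL...: compare every URL against the first one.
# Prints each mismatch and returns 1 if any URL points at a different database.
same_database() {
  local ref url status=0
  ref="$(normalize_db_url "$1")"
  for url in "$@"; do
    if [ "$(normalize_db_url "$url")" != "$ref" ]; then
      printf 'mismatch: %s (expected %s)\n' "$(normalize_db_url "$url")" "$ref"
      status=1
    fi
  done
  return "$status"
}
```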

Redis and response-cache checks

redis-cli -h redis.internal -a "$AISBF_REDIS_PASSWORD" --scan --pattern 'aisbf:prod:*' | head
redis-cli -h redis.internal -a "$AISBF_REDIS_PASSWORD" --scan --pattern 'aisbf:response:*' | head
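These scans are only trustworthy if the environments' prefixes are distinct and neither prefix starts with the other. With `aisbf:` for staging and `aisbf:prod:` for production, the pattern `aisbf:*` also matches every production key. The hypothetical helper below flags such overlaps. Ending each prefix with a separator such as `:` keeps `aisbf:prod` from also matching `aisbf:prod2`.

```shell
# prefixes_overlap A B: succeed (exit 0) if either prefix is a leading substring
# of the other, i.e. a SCAN pattern "A*" would also match B's keys or vice versa.
prefixes_overlap() {
  case "$2" in "$1"*) return 0 ;; esac
  case "$1" in "$2"*) return 0 ;; esac
  return 1
}
```

For example, with hypothetical `STAGING_PREFIX` and `PROD_PREFIX` variables holding each environment's configured prefix: `prefixes_overlap "$STAGING_PREFIX" "$PROD_PREFIX" && echo "prefix collision"`.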

Production safety: never reuse the same Redis prefix for staging and production. If you must flush the cache, delete only keys under the AISBF prefix. Never run FLUSHDB or FLUSHALL, which remove every key in the database (or, for FLUSHALL, in every database).
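A prefix-scoped flush can be sketched by piping `redis-cli --scan` into `UNLINK`, with a guard that refuses an empty or wildcard prefix before anything is deleted. `check_flush_prefix`, `flush_prefix`, and the `REDIS_HOST` variable are hypothetical. `UNLINK` needs Redis 4.0 or later (use `DEL` on older servers), `xargs -r` is not POSIX (GNU and BSD xargs accept it), and the pipeline assumes keys contain no whitespace or quotes.

```shell
# check_flush_prefix PREFIX: refuse prefixes that could match more than one
# AISBF namespace (empty, containing glob characters, or not ending in ':').
check_flush_prefix() {
  case "$1" in
    '') echo "refusing: empty prefix" >&2; return 1 ;;
    *'*'* | *'?'* | *'['*) echo "refusing: glob characters in '$1'" >&2; return 1 ;;
    *:) return 0 ;;
    *) echo "refusing: prefix '$1' must end with ':'" >&2; return 1 ;;
  esac
}

# flush_prefix PREFIX: delete only keys that start with PREFIX, 100 keys per UNLINK.
flush_prefix() {
  check_flush_prefix "$1" || return 1
  redis-cli -h "${REDIS_HOST:-redis.internal}" -a "$AISBF_REDIS_PASSWORD" \
      --scan --pattern "$1*" |
    xargs -r -n 100 redis-cli -h "${REDIS_HOST:-redis.internal}" \
      -a "$AISBF_REDIS_PASSWORD" UNLINK
}
```

Run the matching `--scan` command first to preview which keys would go, then `flush_prefix 'aisbf:staging:'` (a hypothetical staging prefix) to delete them.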

Route smoke-test matrix

Keep a tiny matrix for the routes your apps actually call. It catches drift faster than browsing every dashboard page.

routes=(
  "autoselect:chat-default"
  "rotation:support-private-rotation"
  "autoselect:coding-default"
)
for model in "${routes[@]}"; do
  echo "Testing $model"
  curl -fsS -H "Authorization: Bearer $AISBF_TOKEN" \
    -H "Content-Type: application/json" \
    "$AISBF_BASE/api/u/$AISBF_USER/chat/completions" \
    -d '{"model":"'"$model"'","messages":[{"role":"user","content":"smoke"}]}' >/dev/null \
    && echo ok || echo failed
done
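Because the matrix above goes through `$AISBF_BASE`, a failure does not say whether the route or the node behind the load balancer is at fault. Running the same probe against each node directly produces a node-by-route grid. A column of failures points at a route or its providers, and a row of failures points at a node. `probe` and `route_grid` are hypothetical helpers that reuse the request body and port 17765 from the earlier examples.

```shell
# probe NODE MODEL: succeed if a one-message chat completion returns HTTP < 400.
probe() {
  curl -fsS -o /dev/null -H "Authorization: Bearer $AISBF_TOKEN" \
    -H "Content-Type: application/json" \
    "http://$1:17765/api/u/$AISBF_USER/chat/completions" \
    -d '{"model":"'"$2"'","messages":[{"role":"user","content":"smoke"}]}'
}

# route_grid "NODE..." MODEL...: print one line per node with ok/FAIL per model,
# in the order the models were given.
route_grid() {
  local nodes="$1" node model line
  shift
  for node in $nodes; do
    line="$node"
    for model in "$@"; do
      if probe "$node" "$model"; then line="$line ok"; else line="$line FAIL"; fi
    done
    printf '%s\n' "$line"
  done
}
```

For example: `route_grid "aisbf-1.internal aisbf-2.internal aisbf-3.internal" autoselect:chat-default rotation:support-private-rotation`.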

When to edit routes vs infrastructure

Edit routes

A provider is down, too slow, too expensive, or no longer appropriate for a workload.

Edit infrastructure

Nodes disagree, cache prefixes collide, auth differs per node, or health checks fail before reaching providers.

Do not mask drift

Routing around a broken node is useful during an incident, but open a follow-up task to repair the cluster afterward.

Try AISBF

AISBF is open source and also available as a hosted service. During the current testing period, hosted Pro is available as unlimited access for €6/month or €60/year; subscribing helps fund continued AISBF development and infrastructure while this one-human project is still demo-stage.
