INFERENCE DOCUMENTATION

Errors, Statuses & Troubleshooting

Common error responses, endpoint statuses, and how to debug inference issues.


Error Response Format

All error responses use a consistent shape:

JSON
{
  "error": "Human readable message"
}

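The HTTP status code identifies the class of failure (for example, 401 for authentication problems, 500 for server-side errors); the error field carries the human-readable detail.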
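As a minimal sketch of consuming this shape on the client side, the helper below turns a status code and raw body into one log-friendly line. The function name describe_error is hypothetical, and the fallback for bodies that are not the documented JSON object is an assumption rather than documented behavior.

```python
import json


def describe_error(status_code: int, body: str) -> str:
    """Return a one-line summary of an error response.

    Hypothetical helper: assumes the body is the documented
    {"error": "..."} object, but tolerates anything else.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None

    # Only trust the message if the body really has the documented shape.
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        message = payload["error"]
    else:
        message = "(no error message in body)"
    return f"HTTP {status_code}: {message}"
```

Logging both the status code and the message makes it easier to match a failure against the table of common errors.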


Common Errors (Elastic & Dedicated)

Status | Error Message | Cause / Fix
401 | Invalid API key | Key missing, malformed, or revoked. Regenerate.
401 | Missing or invalid Authorization header | No Bearer token when the endpoint requires auth.
403 | You must generate an API key first to use inference. | Logged-in session user has zero active keys. Create one.
403 | Unauthorized access to dedicated endpoint | Trying to call an endpt- ID you do not own.
403 | Forbidden: Access denied. | Using a key that does not own the dedicated endpoint (only applies to /api/d/{id}/... routes).
500 | Failed to communicate with inference engine | Upstream error from the inference network. Retry with backoff.
500 | Internal Server Error | Unexpected failure in the proxy. Check logs / contact support.
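The "retry with backoff" advice for upstream 500 errors could look roughly like the sketch below. Everything here is illustrative: call_with_backoff, the send callable, the attempt count, and the delays are assumptions, not documented values. For simplicity it retries any 500 a bounded number of times, while 401 and 403 are returned immediately because they need a key or ownership fix rather than a retry.

```python
import time
from typing import Callable, Tuple

# Assumption: only 500 is worth retrying; 401/403 need a configuration fix.
RETRYABLE_STATUSES = {500}


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one request and
    returns (status_code, body). Delays double each time: 0.5s, 1s, 2s, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = send()
        is_last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or is_last_attempt:
            return status, body
        sleep(base_delay * 2 ** attempt)

    raise AssertionError("unreachable: the loop always returns")
```

If a 500 persists after the final attempt, the last response is returned unchanged so the caller can log its error message.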

Dedicated Endpoint Statuses

See the Dedicated Inference page for the full table.

Key ones for debugging:

  • building — Normal during first 2–10 minutes. Do not call yet.
  • failed — Look at the Logs tab in the dashboard. Common causes: model too large for chosen hardware, gated model without token, container crash.
  • up — Healthy.
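A client that should not call an endpoint while it is still building could gate on these statuses, roughly as sketched below. This page does not document a status API, so get_status is a hypothetical callable you would supply yourself; the 15-minute timeout and 15-second poll interval are likewise assumptions chosen to sit just above the 2–10 minute build window.

```python
import time
from typing import Callable


def wait_until_up(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,
    poll_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until the endpoint reports "up".

    get_status is hypothetical: it should return the current status
    string, such as "building", "failed", or "up".
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "up":
            return
        if status == "failed":
            # Waiting longer will not help; the Logs tab has the cause.
            raise RuntimeError("endpoint build failed; check the Logs tab")
        if clock() >= deadline:
            raise TimeoutError(f"endpoint still {status!r} after {timeout_s}s")
        sleep(poll_s)
```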

Debugging Checklist

Elastic Calls Failing

  1. Verify the key is active in Settings → Keys.
  2. Confirm the model name in your request is a valid display name from the catalog.
  3. Try the exact same request in the dashboard playground (isolates client vs server issues).
  4. Check that your client properly handles SSE (many issues are on the consumer side).

Dedicated Endpoint Not Responding

  1. Confirm status is up in the dashboard.
  2. Check the Logs tab for runtime startup errors or out-of-memory conditions.
  3. If you used a gated model, confirm the HF token was valid at provisioning time.
  4. Try the in-dashboard playground bound to that endpoint ID — it removes network variables.

Slow First Token / High Latency

  • Large models on Elastic can have occasional cold-start latency on first request.
  • Dedicated endpoints that have scaled to zero (status "SLEEPING") wake on the first request (status "WAKING UP"). The playground shows a clear message while this happens; allow a short delay before retrying. The S3 cache makes waking much faster than a full cold start.
  • Very high max_tokens or complex prompts increase time-to-first-token.
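To tell a slow first token apart from slow generation overall, it can help to measure both. The sketch below is one way to do that for any iterable of streamed chunks; time_to_first_token is a hypothetical helper, not part of any SDK, and it assumes each item the iterable yields corresponds to received output.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_to_first_token(
    chunks: Iterable[object],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], float]:
    """Consume a stream and return (seconds_to_first_chunk, total_seconds).

    The first value is None if the stream produced no chunks at all.
    """
    start = clock()
    first: Optional[float] = None
    for _ in chunks:
        if first is None:
            # Record the delay once, when the first chunk arrives.
            first = clock() - start
    return first, clock() - start
```

A long delay before the first chunk followed by fast completion points at cold start or wake-up; a uniformly slow stream points at model size, max_tokens, or prompt complexity.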

Streaming Client Issues

If you are not seeing tokens:

  • Make sure you are reading the response as a stream and parsing lines that start with "data: ".
  • Do not buffer the entire response before processing it.
  • Treat "data: [DONE]\n\n" as the end-of-stream terminator.
  • Some frameworks (especially older fetch wrappers) have poor SSE support; consider using a dedicated library (eventsource, the OpenAI SDK, etc.).
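The points above can be sketched as a small line-based parser. This is a minimal illustration, not the official client: iter_sse_data is a hypothetical name, and it deliberately returns each payload as a raw string because this page does not specify the format of the chunks.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each "data:" line until the [DONE] terminator.

    lines should be an incrementally read stream of text lines, so that
    payloads are yielded as they arrive rather than after buffering.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            # Skip blank event separators, comments, and other SSE fields.
            continue
        payload = line[len("data:"):]
        if payload.startswith(" "):
            # SSE allows one optional space after the colon.
            payload = payload[1:]
        if payload == "[DONE]":
            return
        yield payload
```

With the requests library, for example, the lines could come from response.iter_lines(decode_unicode=True) on a request made with stream=True, which reads the body incrementally instead of buffering it.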

See Code Examples for robust client patterns.


Still Stuck?

  • Check the dashboard logs for dedicated endpoints first.
  • For Elastic, the proxy logs detailed errors server-side (we monitor these).
  • Use the enterprise contact form for production-impacting issues.

We treat inference reliability as the highest priority for the MVP.
