Errors, Statuses & Troubleshooting

Error Response Format

All error responses use a consistent shape:

JSON

{
  "error": "Human readable message"
}
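As a minimal sketch of consuming this shape, a client can pull the message out of the body defensively. The helper name `extract_error_message` is hypothetical, and the sketch assumes only the single `error` field shown above.

```python
import json
from typing import Optional


def extract_error_message(body: str) -> Optional[str]:
    """Return the "error" string from an error body, or None if the body
    is not JSON or does not match the documented shape."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all (for example, an HTML page from an intermediary).
        return None
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        return payload["error"]
    return None
```

Returning `None` instead of raising lets the caller fall back to the HTTP status code when the body is not in the expected format.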

The HTTP status code identifies the class of failure. Common errors:

Status	Error Message	Cause / Fix
401	Invalid API key	Key missing, malformed, or revoked. Regenerate.
401	Missing or invalid Authorization header	No `Bearer` token when the endpoint requires auth.
403	You must generate an API key first to use inference.	Logged-in session user has zero active keys. Create one.
403	Unauthorized access to dedicated endpoint	Trying to call an `endpt-` ID you do not own.
403	Forbidden: Access denied.	Using a key that does not own the dedicated endpoint (only applies to `/api/d/{id}/...` routes).
500	Failed to communicate with inference engine	Upstream error from the inference network. Retry with backoff.
500	Internal Server Error	Unexpected failure in the proxy. Check logs / contact support.
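The table recommends retrying with backoff only for the upstream inference error. One way to sketch that policy is shown below. It assumes a hypothetical `send` callable that performs the request and returns a `(status, message)` pair; the attempt count and delays are illustrative values, not documented limits.

```python
import random
import time
from typing import Callable, Tuple

# Message taken from the table above; the only row that says to retry.
RETRYABLE_MESSAGE = "Failed to communicate with inference engine"


def should_retry(status: int, message: str) -> bool:
    """Retry only the upstream 500; 401/403 and other 500s need a fix, not a retry."""
    return status == 500 and message == RETRYABLE_MESSAGE


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],  # hypothetical request function
    max_attempts: int = 4,                # assumed value
    base_delay: float = 0.5,              # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, message = send()
        is_last = attempt == max_attempts - 1
        if is_last or not should_retry(status, message):
            return status, message
        # Exponential backoff with a little jitter to avoid synchronized retries.
        sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise AssertionError("unreachable")
```

Authentication and ownership errors are returned immediately, since repeating the same request with the same key cannot succeed.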

For the full table of dedicated endpoint statuses, see the Dedicated Inference page.

The statuses that matter most for debugging:

building: Normal during the first 2–10 minutes. Do not send requests yet.
failed: Look at the Logs tab in the dashboard. Common causes: model too large for the chosen hardware, gated model without a token, container crash.
up: Healthy and ready to serve requests.
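Because an endpoint should not be called while it is building, a client can wait for the status to change before sending traffic. The sketch below assumes a hypothetical `get_status` callable that returns the current status string (how the status is fetched is not specified here); the timeout and polling interval are illustrative.

```python
import time
from typing import Callable


def wait_until_up(
    get_status: Callable[[], str],   # hypothetical: returns "building", "failed", "up", ...
    timeout_s: float = 900.0,        # assumed value, comfortably above the 2-10 minute build
    poll_interval_s: float = 15.0,   # assumed value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the endpoint is up, False on timeout.

    Raises RuntimeError if the endpoint reports "failed".
    Any other status is treated as "keep waiting".
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "up":
            return True
        if status == "failed":
            raise RuntimeError("endpoint failed; check the Logs tab in the dashboard")
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Raising on `failed` rather than continuing to poll reflects that a failed endpoint needs attention in the dashboard and will not recover by waiting.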

If requests to a catalog model fail:

Verify the key is active in Settings → Keys.
Confirm the model name in your request is a valid display name from the catalog.
Try the exact same request in the dashboard playground (this isolates client issues from server issues).
If you are streaming, check that your client handles SSE correctly; many issues are on the consumer side.

If requests to a dedicated endpoint fail:

Confirm the endpoint status is up in the dashboard.
Check the Logs tab for runtime startup errors or out-of-memory conditions.
If you used a gated model, confirm the Hugging Face token was valid at provisioning time.
Try the in-dashboard playground bound to that endpoint ID; it removes network variables.

If responses are slow:

Large models on Elastic can occasionally incur cold-start latency on the first request.
Dedicated endpoints that have scaled to zero ("SLEEPING") will wake on the first request ("WAKING UP"). You'll see a clear message in the playground; allow a short delay before retrying. Our S3 cache makes this much faster than a full cold start.
Very high max_tokens values or complex prompts increase time-to-first-token.

If you are not seeing tokens:

Make sure you are reading the response as a stream and parsing lines that start with `data: `.
Do not buffer the entire response.
Handle `data: [DONE]\n\n` as the terminator.
Some frameworks (especially older fetch wrappers) have poor SSE support; consider using a dedicated library (eventsource, the openai SDK, etc.).
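The checklist above can be sketched as a small generator that processes the stream line by line. The function name `iter_sse_data` is hypothetical. The sketch assumes only what is stated above (payload lines start with `data:`, and `[DONE]` ends the stream) and leaves the payload format to the caller.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each `data:` line as it arrives.

    `lines` is any iterable of raw byte lines read incrementally from the
    response, so nothing is buffered beyond the current line.
    """
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            # Blank separators, comments, and other SSE fields are skipped.
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            # Terminator: stop reading, even if more bytes follow.
            return
        yield data
```

Any source that yields lines lazily should work as input, for example a streaming `http.client` response object or `iter_lines()` on a `requests` response opened with `stream=True`.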

See Code Examples for robust client patterns.

We treat inference reliability as the highest priority for the MVP.