Elastic Inference API

OpenAI-compatible API for 37 models through Hypervize’s global serverless inference network.

Elastic Inference is Hypervize’s serverless offering. You send standard chat completion requests, and we route them across our global inference network. There is no infrastructure for you to provision or manage.


Endpoint

TEXT
POST https://hypervize.tech/api/chat/completions

All requests must include a valid Authorization: Bearer hvz_live_... header. The only exception is requests made from the dashboard while you are logged in and the account has at least one active key.


Request Format

The API is OpenAI Chat Completions compatible. The following fields are currently supported:

JSON
{
  "model": "claude-sonnet-4.6",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": true,
  "stop": ["\n\n"]
}
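As an illustration of the request format above, here is a minimal Python sketch that builds the POST request using only the standard library. The helper name build_chat_request is hypothetical, and the Content-Type header is an assumption based on the JSON body; the source only documents the Authorization header. The key value is a placeholder.

```python
import json
import urllib.request

API_URL = "https://hypervize.tech/api/chat/completions"


def build_chat_request(api_key: str, model: str, messages: list,
                       **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    # Optional fields such as max_tokens, temperature, top_p, stream and stop
    # are passed through unchanged as keyword arguments.
    payload = {"model": model, "messages": messages, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the endpoint expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    api_key="hvz_live_...",  # placeholder; substitute your real key
    model="claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=1024,
    temperature=0.7,
    stream=False,
)
```

The request can then be sent with urllib.request.urlopen(request, timeout=60); any HTTP client that can POST JSON should work equally well.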

Important: Model Name Handling

Prefer the human-friendly display names from the catalog (e.g., claude-sonnet-4.6, llama-3-3-70b-instruct).

The Hypervize proxy automatically translates these names to the correct backend identifiers. You can also pass the raw provider identifier if you prefer.

See Supported Models for the full list with pricing and context windows.


Streaming Responses (Recommended)

Hypervize Inference strongly encourages streaming for all production use cases.

Responses are standard Server-Sent Events:

TEXT
data: {"id":"...","choices":[{"delta":{"content":"Hello"},"index":0}],"model":"...","created":...}

data: {"id":"...","choices":[{"delta":{"content":" world"},"index":0}],"model":"...","created":...}

...

data: {"usage":{"prompt_tokens":12,"completion_tokens":48,"total_tokens":60}}

data: [DONE]

The final usage object (when present) contains the token counts used for metering.
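The event stream shown above can be consumed with a small parser. The following is a sketch, not an official client: the function name collect_stream is hypothetical, and it assumes each event arrives on a single line prefixed with "data:", as in the sample.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Accumulate streamed text and the final usage object from SSE lines."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(data)
        if event.get("usage"):
            usage = event["usage"]  # token counts used for metering
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), usage


sample = [
    'data: {"id":"x","choices":[{"delta":{"content":"Hello"},"index":0}]}',
    "",
    'data: {"id":"x","choices":[{"delta":{"content":" world"},"index":0}]}',
    "",
    'data: {"usage":{"prompt_tokens":12,"completion_tokens":48,"total_tokens":60}}',
    "",
    "data: [DONE]",
]
text, usage = collect_stream(sample)
```

Because the usage object is only present on some streams, callers should treat a returned usage of None as "not reported" rather than as zero tokens.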


Non-Streaming Responses

You may set "stream": false (or omit the field). In this case you will receive a single JSON response containing the full assistant message and usage statistics.

Streaming is preferred for lower latency and better user experience.
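For the non-streaming case, a response body can be unpacked as in the sketch below. The field layout (choices[0].message.content and a top-level usage object) is an assumption based on the OpenAI Chat Completions format the API says it is compatible with; the sample body is illustrative, not captured output, and parse_completion is a hypothetical helper name.

```python
import json


def parse_completion(body: str):
    """Return (assistant_text, usage) from a non-streaming response body."""
    data = json.loads(body)
    # Assumed OpenAI-style layout: the full assistant message sits under
    # choices[0].message, and usage statistics sit at the top level.
    message = data["choices"][0]["message"]
    return message["content"], data.get("usage")


# Illustrative body only; real responses may carry additional fields.
sample_body = json.dumps({
    "id": "example",
    "model": "claude-sonnet-4.6",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello there!"}}
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
})
reply, reply_usage = parse_completion(sample_body)
```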


Supported Features (MVP)

  • Chat messages (system / user / assistant, including multi-turn)
  • Vision / image inputs (on models that support them)
  • Streaming + non-streaming
  • max_tokens, temperature, top_p, stop
  • Token usage reporting in the final chunk

Not Yet Supported (or Limited)

  • Tool calling / function calling (on the roadmap as part of the agentic work)
  • JSON mode / response format enforcement
  • Logprobs
  • Seed / deterministic sampling guarantees
  • Audio / speech inputs (supported on certain models; availability varies)

Pricing (Elastic)

Pricing is transparent and usage-based. You are charged per million input and output tokens (or per image for vision models).

Current pricing for each model is visible in:

  • The model selector in the dashboard
  • The Supported Models page
  • The final usage chunk (we log estimated cost in the proxy)
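To make the per-million-token arithmetic concrete, here is a small sketch that estimates the cost of one request from its usage object. The prices in the example are hypothetical placeholders, not Hypervize rates, and estimate_cost is an illustrative helper name; per-image pricing for vision models is not covered.

```python
def estimate_cost(usage: dict, input_price_per_m: float,
                  output_price_per_m: float) -> float:
    """Estimate request cost from token counts and per-million-token prices."""
    input_cost = usage["prompt_tokens"] / 1_000_000 * input_price_per_m
    output_cost = usage["completion_tokens"] / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
example_usage = {"prompt_tokens": 1_000_000, "completion_tokens": 500_000}
cost = estimate_cost(example_usage, input_price_per_m=3.00, output_price_per_m=15.00)
```

With these placeholder prices the input side contributes 3.00 and the output side 7.50. Look up the real per-model prices in the dashboard or on the Supported Models page before relying on any estimate.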

Error Handling

All errors return a JSON body with an error field:

JSON
{
  "error": "Invalid API key"
}

Common status codes:

  • 401: Missing or invalid Authorization header or key
  • 403: The key is valid, but the account has no active inference keys, or another permission check failed
  • 500: Internal error (rare; check the status page or contact support)

Full error reference: Errors & Troubleshooting
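One way to surface these errors in client code is to convert any error status into an exception that carries the message from the error field. This is a sketch under stated assumptions: HypervizeAPIError and raise_for_error are hypothetical names, not part of any Hypervize SDK, and the fallback to the raw body covers the case where an intermediary returns a non-JSON error page.

```python
import json


class HypervizeAPIError(Exception):
    """Hypothetical exception type carrying the HTTP status and error message."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def raise_for_error(status: int, body: str) -> None:
    """Raise HypervizeAPIError for 4xx/5xx responses; do nothing otherwise."""
    if status < 400:
        return
    try:
        message = json.loads(body).get("error", "Unknown error")
    except (json.JSONDecodeError, AttributeError):
        # Body was not a JSON object; fall back to the raw text.
        message = body or "Unknown error"
    raise HypervizeAPIError(status, str(message))
```

A caller would invoke raise_for_error(status, body) immediately after reading the response and before attempting to parse a completion.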


Best Practices

  1. Always stream in production UIs.
  2. Handle the usage object for cost attribution.
  3. Set max_tokens explicitly to a reasonable value; defaults differ from model to model.
  4. Use the display names from the catalog rather than raw provider IDs for future-proofing.
  5. Implement exponential backoff on 429/5xx responses.
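The backoff recommendation in item 5 could be sketched as follows. The helper with_backoff is a hypothetical name, and the attempt count, base delay, and jitter range are illustrative defaults rather than documented values. The send callable is assumed to perform one request and return a (status, body) pair.

```python
import random
import time
from typing import Callable, Tuple


def with_backoff(send: Callable[[], Tuple[int, str]], max_attempts: int = 5,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts - 1:
            return status, body
        # The delay doubles on each attempt; random jitter spreads out
        # retries from clients that failed at the same moment.
        delay = base_delay * (2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
```

The sleep parameter exists so the wait can be replaced in tests. In production the default time.sleep is used, and the last response is returned to the caller if every attempt fails.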
