Elastic Inference API

Elastic Inference is Hypervize’s serverless inference offering. You send standard chat completion requests, and we route them across our global inference network; there is no infrastructure for you to provision or manage.

Endpoint

TEXT

POST https://hypervize.tech/api/chat/completions

All requests must include a valid Authorization: Bearer hvz_live_... header (or be made from the dashboard while logged in with at least one active key).
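As a rough illustration, the endpoint and header above could be combined in Python using only the standard library. The helper name `build_request` and the placeholder key are assumptions made for this sketch, not part of the Hypervize API, and the `Content-Type` header is assumed because the body is JSON. The request is constructed but not sent.

```python
import json
import urllib.request

ENDPOINT = "https://hypervize.tech/api/chat/completions"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the chat completions endpoint.

    `build_request` is a hypothetical helper name used for illustration only.
    """
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumed: the body is JSON, so we label it as such.
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key; substitute a real hvz_live_... key from your account.
req = build_request(
    "hvz_live_example",
    {"model": "claude-sonnet-4.6", "messages": [{"role": "user", "content": "Hello!"}]},
)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access or a real key.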

Request Format

The API is OpenAI Chat Completions compatible. The following fields are currently supported:

JSON

{
  "model": "claude-sonnet-4.6",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": true,
  "stop": ["\n\n"]
}
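One way to stay within the supported field list is to validate the payload on the client before sending it. The sketch below is an assumption about how a caller might do this; `make_payload` and `SUPPORTED_FIELDS` are hypothetical names, and how the API itself reacts to unsupported fields is not specified here.

```python
# Field names taken from the request example above.
SUPPORTED_FIELDS = {
    "model", "messages", "max_tokens", "temperature", "top_p", "stream", "stop",
}


def make_payload(model: str, messages: list, **options) -> dict:
    """Return a request body, rejecting any option outside the supported list.

    Hypothetical client-side helper; it does not call the API.
    """
    unsupported = sorted(set(options) - SUPPORTED_FIELDS)
    if unsupported:
        raise ValueError("Unsupported fields: " + ", ".join(unsupported))
    return {"model": model, "messages": messages, **options}


payload = make_payload(
    "claude-sonnet-4.6",
    [{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
    stream=True,
)
```

Failing early on the client is a design choice: it surfaces typos and not-yet-supported options (for example `tools`) before a request is ever sent.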

Important: Model Name Handling

You should generally use the human-friendly display names from the catalog (e.g., claude-sonnet-4.6, llama-3-3-70b-instruct).

The Hypervize proxy will automatically translate these to the correct backend identifiers. You can also pass the raw provider identifier if you prefer.

See Supported Models for the full list with pricing and context windows.

Streaming Responses (Recommended)

Hypervize Inference strongly encourages streaming for all production use cases.

Responses are standard Server-Sent Events:

TEXT

data: {"id":"...","choices":[{"delta":{"content":"Hello"},"index":0}],"model":"...","created":...}

data: {"id":"...","choices":[{"delta":{"content":" world"},"index":0}],"model":"...","created":...}

...

data: {"usage":{"prompt_tokens":12,"completion_tokens":48,"total_tokens":60}}

data: [DONE]

The final usage object (when present) contains the token counts used for metering.
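The event format above can be consumed with a small line-based parser. This is a minimal sketch, assuming the stream has already been split into text lines; `collect_stream` is a hypothetical helper name, and the sample lines mirror the events shown above rather than real API output.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join streamed content deltas and capture the usage object, if one arrives."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(data)
        if event.get("usage"):
            usage = event["usage"]
        for choice in event.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), usage


sample = [
    'data: {"id":"x","choices":[{"delta":{"content":"Hello"},"index":0}]}',
    "",
    'data: {"id":"x","choices":[{"delta":{"content":" world"},"index":0}]}',
    "",
    'data: {"usage":{"prompt_tokens":12,"completion_tokens":48,"total_tokens":60}}',
    "",
    "data: [DONE]",
]
text, usage = collect_stream(sample)
```

Because the usage object is only present some of the time, the function returns `None` for it when no usage event was seen, and callers should handle that case.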

Non-Streaming Responses

To receive the whole reply at once, set "stream": false or omit the field. The API then returns a single JSON response containing the full assistant message and usage statistics.

Streaming is preferred for lower latency and better user experience.
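For the non-streaming case, reading the reply might look like the sketch below. The exact response shape is not shown on this page, so the layout (`choices[0].message.content` plus a top-level `usage`) is an assumption based on the OpenAI Chat Completions format the API is compatible with. `extract_reply` is a hypothetical helper, and the sample body is made up for illustration.

```python
import json
from typing import Optional, Tuple


def extract_reply(body: str) -> Tuple[str, Optional[dict]]:
    """Return (assistant text, usage) from a non-streaming response body.

    Assumes an OpenAI-style layout: choices[0].message.content and usage.
    """
    response = json.loads(body)
    message = response["choices"][0]["message"]
    return message["content"], response.get("usage")


# Illustrative body only; not captured from the live API.
sample_body = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello there!"}}
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
})
reply, usage = extract_reply(sample_body)
```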

Supported Features (MVP)

Chat messages (system / user / assistant, including multi-turn)
Vision / image inputs (on models that support them)
Streaming + non-streaming
max_tokens, temperature, top_p, stop
Token usage reporting in the final chunk

Not Yet Supported (or Limited)

Tool calling / function calling (on the roadmap as part of the agentic work)
JSON mode / response format enforcement
Logprobs
Seed / deterministic sampling guarantees
Audio / speech inputs (supported on certain models; availability varies)

Pricing (Elastic)

Pricing is transparent and usage-based. You are charged per million input and output tokens (or per image for vision models).

Current pricing for each model is visible in:

The model selector in the dashboard
The Supported Models page
The final usage chunk (we log estimated cost in the proxy)
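Per-million-token pricing makes it possible to estimate a request's cost from its usage object. The sketch below is a client-side estimate only; the prices are placeholders rather than real Hypervize rates, `estimate_cost` is a hypothetical helper, and per-image charges are not modeled.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def estimate_cost(usage: dict, input_price: Decimal, output_price: Decimal) -> Decimal:
    """Estimate cost from token counts and prices quoted per million tokens."""
    input_cost = Decimal(usage["prompt_tokens"]) * input_price / MILLION
    output_cost = Decimal(usage["completion_tokens"]) * output_price / MILLION
    return input_cost + output_cost


usage = {"prompt_tokens": 12, "completion_tokens": 48, "total_tokens": 60}
# Placeholder prices per million tokens; look up real values on the Supported Models page.
cost = estimate_cost(usage, input_price=Decimal("3.00"), output_price=Decimal("15.00"))
```

`Decimal` is used instead of floats so that very small per-request amounts add up without rounding drift when aggregated.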

Error Handling

All errors return a JSON body with an error field:

JSON

{
  "error": "Invalid API key"
}

Common status codes:

401 — Missing or invalid Authorization header / key
403 — Valid key but no inference keys on account, or other permission issue
500 — Internal error (rare — check status page or contact support)

Full error reference: Errors & Troubleshooting
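A client can turn the error body and status code into a single exception type. This is a minimal sketch; `InferenceError` and `parse_error` are hypothetical names, and the fallback for non-JSON bodies is an assumption added for robustness, not documented behavior.

```python
import json


class InferenceError(Exception):
    """Carries the HTTP status and the message from the `error` field."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def parse_error(status: int, body: str) -> InferenceError:
    """Build an InferenceError from a response status and raw body text."""
    try:
        message = json.loads(body).get("error", "Unknown error")
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object; fall back to the raw text.
        message = body or "Unknown error"
    return InferenceError(status, str(message))


err = parse_error(401, '{"error": "Invalid API key"}')
```

Keeping the status on the exception lets callers branch on it, for example treating 401 and 403 as configuration problems and 500 as retryable.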

Best Practices

Always stream in production UIs.
Handle the usage object for cost attribution.
Set max_tokens explicitly; default limits differ from model to model.
Use the display names from the catalog rather than raw provider IDs for future-proofing.
Implement exponential backoff on 429/5xx responses.
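The backoff recommendation can be sketched as a small retry wrapper. Everything here is illustrative: `call_with_backoff` is a hypothetical helper, `send` stands in for whatever function performs the HTTP call and returns `(status, body)`, and the delay values are arbitrary defaults rather than Hypervize guidance.

```python
import random
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """Retry on rate limiting (429) and server errors (5xx)."""
    return status == 429 or 500 <= status < 600


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status) or attempt == max_attempts - 1:
            return status, body
        # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Random jitter keeps many clients from retrying in lockstep.
        sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without real waiting. Non-retryable statuses such as 401 and 403 are returned immediately, since retrying will not fix a bad or unauthorized key.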
