INFERENCE DOCUMENTATION
Code Examples
Ready-to-use examples for calling Hypervize Inference from various languages and frameworks.
All examples use the Elastic endpoint. For Dedicated, change the URL to /api/d/{endpoint-id}/chat/completions and use credentials that are valid for that endpoint.
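As a rough illustration of the Elastic/Dedicated switch described above, the helper below builds both URL forms. The function names and the `ep_123` id are hypothetical. The idea that an OpenAI-style client would take `https://hypervize.tech/api/d/{endpoint-id}` as its base URL is an assumption inferred from the Elastic examples, where the SDK appends `/chat/completions` to the base URL.

```python
from typing import Optional
from urllib.parse import quote

API_ROOT = "https://hypervize.tech/api"


def base_url(endpoint_id: Optional[str] = None) -> str:
    """Base URL for Elastic (no id) or Dedicated (with an endpoint id)."""
    if endpoint_id is None:
        return API_ROOT
    # Percent-encode the id so it stays a single path segment.
    return f"{API_ROOT}/d/{quote(endpoint_id, safe='')}"


def chat_completions_url(endpoint_id: Optional[str] = None) -> str:
    """Full chat completions URL, as used by raw HTTP clients (cURL, fetch)."""
    return f"{base_url(endpoint_id)}/chat/completions"


if __name__ == "__main__":
    print(chat_completions_url())          # Elastic
    print(chat_completions_url("ep_123"))  # Dedicated, hypothetical id
```

With the SDK-based examples, the value of `base_url(...)` would go into `base_url=` (or `api_base=` for LlamaIndex), assuming Dedicated endpoints follow the same path convention.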
cURL (Streaming)
BASH
curl -N https://hypervize.tech/api/chat/completions \
  -H "Authorization: Bearer $HVZ_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-large-3",
    "messages": [{"role": "user", "content": "Explain RAG in 3 bullet points"}],
    "max_tokens": 300,
    "stream": true
  }'
Python (OpenAI SDK, Recommended)
PYTHON
from openai import OpenAI

# Point the SDK at the OpenAI-compatible base URL.
client = OpenAI(
    base_url="https://hypervize.tech/api",
    api_key="hvz_live_...",
)

stream = client.chat.completions.create(
    model="claude-sonnet-4.6",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    max_tokens=150,
    stream=True,
)

# Print text as it arrives; skip chunks that carry no choices or no content.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
JavaScript / TypeScript (fetch)
TS
const response = await fetch("https://hypervize.tech/api/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.HVZ_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "llama-3-3-70b-instruct",
    messages: [{ role: "user", content: "Hello" }],
    stream: true,
  }),
});

// Fail early on HTTP errors or a missing body instead of reading an empty stream.
if (!response.ok || !response.body) {
  throw new Error(`Request failed with status ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries.
  const chunk = decoder.decode(value, { stream: true });
  // Parse SSE lines starting with "data: "
  console.log(chunk);
}
LangChain (Python)
PYTHON
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://hypervize.tech/api",
    api_key="hvz_live_...",
    model="qwen3-32b",
)

response = llm.invoke("Explain why streaming matters for LLMs")
print(response.content)
LlamaIndex
LlamaIndex works via the same OpenAI-compatible base URL. Configure the LLM like this:
PYTHON
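from llama_index.llms.openai import OpenAI

# If this class rejects a model name it does not recognize, OpenAILike
# (from the llama-index-llms-openai-like package) takes the same
# api_base, api_key, and model arguments.
llm = OpenAI(
    api_base="https://hypervize.tech/api",
    api_key="hvz_live_...",
    model="nemotron-3-super-120b",
)
Notes for Production Clients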
- Always set a reasonable timeout/read_timeout.
- Implement retry logic with exponential backoff on 429 / 5xx.
- Parse usage from the final chunk for cost tracking.
- Prefer the official OpenAI SDK when possible — it handles SSE edge cases well.
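The retry and usage-tracking notes above can be sketched with the standard library alone. This is a minimal illustration, not a reference client: `HTTPStatusError`, `call_with_backoff`, and `read_stream` are hypothetical names, the stream is assumed to follow the OpenAI chunk format (`data: {...}` lines ending with `data: [DONE]`), and the `usage` object is assumed to arrive on the final chunk as described in the list.

```python
import json
import random
import time
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


class HTTPStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    """Retry only on rate limiting (429) and server errors (5xx)."""
    return status == 429 or 500 <= status < 600


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HTTPStatusError as err:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(err.status):
                raise
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out clients that were throttled at the same time.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")


def read_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect streamed text and the usage object from complete SSE lines."""
    parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if choices:
            content = (choices[0].get("delta") or {}).get("content")
            if content:
                parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # assumed to appear on the final chunk
    return "".join(parts), usage
```

In practice `fn` would wrap a single HTTP request that has its own timeout set, and `read_stream` would be fed lines that have already been split on newlines. The official OpenAI SDK covers most of this through its own `timeout` and `max_retries` client options, which is one reason to prefer it.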
More Examples
Need an example for a specific framework (Vercel AI SDK, AutoGen, CrewAI, etc.)? Let us know — we are rapidly expanding this section.