INFERENCE DOCUMENTATION

Code Examples

Ready-to-use examples for calling Hypervize Inference from various languages and frameworks.

All examples use the Elastic endpoint. For a Dedicated endpoint, change the URL path to /api/d/{endpoint-id}/chat/completions and make sure your credentials are valid for that endpoint.
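As a minimal sketch of the URL switch described above, the helper below builds either form. The function names and the sample endpoint ID are hypothetical. The host and paths are copied from this page. The sdk_base_url variant assumes that an OpenAI-compatible SDK appends /chat/completions to whatever base URL it is given, which is how the OpenAI SDK behaves with the Elastic base URL used elsewhere on this page.

```python
from typing import Optional

# Host and /api prefix as used by the examples on this page.
BASE = "https://hypervize.tech/api"


def chat_completions_url(endpoint_id: Optional[str] = None) -> str:
    """Return the Elastic URL, or the Dedicated URL when endpoint_id is given."""
    if endpoint_id is None:
        return f"{BASE}/chat/completions"
    return f"{BASE}/d/{endpoint_id}/chat/completions"


def sdk_base_url(endpoint_id: Optional[str] = None) -> str:
    """Base URL for SDK clients that append /chat/completions themselves.

    Assumption: Dedicated endpoints follow the same pattern as Elastic.
    """
    if endpoint_id is None:
        return BASE
    return f"{BASE}/d/{endpoint_id}"
```

For raw HTTP clients (cURL, fetch) use chat_completions_url; for SDK clients pass sdk_base_url as the base URL.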


cURL (Streaming)

BASH
curl -N https://hypervize.tech/api/chat/completions \
  -H "Authorization: Bearer $HVZ_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-large-3",
    "messages": [{"role": "user", "content": "Explain RAG in 3 bullet points"}],
    "max_tokens": 300,
    "stream": true
  }'

Python (OpenAI SDK — Recommended)

PYTHON
from openai import OpenAI

client = OpenAI(
    base_url="https://hypervize.tech/api",
    api_key="hvz_live_..."
)

stream = client.chat.completions.create(
    model="claude-sonnet-4.6",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    max_tokens=150,
    stream=True,
)

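for chunk in stream:
    # Some chunks (for example a final usage-only chunk) may carry no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)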

JavaScript / TypeScript (fetch)

TS
const response = await fetch("https://hypervize.tech/api/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.HVZ_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "llama-3-3-70b-instruct",
    messages: [{ role: "user", content: "Hello" }],
    stream: true,
  }),
});

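if (!response.ok || !response.body) {
  throw new Error(`Request failed with status ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  const chunk = decoder.decode(value, { stream: true });
  // Each chunk holds raw SSE text; payload lines start with "data: "
  console.log(chunk);
}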
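The parsing step that the comment above leaves open can be sketched as follows, in Python to match the other examples on this page. This is an illustration and not the service's reference parser. It assumes the stream follows the OpenAI chunk format: each payload line is JSON, text arrives in choices[].delta.content, token counts appear in a usage field, and the stream ends with a "data: [DONE]" line. The function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Tuple


def iter_sse_json(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield the decoded JSON payload of each SSE "data:" line.

    Expects complete lines; a caller reading raw network chunks must
    buffer and split on newlines first.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream marker
            break
        yield json.loads(payload)


def collect(events: Iterable[Dict]) -> Tuple[str, Optional[Dict]]:
    """Join all content deltas and keep the last usage object seen."""
    parts = []
    usage = None
    for event in events:
        for choice in event.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
        if event.get("usage"):
            usage = event["usage"]
    return "".join(parts), usage
```

Feeding the lines of a streamed response through collect(iter_sse_json(lines)) yields the full text plus whatever usage object the stream reported, or None if it reported none.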

LangChain (Python)

PYTHON
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://hypervize.tech/api",
    api_key="hvz_live_...",
    model="qwen3-32b",
)

response = llm.invoke("Explain why streaming matters for LLMs")
print(response.content)

LlamaIndex

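LlamaIndex works through the same OpenAI-compatible base URL. Its stock OpenAI class looks up model names in a built-in table of OpenAI models, so for other model names the OpenAILike class (from the llama-index-llms-openai-like package) is usually the safer choice:

PYTHON
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://hypervize.tech/api",
    api_key="hvz_live_...",
    model="nemotron-3-super-120b",
    is_chat_model=True,  # send requests to the chat completions route
)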

Notes for Production Clients

  • Always set a reasonable timeout / read_timeout.
  • Implement retry logic with exponential backoff on 429 / 5xx.
  • Parse usage from the final chunk for cost tracking.
  • Prefer the official OpenAI SDK when possible — it handles SSE edge cases well.
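
The retry advice above can be sketched with the standard library alone. This is one possible shape and not an official client: the function names, the default of five attempts, the 0.5 s base delay, and the 30 s cap are all illustrative assumptions. The send argument stands in for whatever function performs one HTTP request (with its own timeout set) and returns a (status, body) pair.

```python
import random
import time
from typing import Any, Callable, Tuple


def is_retryable(status: int) -> bool:
    """Retry on rate limiting (429) and on server errors (5xx)."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound of the wait before the next try: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send() until it returns a non-retryable status or attempts run out.

    Returns the last (status, body) pair either way, so the caller decides
    how to handle a final failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status) or attempt == max_attempts - 1:
            return status, body
        # Full jitter: wait a random time up to the exponential bound,
        # so that many clients do not retry in lockstep.
        sleep(random.uniform(0, backoff_delay(attempt)))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the loop can be exercised in tests without real waiting. If the server returns a Retry-After header on 429 responses, honoring it in place of the computed delay is generally preferable.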

More Examples

Need an example for a specific framework (Vercel AI SDK, AutoGen, CrewAI, etc.)? Let us know — we are rapidly expanding this section.
