INFERENCE DOCUMENTATION

Quickstart — First Inference Call

Get an API key and make your first streaming chat completion in under two minutes.

This guide gets you from zero to a working streaming inference call as fast as possible.


1. Create an Account & Get an API Key

  1. Go to https://hypervize.tech and sign in (or create an account) using Auth0.
    • A default inference-scoped API key is generated automatically on signup (via a database trigger).
  2. If needed, go to Settings → Keys (or directly to /dashboard/settings/keys) to view it or create additional keys.
  3. Give any new key a name (e.g., “Production – Elastic”) and choose the inference scope.
  4. Copy the key immediately. It will look like:

    TEXT
    hvz_live_3f8a9c2e1b4d5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c

Important: Treat this key like a password. It grants access to inference on your behalf.
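
One way to keep the key out of source code is to read it from an environment variable. The sketch below is only an illustration: the variable name HYPERVIZE_API_KEY is a hypothetical choice, and the hvz_live_ prefix check is an assumption based on the sample key above, not a documented guarantee.

```python
import os
from typing import Mapping

# Assumption: keys share the "hvz_live_" prefix seen in the sample key above.
KEY_PREFIX = "hvz_live_"
# Hypothetical variable name chosen for this example.
ENV_VAR = "HYPERVIZE_API_KEY"


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it looks wrong."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set")
    if not key.startswith(KEY_PREFIX):
        raise RuntimeError(f"{ENV_VAR} does not start with {KEY_PREFIX!r}")
    return key


def mask_key(key: str) -> str:
    """Shorten a key for log output so the secret part is never printed."""
    return key[: len(KEY_PREFIX) + 4] + "..."
```

Logging mask_key(key) rather than the key itself keeps the full secret out of terminals and log files.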

You now have everything needed for the Elastic Inference API.


2. Make Your First Call (cURL)

Replace YOUR_API_KEY with the key you just generated.

BASH
# -N turns off curl's output buffering so streamed chunks print as they arrive.
curl -N https://hypervize.tech/api/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4.6",
    "messages": [
      {"role": "user", "content": "Explain scale-to-zero inference in one paragraph."}
    ],
    "max_tokens": 256,
    "stream": true
  }'

You should see a stream of Server-Sent Events (SSE): a series of lines of the form data: {...}, one per chunk, followed by a final data: [DONE] line that marks the end of the stream.
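
The same call can be made from Python with only the standard library. This is a minimal sketch, not an official client: the helper names iter_sse_data and stream_chat are hypothetical. Because the JSON shape of each chunk is not described here, the code yields each parsed chunk as a plain dict without reading specific fields. It also assumes each event fits on a single data: line.

```python
import json
import urllib.request
from typing import Iterable, Iterator

API_URL = "https://hypervize.tech/api/chat/completions"


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until `data: [DONE]`."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def stream_chat(api_key: str, prompt: str,
                model: str = "claude-sonnet-4.6") -> Iterator[dict]:
    """Send the same request as the cURL example and yield parsed chunks."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # The response object iterates line by line as bytes arrive.
    with urllib.request.urlopen(request) as response:
        yield from iter_sse_data(response)
```

To try it, loop over stream_chat(your_key, "Hello") and print each chunk. Nothing is sent until the loop starts, because stream_chat is a generator.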


3. Try a Different Model

The model field accepts the display name from our catalog (recommended) or the raw provider identifier.

Popular starting models:

  • claude-sonnet-4.6
  • llama-3-3-70b-instruct
  • mistral-large-3
  • deepseek-v3.2
  • qwen3-32b

See the full list in Supported Models.
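
Since only the model field changes between these calls, switching models can be as small as swapping one string. The sketch below builds the request body used in the cURL example for each starter model. build_payload is a hypothetical helper name, and the defaults simply mirror the values shown earlier.

```python
from typing import Any, Dict, List

# Model names copied from the list above.
STARTER_MODELS: List[str] = [
    "claude-sonnet-4.6",
    "llama-3-3-70b-instruct",
    "mistral-large-3",
    "deepseek-v3.2",
    "qwen3-32b",
]


def build_payload(model: str, prompt: str,
                  max_tokens: int = 256, stream: bool = True) -> Dict[str, Any]:
    """Return a chat completion request body for the given model."""
    if not model:
        raise ValueError("model must be a non-empty string")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }


# One request body per starter model, identical except for "model".
payloads = [
    build_payload(m, "Explain scale-to-zero inference in one paragraph.")
    for m in STARTER_MODELS
]
```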


4. (Optional) Use the Dashboard Playground

  1. Go to Dashboard → Inference.
  2. Switch to the Elastic tab.
  3. Select a model from the grouped dropdown.
  4. Type a prompt and hit Send.

The playground shows token usage, time to first byte (TTFB), and total latency for each request, which is useful while you’re learning the catalog.
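
Similar timings can be approximated on the client side. The sketch below measures time to first chunk and total time for any lazily produced stream of chunks. It is an approximation under stated assumptions: time_stream is a hypothetical helper, and how the playground itself defines TTFB is not documented here, so the numbers may not match exactly.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_stream(
    chunks: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a stream and return (first_chunk_s, total_s, chunk_count).

    first_chunk_s is None when the stream produced no chunks.
    """
    start = clock()
    first: Optional[float] = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock() - start  # elapsed time when the first chunk arrived
        count += 1
    total = clock() - start
    return first, total, count
```

Pass in a generator that starts the request when iteration begins, so that the first measurement includes connection setup and any cold start.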


5. Next: Try a Dedicated Endpoint

Once you’re comfortable with Elastic:

  1. Go to Dashboard → Inference.
  2. Switch to the Dedicated tab.
  3. Enter a Hugging Face model ID (e.g., meta-llama/Llama-3.1-8B-Instruct).
  4. Configure min/max instances and deploy.

After provisioning completes (status changes to ONLINE), you can call it at:

TEXT
https://hypervize.tech/api/d/{endpoint-id}/chat/completions

Dedicated endpoints accept the same request format as Elastic, so only the URL changes.
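
Because only the URL differs, a client can switch between Elastic and Dedicated by changing how it builds the address. A minimal sketch follows: the function names are hypothetical, and ep_123 in the usage note is a made-up endpoint ID used purely for illustration.

```python
from urllib.parse import quote

BASE_URL = "https://hypervize.tech/api"


def elastic_url() -> str:
    """URL of the shared Elastic chat completions route."""
    return f"{BASE_URL}/chat/completions"


def dedicated_url(endpoint_id: str) -> str:
    """URL of a dedicated endpoint's chat completions route."""
    if not endpoint_id:
        raise ValueError("endpoint_id must be a non-empty string")
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{BASE_URL}/d/{quote(endpoint_id, safe='')}/chat/completions"
```

For example, dedicated_url("ep_123") returns https://hypervize.tech/api/d/ep_123/chat/completions, and the request body stays the same as for Elastic.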


Troubleshooting First Calls

Symptom | Likely Cause | Fix
401 Invalid API key | Key not copied correctly or revoked | Regenerate in dashboard
403 You must generate an API key first to use inference. | Session user has no active inference keys (note: a default key is auto-generated on signup) | Create or reactivate an inference-scoped key in Settings → Keys
403 Unauthorized dedicated | Using someone else’s endpoint ID | Only use IDs you own
No streaming / empty response | Client not handling SSE correctly | Use stream: true and read the stream
Slow first token | Cold start on a large model | Use smaller models for testing or provision a dedicated endpoint
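
The HTTP rows of the table can be turned into a small client-side helper that maps a failed response to the suggested fix. This is a sketch under assumptions: hint_for and describe are hypothetical names, and since the exact structure of the error body is not specified here, the code only looks for the word "dedicated" in the raw body text to tell the two 403 cases apart.

```python
import urllib.error


def hint_for(status: int, body: str = "") -> str:
    """Return the fix suggested by the troubleshooting table, if any."""
    text = body.lower()
    if status == 401:
        return "Regenerate the key in the dashboard."
    if status == 403 and "dedicated" in text:
        return "Only use endpoint IDs you own."
    if status == 403:
        return "Create or reactivate an inference-scoped key in Settings → Keys."
    return "No hint for this status; inspect the response body."


def describe(err: urllib.error.HTTPError) -> str:
    """Format an HTTPError raised by urllib as 'status: hint'."""
    body = err.read().decode("utf-8", errors="replace")
    return f"{err.code}: {hint_for(err.code, body)}"
```

A caller would wrap the request in try/except urllib.error.HTTPError and print describe(err) in the handler.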
