Hypervize Inference Overview

Hypervize Inference gives you two complementary ways to run large language models in production:

Product	Type	Best For	Provisioning	Scaling	Pricing Model
Elastic Inference	Serverless	Prototyping, variable traffic, no ops	Instant	Automatic	Per-token
Dedicated Inference	Self-hosted	Production workloads, cost control, custom models	1–5 minutes	Auto-scaling	Per-GPU-hour

Both products expose the exact same OpenAI-compatible Chat Completions API (/v1/chat/completions style) with streaming support.

Elastic Inference

Powered by Hypervize’s global serverless inference network.
37 models from Anthropic, Meta, Mistral, DeepSeek, Qwen, NVIDIA, Google, Stability AI, and others (see full catalog in Supported Models).
No infrastructure to manage. Pay only for what you use.
Ideal for getting started quickly or handling spiky/unpredictable traffic.

Endpoint: POST https://hypervize.tech/api/chat/completions

You provision private dedicated endpoints running on Hypervize-managed high-performance inference infrastructure.
Full control over the model (any Hugging Face model or your own weights).
Predictable hourly pricing based on GPU type and count.
Supports auto-scaling including true scale-to-zero (set Min Nodes = 0). Endpoints show "SLEEPING" when hibernated and "WAKING UP" when a request brings them back online. Cold starts are fast thanks to model caching.
Best for steady production traffic, fine-tuned models, or when you want to optimize cost per token at scale.

Endpoint pattern: POST https://hypervize.tech/api/d/{your-endpoint-id}/chat/completions

Regardless of which product you use:

One API format — standard OpenAI messages, streaming (SSE), usage objects.
One authentication model — API keys generated in the dashboard.
One dashboard — manage keys, view usage, deploy dedicated endpoints, and test models.
Consistent metering — all traffic flows through Hypervize’s proxy layer for logging and future billing.

Use Elastic when you want:

Use Dedicated when you want:

Lowest cost per token at high volume
Private endpoints (or public with your own auth layer)
Specific model versions or fine-tunes
Predictable spend and the ability to scale out on your own schedule
Custom context lengths or inference parameters not exposed in the serverless catalog

Many teams start on Elastic and later move production workloads to Dedicated (or run both).

This documentation covers the MVP launch of the inference products only.

Included:

Elastic inference via the unified proxy
Self-service dedicated endpoint provisioning with intelligent hardware selection and auto-scaling
API key management
Dashboard tooling for testing and monitoring
Streaming responses with usage data

Not yet included (or intentionally de-scoped for MVP):

Full Stripe billing & automatic charges (metering is logged; enforcement is in progress)
Hard free-tier query limits (column exists in DB; proxy enforcement pending)
One-click RAG / tool packages (see agentic roadmap)
Public marketplace of shared endpoints
CLI or Terraform provider (planned)

TEXT


Now quickstart — this is the most important file for launch conversion.