MVP LAUNCH
Hypervize Inference.
Production-grade inference. Two products. One API.
GET STARTED
Quickstart →
First call in under 2 minutes
SERVERLESS
Elastic Inference →
37 models, zero infrastructure
DEDICATED
Dedicated Endpoints →
Private capacity with auto-scaling
Hypervize Inference Overview
Hypervize Inference gives you two complementary ways to run large language models in production:
| Product | Type | Best For | Provisioning | Scaling | Pricing Model |
|---|---|---|---|---|---|
| Elastic Inference | Serverless | Prototyping, variable traffic, no ops | Instant | Automatic | Per-token |
| Dedicated Inference | Dedicated (private endpoints on Hypervize-managed infrastructure) | Production workloads, cost control, custom models | 1–5 minutes | Auto-scaling, including scale-to-zero | Per-GPU-hour |
Both products expose the same OpenAI-compatible Chat Completions API (the request and response format of OpenAI's /v1/chat/completions), with streaming support. Only the endpoint URL differs between the two.
Elastic Inference
- Powered by Hypervize’s global serverless inference network.
- 37 models from Anthropic, Meta, Mistral, DeepSeek, Qwen, NVIDIA, Google, Stability AI, and others (see full catalog in Supported Models).
- No infrastructure to manage. Pay only for what you use.
- Ideal for getting started quickly or handling spiky/unpredictable traffic.
Endpoint: POST https://hypervize.tech/api/chat/completions
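As a rough illustration of a call to this endpoint, the sketch below builds an OpenAI-style Chat Completions request using only the Python standard library. Several details are assumptions rather than documented behavior: the `Authorization: Bearer` header scheme (borrowed from the OpenAI convention), the helper names `build_chat_request` and `send_chat_request`, and any model ID you pass in, which should come from Supported Models.

```python
import json
import urllib.request

ELASTIC_URL = "https://hypervize.tech/api/chat/completions"


def build_chat_request(api_key, model, messages, stream=False, url=ELASTIC_URL):
    """Build (but do not send) an OpenAI-style Chat Completions request."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the API key is sent as a Bearer token, as in the OpenAI API.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(req, timeout=60):
    """Send a non-streaming request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

To try it, pass a key generated in the dashboard and a catalog model ID to `build_chat_request`, then hand the result to `send_chat_request`.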
Dedicated Inference
- You provision private dedicated endpoints running on Hypervize-managed high-performance inference infrastructure.
- Full control over the model (any Hugging Face model or your own weights).
- Predictable hourly pricing based on GPU type and count.
- Supports auto-scaling including true scale-to-zero (set Min Nodes = 0). Endpoints show "SLEEPING" when hibernated and "WAKING UP" when a request brings them back online. Cold starts are fast thanks to model caching.
- Best for steady production traffic, fine-tuned models, or when you want to optimize cost per token at scale.
Endpoint pattern: POST https://hypervize.tech/api/d/{your-endpoint-id}/chat/completions
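The sketch below fills in the endpoint pattern and shows one way a client might tolerate a scale-to-zero cold start. This page does not say whether a request that wakes a sleeping endpoint is held open or rejected. The code assumes the second case and treats HTTP 503 as "still waking up", which is purely an assumption. The names `dedicated_url` and `call_with_wake_retry` are hypothetical helpers, not part of any Hypervize SDK.

```python
import time

DEDICATED_URL_TEMPLATE = "https://hypervize.tech/api/d/{endpoint_id}/chat/completions"


def dedicated_url(endpoint_id):
    """Fill the documented endpoint pattern with a concrete endpoint ID."""
    return DEDICATED_URL_TEMPLATE.format(endpoint_id=endpoint_id)


def call_with_wake_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out.

    send: zero-argument callable returning (status_code, body).
    Assumption: a waking endpoint answers 503; adjust to the real behavior.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    return status, body
```

The `send` callable can wrap whatever HTTP client you already use. Injecting `sleep` keeps the retry logic testable without real delays.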
Unified Experience
Regardless of which product you use:
- One API format: standard OpenAI messages, streaming (SSE), `usage` objects.
- One authentication model: API keys generated in the dashboard.
- One dashboard: manage keys, view usage, deploy dedicated endpoints, and test models.
- Consistent metering: all traffic flows through Hypervize’s proxy layer for logging and future billing.
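Because both products stream over SSE in the OpenAI format, a single parser can serve either one. The following is a minimal sketch, assuming the stream follows the usual OpenAI conventions: `data:` lines carrying JSON chunks with `choices[].delta.content`, a `usage` object on a late chunk, and a `data: [DONE]` terminator. The function name `parse_sse_stream` is illustrative.

```python
import json


def parse_sse_stream(lines):
    """Collect streamed text and the usage object from an iterable of SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # OpenAI-style end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(text_parts), usage
```

An HTTP response object that yields lines (for example, the result of `urllib.request.urlopen`) can be passed in directly as `lines`.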
When to Choose Which
Use Elastic when you want:
- Zero ops
- Access to frontier closed models (Claude, etc.) and the latest open models
- To experiment with many models quickly
- Variable or low traffic
Use Dedicated when you want:
- Lowest cost per token at high volume
- Private endpoints (or public with your own auth layer)
- Specific model versions or fine-tunes
- Predictable spend and the ability to scale out on your own schedule
- Custom context lengths or inference parameters not exposed in the serverless catalog
Many teams start on Elastic and later move production workloads to Dedicated (or run both).
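The cost trade-off between per-token and per-GPU-hour pricing comes down to a break-even throughput. The sketch below works through that arithmetic. It assumes a single blended per-token price and a GPU billed for the full hour (no scale-to-zero savings). All prices in the usage note are hypothetical, so substitute real figures from your dashboard.

```python
def dedicated_cost_per_million(gpu_hourly_usd, gpu_count, tokens_per_hour):
    """Effective USD per million tokens on a dedicated endpoint at a given throughput."""
    return gpu_hourly_usd * gpu_count * 1_000_000 / tokens_per_hour


def breakeven_tokens_per_hour(gpu_hourly_usd, gpu_count, serverless_usd_per_million):
    """Hourly token volume above which dedicated is cheaper than per-token pricing."""
    return gpu_hourly_usd * gpu_count * 1_000_000 / serverless_usd_per_million
```

For instance, with a hypothetical $2.00 per GPU-hour on one GPU and a hypothetical $0.50 per million tokens on Elastic, `breakeven_tokens_per_hour(2.0, 1, 0.50)` gives 4,000,000 tokens per hour. Below that sustained volume, per-token pricing costs less. Above it, the dedicated endpoint does.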
Current MVP Scope
This documentation covers the MVP launch of the inference products only.
Included:
- Elastic inference via the unified proxy
- Self-service dedicated endpoint provisioning with intelligent hardware selection and auto-scaling
- API key management
- Dashboard tooling for testing and monitoring
- Streaming responses with usage data
Not yet included (or intentionally de-scoped for MVP):
- Full Stripe billing & automatic charges (metering is logged; enforcement is in progress)
- Hard free-tier query limits (column exists in DB; proxy enforcement pending)
- One-click RAG / tool packages (see agentic roadmap)
- Public marketplace of shared endpoints
- CLI or Terraform provider (planned)
Next Steps
- Get an API key →
- Make your first call in 2 minutes →
- Explore Elastic Inference API or Dedicated Inference Endpoints