MVP LAUNCH

Hypervize Inference.

Production-grade inference. Two products. One API.

Hypervize Inference Overview

Hypervize Inference gives you two complementary ways to run large language models in production:

ProductTypeBest ForProvisioningScalingPricing Model
Elastic InferenceServerlessPrototyping, variable traffic, no opsInstantAutomaticPer-token
Dedicated InferenceSelf-hostedProduction workloads, cost control, custom models1–5 minutesAuto-scalingPer-GPU-hour

Both products expose the exact same OpenAI-compatible Chat Completions API (/v1/chat/completions style) with streaming support.


Elastic Inference

  • Powered by Hypervize’s global serverless inference network.
  • 37 models from Anthropic, Meta, Mistral, DeepSeek, Qwen, NVIDIA, Google, Stability AI, and others (see full catalog in Supported Models).
  • No infrastructure to manage. Pay only for what you use.
  • Ideal for getting started quickly or handling spiky/unpredictable traffic.

Endpoint: POST https://hypervize.tech/api/chat/completions


Dedicated Inference

  • You provision private dedicated endpoints running on Hypervize-managed high-performance inference infrastructure.
  • Full control over the model (any Hugging Face model or your own weights).
  • Predictable hourly pricing based on GPU type and count.
  • Supports auto-scaling including true scale-to-zero (set Min Nodes = 0). Endpoints show "SLEEPING" when hibernated and "WAKING UP" when a request brings them back online. Cold starts are fast thanks to model caching.
  • Best for steady production traffic, fine-tuned models, or when you want to optimize cost per token at scale.

Endpoint pattern: POST https://hypervize.tech/api/d/{your-endpoint-id}/chat/completions


Unified Experience

Regardless of which product you use:

  • One API format — standard OpenAI messages, streaming (SSE), usage objects.
  • One authentication model — API keys generated in the dashboard.
  • One dashboard — manage keys, view usage, deploy dedicated endpoints, and test models.
  • Consistent metering — all traffic flows through Hypervize’s proxy layer for logging and future billing.

When to Choose Which

Use Elastic when you want:

  • Zero ops
  • Access to frontier closed models (Claude, etc.) and the latest open models
  • To experiment with many models quickly
  • Variable or low traffic

Use Dedicated when you want:

  • Lowest cost per token at high volume
  • Private endpoints (or public with your own auth layer)
  • Specific model versions or fine-tunes
  • Predictable spend and the ability to scale out on your own schedule
  • Custom context lengths or inference parameters not exposed in the serverless catalog

Many teams start on Elastic and later move production workloads to Dedicated (or run both).


Current MVP Scope

This documentation covers the MVP launch of the inference products only.

Included:

  • Elastic inference via the unified proxy
  • Self-service dedicated endpoint provisioning with intelligent hardware selection and auto-scaling
  • API key management
  • Dashboard tooling for testing and monitoring
  • Streaming responses with usage data

Not yet included (or intentionally de-scoped for MVP):

  • Full Stripe billing & automatic charges (metering is logged; enforcement is in progress)
  • Hard free-tier query limits (column exists in DB; proxy enforcement pending)
  • One-click RAG / tool packages (see agentic roadmap)
  • Public marketplace of shared endpoints
  • CLI or Terraform provider (planned)

Next Steps

  1. Get an API key →
  2. Make your first call in 2 minutes →
  3. Explore Elastic Inference API or Dedicated Inference Endpoints
TEXT

Now quickstart — this is the most important file for launch conversion.