Dedicated Inference Endpoints

Dedicated Inference lets you run any Hugging Face model (or your own weights) on private, Hypervize-managed GPU capacity with full control over hardware tier, scaling behavior, and access policies.

How It Works

You provide a Hugging Face model ID (and optional HF token for gated models).
Hypervize analyzes the model size and selects an appropriate GPU hardware tier.
We provision dedicated capacity on that hardware, running our optimized inference runtime.
Auto-scaling policies are attached (target tracking on invocations, plus configurable min/max nodes).
You receive a permanent endpoint ID and a stable URL.

After creation, the endpoint behaves like any other OpenAI-compatible inference endpoint.
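To make "target tracking on invocations" concrete, here is a minimal sketch of the general target-tracking idea: scale the node count in proportion to how far the observed per-node load is from a target, then clamp to the configured min/max. The function name, its parameters, and the exact formula are assumptions for illustration; the actual Hypervize policy may differ.

```python
import math


def desired_nodes(current_nodes: int, invocations_per_node: float,
                  target_per_node: float, min_nodes: int, max_nodes: int) -> int:
    """Illustrative target-tracking rule (hypothetical, not the real policy)."""
    if target_per_node <= 0:
        raise ValueError("target_per_node must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    # Resize the fleet proportionally so per-node load moves back to the target.
    raw = math.ceil(current_nodes * invocations_per_node / target_per_node)
    # Never go outside the configured bounds.
    return max(min_nodes, min(max_nodes, raw))
```

With a target of 100 invocations per node, two nodes each seeing 150 would scale to three, and any result is capped at the configured maximum.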

Endpoint URL Pattern

TEXT

POST https://hypervize.tech/api/d/{endpoint-id}/chat/completions

Example:

TEXT

https://hypervize.tech/api/d/endpt-a1b2c3d4e5f6/chat/completions

This is the same format whether the endpoint is private (requires your API key) or public.
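As a small illustration of the pattern above, the URL can be built from an endpoint ID like this. The helper name is hypothetical; only the base URL and path come from the pattern shown.

```python
from urllib.parse import quote

BASE_URL = "https://hypervize.tech/api/d"


def chat_completions_url(endpoint_id: str) -> str:
    """Build the chat completions URL for a dedicated endpoint ID."""
    if not endpoint_id:
        raise ValueError("endpoint_id must be non-empty")
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{BASE_URL}/{quote(endpoint_id, safe='')}/chat/completions"


print(chat_completions_url("endpt-a1b2c3d4e5f6"))
```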

Public vs Private Endpoints

During creation you choose the auth mode:

Private (default): Every call must include a valid Authorization: Bearer hvz_live_... header belonging to the owner of the endpoint.
Public: No Authorization header is required. Useful for demos, client-side apps, or when you want to add your own auth layer on top.

Even public endpoints are still owned by you and appear in your dashboard with full telemetry and billing attribution.
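The difference between the two modes comes down to one header. A minimal sketch, assuming a JSON request body; the helper name is hypothetical, and the key value you pass is your own hvz_live_... key.

```python
from typing import Dict, Optional


def build_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Headers for a dedicated endpoint call.

    Pass api_key for a private endpoint; omit it for a public one.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Private mode: the key must belong to the owner of the endpoint.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```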

Lifecycle & Statuses

Dedicated endpoints have these user-facing states:

Status	What it means for you	Can you call it?	What you'll see
`building`	We're provisioning the hardware and loading your model	No	"PROVISIONING" badge with spinning indicator
`up`	Ready and accepting requests. Auto-scaling is active.	Yes	"ONLINE" with green pulse
`hibernated`	Scale-to-zero is active — the endpoint is sleeping to save cost	Yes (triggers wake)	"SLEEPING"
`waking`	A request came in; the endpoint is spinning up from sleep	Yes (after a brief delay)	"WAKING UP" badge (cyan). The playground may show a temporary message that the endpoint is waking.
`failed`	Something went wrong during provisioning or runtime	No	"FAILED" in red
`disabled`	You (or we) disabled it	No	Rare
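Client code that already knows an endpoint's status can encode the table above as a lookup. This is only a sketch: the type and function names are hypothetical, and how you obtain the status string is not covered here.

```python
from typing import NamedTuple


class CallPolicy(NamedTuple):
    accepts_requests: bool  # "Can you call it?" column
    expect_delay: bool      # True when the call has to wait for a wake-up


# Mirrors the status table: hibernated and waking endpoints accept calls,
# but the caller should expect to wait while the endpoint wakes.
STATUS_POLICY = {
    "building": CallPolicy(False, False),
    "up": CallPolicy(True, False),
    "hibernated": CallPolicy(True, True),
    "waking": CallPolicy(True, True),
    "failed": CallPolicy(False, False),
    "disabled": CallPolicy(False, False),
}


def call_policy(status: str) -> CallPolicy:
    """Look up whether an endpoint in the given status can be called."""
    try:
        return STATUS_POLICY[status.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown endpoint status: {status!r}") from None
```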

Scale to Zero

When you provision with Min Nodes = 0, your endpoint can automatically hibernate after a period of inactivity (usually ~1 hour of no traffic). This stops the underlying compute so you stop paying for it.

Look for the SLEEPING status in your dashboard and fleet list.
The first request after sleep will wake it up (you'll briefly see WAKING UP).
Cold starts are fast thanks to our persistent model cache, but expect a short delay (typically under a minute once the hardware is allocated).
While waking, the playground and API will surface a clear message so you know what's happening instead of a confusing error.

You can monitor current active nodes and hourly burn on the endpoint detail page — these update based on actual running capacity.

Scale-to-zero is great for dev, staging, or spiky production workloads.
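Because the first request after sleep has to wait for the wake-up, clients may want to retry with backoff rather than fail immediately. The sketch below assumes a caller-supplied send function that returns (ready, body), where ready is False while the endpoint is still waking. How the API actually signals "waking" is not specified here, so that detection is left to the caller, and all names and timings are hypothetical.

```python
import time
from typing import Callable, Tuple


def call_with_wake_retry(send: Callable[[], Tuple[bool, str]],
                         max_wait_s: float = 90.0,
                         initial_delay_s: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry send() with exponential backoff until the endpoint is awake."""
    waited = 0.0
    delay = initial_delay_s
    while True:
        ready, body = send()
        if ready:
            return body
        if waited >= max_wait_s:
            raise TimeoutError(f"endpoint still waking after {waited:.0f}s")
        # Never sleep past the overall deadline.
        pause = min(delay, max_wait_s - waited)
        sleep(pause)
        waited += pause
        # Double the delay each round, capped at 15 seconds.
        delay = min(delay * 2, 15.0)
```

The sleep parameter is injectable so the loop can be exercised in tests without real waiting.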

Calling a Dedicated Endpoint

The request body is identical to an Elastic request:

JSON

{
  "model": "endpt-a1b2c3d4e5f6",
  "messages": [...],
  "max_tokens": 2048,
  "stream": true
}

The model field is optional because the URL already identifies the endpoint. If you pass one, the dedicated proxy ignores it in favor of the endpoint’s configured model.
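Putting the URL, auth header, and body together, a request can be assembled with Python's standard library as sketched below. The function name is hypothetical, and the model field is left out since the URL already identifies the endpoint.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def build_chat_request(endpoint_id: str,
                       messages: List[Dict[str, str]],
                       api_key: Optional[str] = None,
                       max_tokens: int = 2048,
                       stream: bool = False) -> urllib.request.Request:
    """Assemble (but do not send) a POST to a dedicated endpoint."""
    url = f"https://hypervize.tech/api/d/{endpoint_id}/chat/completions"
    body = {"messages": messages, "max_tokens": max_tokens, "stream": stream}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only private endpoints need the Authorization header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen. With stream set to false, the response body should be a single JSON document, assuming the usual OpenAI-compatible response shape.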

Monitoring & Logs

Each dedicated endpoint has a detail page in the dashboard (/dashboard/inference/blocks/{id}) with:

Current status and hourly burn rate
Live logs from the inference runtime
Telemetry (when enabled)
In-dashboard playground for quick testing
Configuration (min/max nodes, scale-to-zero, custom domain, etc.)

Custom Domains

You can attach a custom domain at creation time (or later via a support ticket during the MVP).

We create the DNS and ACM entries for your domain.
Your endpoint becomes available at https://yourdomain.com/v1/chat/completions (or the path you prefer).
Custom domain cost is added to your hourly rate.

Pricing Model

Dedicated endpoints are billed per GPU-hour based on the underlying instance type, plus small add-ons for:

Logging & telemetry
Custom domain (if used)

Current rates are shown live in the deployment form; they are derived from the model size and our pricing matrix.
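For rough budgeting, the billing structure can be approximated as GPU-hours plus hourly add-ons. The sketch below assumes that the GPU rate applies to node-hours actually running and that add-ons accrue per endpoint-hour. Both are assumptions, and every rate in the usage note is a made-up placeholder, so check the deployment form for real numbers.

```python
def estimate_cost(gpu_hour_rate: float, active_node_hours: float,
                  addon_hour_rate: float = 0.0,
                  endpoint_hours: float = 0.0) -> float:
    """Rough cost estimate under the assumptions stated above.

    gpu_hour_rate:     price per GPU-hour for the chosen instance type
    active_node_hours: total hours of running nodes (zero while hibernated)
    addon_hour_rate:   combined hourly add-ons (logging, custom domain)
    endpoint_hours:    hours over which the add-ons are billed
    """
    values = (gpu_hour_rate, active_node_hours, addon_hour_rate, endpoint_hours)
    if min(values) < 0:
        raise ValueError("rates and hours must be non-negative")
    return gpu_hour_rate * active_node_hours + addon_hour_rate * endpoint_hours
```

For example, with a placeholder rate of 2.50 per GPU-hour, 100 active node-hours, and 0.25 per hour of add-ons over a 720-hour month, the estimate is 430.00.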

See Usage, Limits & Billing for more details.

Limitations (MVP)

Editing an existing endpoint (changing min/max, model, etc.) is not yet supported in the UI. You must delete and re-provision.
Scale-to-zero notifications and advanced autoscaling policies are still maturing.
Pre-warmed images for the fastest cold starts are partially rolled out.

When to Use Dedicated vs Elastic

Scenario	Recommended
Experimenting with many models	Elastic
Steady high-volume production traffic	Dedicated
Fine-tuned or private weights	Dedicated
Need custom context length or advanced inference parameters	Dedicated
Want zero operational burden	Elastic
Public-facing demo or widget	Dedicated (public mode)
