INFERENCE DOCUMENTATION

Dedicated Inference Endpoints

Deploy and manage your own private dedicated inference endpoints with auto-scaling and custom domains.

Dedicated Inference Endpoints

Dedicated Inference lets you run any Hugging Face model (or your own weights) on private, Hypervize-managed GPU capacity with full control over hardware tier, scaling behavior, and access policies.


How It Works

  1. You provide a Hugging Face model ID (and optional HF token for gated models).
  2. Hypervize analyzes the model size and provisions appropriate GPU hardware.
  3. We provision dedicated capacity running our optimized inference runtime.
  4. Auto-scaling policies are attached (target tracking on invocations + configurable min/max).
  5. You receive a permanent endpoint ID and a stable URL.

After creation, the endpoint behaves like any other OpenAI-compatible inference endpoint.


Endpoint URL Pattern

TEXT
POST https://hypervize.tech/api/d/{endpoint-id}/chat/completions

Example:

TEXT
https://hypervize.tech/api/d/endpt-a1b2c3d4e5f6/chat/completions

This is the same format whether the endpoint is private (requires your API key) or public.


Public vs Private Endpoints

During creation you choose the auth mode:

  • Private (default): Every call must include a valid Authorization: Bearer hvz_live_... header belonging to the owner of the endpoint.
  • Public: No Authorization header is required. Useful for demos, client-side apps, or when you want to add your own auth layer on top.

Even public endpoints are still owned by you and appear in your dashboard with full telemetry and billing attribution.


Lifecycle & Statuses

Dedicated endpoints have these user-facing states:

StatusWhat it means for youCan you call it?What you'll see
buildingWe're provisioning the hardware and loading your modelNo"PROVISIONING" badge with spinning indicator
upReady and accepting requests. Auto-scaling is active.Yes"ONLINE" with green pulse
hibernatedScale-to-zero is active — the endpoint is sleeping to save costYes (triggers wake)"SLEEPING"
wakingA request came in; the endpoint is spinning up from sleepBrief delay"WAKING UP" badge (cyan). You may see a temporary message in the playground that it's waking.
failedSomething went wrong during provisioning or runtimeNo"FAILED" in red
disabledYou (or we) disabled itNoRare

Scale to Zero

When you provision with Min Nodes = 0, your endpoint can automatically hibernate after a period of inactivity (usually ~1 hour of no traffic). This stops the underlying compute so you stop paying for it.

  • Look for the SLEEPING status in your dashboard and fleet list.
  • The first request after sleep will wake it up (you'll briefly see WAKING UP).
  • Cold starts are fast thanks to our persistent model cache, but expect a short delay (typically under a minute once the hardware is allocated).
  • While waking, the playground and API will surface a clear message so you know what's happening instead of a confusing error.

You can monitor current active nodes and hourly burn on the endpoint detail page — these update based on actual running capacity.

Scale-to-zero is great for dev, staging, or spiky production workloads.


Calling a Dedicated Endpoint

The request body is identical to Elastic:

JSON
{
  "model": "endpt-a1b2c3d4e5f6",   // or omit — the URL already identifies it
  "messages": [...],
  "max_tokens": 2048,
  "stream": true
}

You can still pass a model field; the dedicated proxy will ignore it in favor of the endpoint’s configured model.


Monitoring & Logs

Each dedicated endpoint has a detail page in the dashboard (/dashboard/inference/blocks/{id}) with:

  • Current status and pricing burn rate
  • Live logs and telemetry from the inference runtime
  • Telemetry (when enabled)
  • In-dashboard playground for quick testing
  • Configuration (min/max nodes, scale-to-zero, custom domain, etc.)

Custom Domains

You can attach a custom domain at creation time (or later via support ticket in MVP).

  • We create a DNS + ACM entry.
  • Your endpoint becomes available at https://yourdomain.com/v1/chat/completions (or the path you prefer).
  • Custom domain cost is added to your hourly rate.

Pricing Model

Dedicated endpoints are billed per GPU-hour based on the underlying instance type, plus small add-ons for:

  • Logging & telemetry
  • Custom domain (if used)

Current rates are shown live in the deployment form (they are derived from the model size + our pricing matrix).

See Usage, Limits & Billing for more details.


Limitations (MVP)

  • Editing an existing endpoint (changing min/max, model, etc.) is not yet supported in the UI. You must delete and re-provision.
  • Scale-to-zero notifications and advanced autoscaling policies are still maturing.
  • Pre-warmed images for the fastest cold starts are partially rolled out.

When to Use Dedicated vs Elastic

ScenarioRecommended
Experimenting with many modelsElastic
Steady high-volume production trafficDedicated
Fine-tuned or private weightsDedicated
Need custom context length or advanced inference parametersDedicated
Want zero operational burdenElastic
Public-facing demo or widgetDedicated (public mode)

Related Documentation

Was this helpful?Send feedback