Provisioning Dedicated Endpoints
Detailed flow of how dedicated inference endpoints are created and managed.
Provisioning Dedicated Endpoints
This document explains exactly what happens when you click “Deploy” on a dedicated endpoint.
High-Level Flow
- Client Request →
POST /api/provision(authenticated via your dashboard session) - Hardware Sizing — Hypervize analyzes the model using live Hugging Face metadata when available, with intelligent name-based fallbacks.
- Infrastructure Provisioning — We allocate and configure dedicated GPU capacity with auto-scaling policies, container runtime, and networking on your behalf.
- Control Plane Registration — Records are created for the endpoint, autoscaling configuration, pricing, and optional custom domain / telemetry.
- Immediate Response — You receive the
endpointIdand stable URL right away (status starts asbuilding). - Background Lifecycle Management — Hypervize monitors the provisioning process, transitions the status (to
uporfailed), and sends email notifications on completion.
Hardware Selection Logic
Hypervize automatically analyzes the requested model and provisions appropriate dedicated GPU capacity.
- When possible, it queries the Hugging Face Hub for the model's safetensors size to estimate required VRAM.
- It then selects a suitable hardware tier (ranging from single A10G GPUs up to 8× Blackwell B200 configurations).
- A set of name-based heuristics also exists as a fallback when live metadata is unavailable or gated.
Storage volume size is intelligently sized based on the model footprint to ensure reliable loading.
What Hypervize Provisions
For each dedicated endpoint we configure and manage:
- A dedicated container running our high-performance inference runtime, tuned for the specific model and hardware tier.
- Auto-scaling policies that respect your min/max instance settings. Set Min Nodes to 0 to enable scale-to-zero (the endpoint will hibernate when idle and wake on the next request).
- Networking, health checking, and secure isolation boundaries.
- Integration with our central proxy layer so the endpoint appears at a stable
/api/d/{endpoint-id}URL with the same OpenAI-compatible contract as Elastic inference.
Control Plane Records
Hypervize records the endpoint, its autoscaling configuration (min/max nodes and scale-to-zero behavior), pricing details, and any custom domain or telemetry settings. This powers the dashboard view, usage reporting, and the stable endpoint URL.
Status Transitions (Expected)
building→up(ready)- For scale-to-zero (Min Nodes = 0):
up→hibernated(sleeping after inactivity) →waking(on first request) →up
You'll see clear status badges in the dashboard ("ONLINE", "SLEEPING", "WAKING UP", etc.). The system surfaces friendly messages in the playground when an endpoint is waking from sleep.
What You Receive Immediately
{
"success": true,
"endpointId": "endpt-a1b2c3d4e5f6",
"defaultUrl": "https://hypervize.tech/api/d/endpt-a1b2c3d4e5f6/chat/completions",
"message": "Endpoint provisioning initiated. You will receive an email when it is ready."
}The email is sent by the lifecycle system once the endpoint reaches a terminal state.
Current Limitations & Workarounds (MVP)
- No in-place editing of min/max or model after creation.
- Deleting an endpoint in the UI is not yet wired (must be done via support or manually in rare cases).
- Custom domain attachment is currently only available at creation time.
These will be addressed post-launch.
Related