Architecture & Data Flow
How Hypervize Inference (Elastic + Dedicated) is implemented end-to-end.
Architecture & Data Flow
This document gives you visibility into the production architecture for the inference MVP. It is intentionally higher-level than internal runbooks.
High-Level Diagram
flowchart TD
Client[Client / SDK / App]
-->|OpenAI format + Bearer key| EdgeProxy[Hypervize Edge Proxy]
EdgeProxy -->|Authentication, model routing, usage metering| ElasticNetwork[Elastic Inference Network]
EdgeProxy -->|Dedicated endpoint routing| DedicatedPlatform[Dedicated Inference Platform]
DedicatedPlatform -->|Managed high-performance runtime| UserEndpoint[Your Dedicated Endpoint]
subgraph "Control Plane"
Dashboard[Hypervize Dashboard]
ProvisionAPI[/api/provision]
ControlDB[(Control Plane Database)]
end
Dashboard --> ProvisionAPI
ProvisionAPI --> DedicatedPlatform
ProvisionAPI --> ControlDB
LifecycleSystem[Lifecycle & Notification System]
DedicatedPlatform --> LifecycleSystem
LifecycleSystem --> ControlDB
LifecycleSystem -->|Email notifications| UserElastic Path (Most Common)
- Client sends a standard OpenAI-compatible request to
/api/chat/completions. - The Edge Proxy performs authentication (API key or session) and resolves the model against the catalog.
- The request is routed over Hypervize’s global Elastic Inference Network.
- The response stream (with token usage metadata) flows back through the proxy for metering and normalization.
- The client receives standard SSE chunks.
All authentication, model translation, and usage metering are handled centrally at the proxy layer. This gives us consistent observability and a single place to apply future features (rate limits, tool routing, etc.).
Dedicated Path
- You submit a provisioning request through the dashboard or API (
POST /api/provision). - Hypervize performs intelligent hardware sizing based on the model (using metadata when available, with name-based fallbacks).
- We provision and configure dedicated GPU capacity with auto-scaling policies on your behalf. The endpoint starts in
buildingstatus. - You receive the permanent
endpt-xxxidentifier and callable URL immediately. - Our managed inference runtime boots the model with your configuration (including any provided Hugging Face token for gated models).
- A background lifecycle system monitors the endpoint and transitions it to
up(orfailed). You receive an email notification on terminal states. - Once online, traffic to
/api/d/endpt-xxx/chat/completionsis routed directly to your dedicated capacity with the same streaming semantics as Elastic.
Fast Cold Starts & Scale-to-Zero (Dedicated)
Dedicated endpoints support scale-to-zero. When there's no traffic for a while the endpoint hibernates (you'll see SLEEPING in the dashboard). The next request wakes it up (WAKING UP status).
We keep a secure cache of your model weights so wake-ups are much faster than a full cold start. In the playground you'll see a clear message while it's waking. Most users experience quick resume times once the hardware is allocated.
This lets you run dedicated capacity cost-effectively for variable workloads.
Data Stores & Infrastructure
Hypervize maintains a control plane database that stores users, API keys, endpoint metadata, autoscaling configurations, pricing, and telemetry settings.
Model weights for dedicated endpoints are stored in secure, customer-isolated storage with a high-performance caching layer optimized for fast loading.
Logs and metrics for your dedicated endpoints are collected and made available through the dashboard.
Security Model (MVP)
- All inference traffic (Elastic and private Dedicated) is authenticated at the Hypervize Edge Proxy.
- Dedicated endpoints run on isolated, Hypervize-managed GPU capacity with strict tenancy boundaries.
- No customer prompts or completions are retained beyond what is required for metering, debugging, and abuse prevention.
- API keys are stored securely and can be revoked instantly with no grace period.
Observability (Current)
- The Edge Proxy logs all usage events (with keys redacted).
- Dedicated endpoints surface container logs and basic telemetry through the dashboard.
- Additional GPU-level metrics and advanced monitoring are on the post-MVP roadmap.
Future Directions (Post-MVP)
- LiteLLM or custom router for more sophisticated model routing
- Full agentic tool execution loop (see
agentic_plan.md) - Multi-region + multi-cloud dedicated options
- Customer-managed keys / private VPC endpoints
This architecture has been deliberately kept focused and maintainable for the MVP while still delivering on the core promise of a unified, high-performance inference platform with both serverless and dedicated options.