Architecture & Data Flow

This document gives you visibility into the production architecture for the inference MVP. It is intentionally higher-level than internal runbooks.

High-Level Diagram

MERMAID

flowchart TD
    Client[Client / SDK / App] 
    -->|OpenAI format + Bearer key| EdgeProxy[Hypervize Edge Proxy]

    EdgeProxy -->|Authentication, model routing, usage metering| ElasticNetwork[Elastic Inference Network]

    EdgeProxy -->|Dedicated endpoint routing| DedicatedPlatform[Dedicated Inference Platform]

    DedicatedPlatform -->|Managed high-performance runtime| UserEndpoint[Your Dedicated Endpoint]

    subgraph "Control Plane"
        Dashboard[Hypervize Dashboard]
        ProvisionAPI[/api/provision]
        ControlDB[(Control Plane Database)]
    end

    Dashboard --> ProvisionAPI
    ProvisionAPI --> DedicatedPlatform
    ProvisionAPI --> ControlDB

    LifecycleSystem[Lifecycle & Notification System]
    DedicatedPlatform --> LifecycleSystem
    LifecycleSystem --> ControlDB
    LifecycleSystem -->|Email notifications| User

Elastic Path (Most Common)

Client sends a standard OpenAI-compatible request to /api/chat/completions.
The Edge Proxy performs authentication (API key or session) and resolves the model against the catalog.
The request is routed over Hypervize’s global Elastic Inference Network.
The response stream (with token usage metadata) flows back through the proxy for metering and normalization.
The client receives standard SSE chunks.

All authentication, model translation, and usage metering are handled centrally at the proxy layer. This gives us consistent observability and a single place to apply future features (rate limits, tool routing, etc.).

Dedicated Path

You submit a provisioning request through the dashboard or API (POST /api/provision).
Hypervize performs intelligent hardware sizing based on the model (using metadata when available, with name-based fallbacks).
We provision and configure dedicated GPU capacity with auto-scaling policies on your behalf. The endpoint starts in building status.
You receive the permanent endpt-xxx identifier and callable URL immediately.
Our managed inference runtime boots the model with your configuration (including any provided Hugging Face token for gated models).
A background lifecycle system monitors the endpoint and transitions it to up (or failed). You receive an email notification on terminal states.
Once online, traffic to /api/d/endpt-xxx/chat/completions is routed directly to your dedicated capacity with the same streaming semantics as Elastic.

Fast Cold Starts & Scale-to-Zero (Dedicated)

Dedicated endpoints support scale-to-zero. When there's no traffic for a while the endpoint hibernates (you'll see SLEEPING in the dashboard). The next request wakes it up (WAKING UP status).

We keep a secure cache of your model weights so wake-ups are much faster than a full cold start. In the playground you'll see a clear message while it's waking. Most users experience quick resume times once the hardware is allocated.

This lets you run dedicated capacity cost-effectively for variable workloads.

Data Stores & Infrastructure

Hypervize maintains a control plane database that stores users, API keys, endpoint metadata, autoscaling configurations, pricing, and telemetry settings.

Model weights for dedicated endpoints are stored in secure, customer-isolated storage with a high-performance caching layer optimized for fast loading.

Logs and metrics for your dedicated endpoints are collected and made available through the dashboard.

Security Model (MVP)

All inference traffic (Elastic and private Dedicated) is authenticated at the Hypervize Edge Proxy.
Dedicated endpoints run on isolated, Hypervize-managed GPU capacity with strict tenancy boundaries.
No customer prompts or completions are retained beyond what is required for metering, debugging, and abuse prevention.
API keys are stored securely and can be revoked instantly with no grace period.

Observability (Current)

The Edge Proxy logs all usage events (with keys redacted).
Dedicated endpoints surface container logs and basic telemetry through the dashboard.
Additional GPU-level metrics and advanced monitoring are on the post-MVP roadmap.

Future Directions (Post-MVP)

LiteLLM or custom router for more sophisticated model routing
Full agentic tool execution loop (see agentic_plan.md)
Multi-region + multi-cloud dedicated options
Customer-managed keys / private VPC endpoints

This architecture has been deliberately kept focused and maintainable for the MVP while still delivering on the core promise of a unified, high-performance inference platform with both serverless and dedicated options.