INFERENCE DOCUMENTATION

Usage, Limits & Billing

Current state of metering, free tier, and billing for the inference MVP.

This page is intentionally transparent about what is and is not yet enforced as of launch.


What Is Tracked Today

Every streaming response that contains a usage object is logged by the proxy with:

  • User ID
  • Model (or endpoint ID)
  • Input / output / total tokens
  • Estimated cost (using catalog pricing for Elastic)

These logs are the foundation for future billing.
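
As an illustration of the fields listed above, here is a minimal Python sketch of what one usage record and its cost estimate might look like. The class name, field names, and per-million-token prices are assumptions made for this example; the proxy's actual log schema and the real catalog prices are not documented on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical shape of one logged usage entry (not the real schema)."""
    user_id: str
    model: str                   # model name or endpoint ID
    input_tokens: int
    output_tokens: int
    input_price_per_1m: float    # assumed catalog price, USD per 1M input tokens
    output_price_per_1m: float   # assumed catalog price, USD per 1M output tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def estimated_cost(self) -> float:
        # Cost = tokens * price-per-token, summed over input and output.
        return (
            self.input_tokens * self.input_price_per_1m
            + self.output_tokens * self.output_price_per_1m
        ) / 1_000_000


# Placeholder values only; the prices here are invented for the example.
record = UsageRecord("user_123", "example-model", 1200, 300, 0.50, 1.50)
print(record.total_tokens, record.estimated_cost)
```

Keeping a similar record on the client side gives you something to reconcile against once metered billing goes live.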

Dedicated endpoints also track hourly burn via the pricing configs attached at provisioning time.


Free Tier

The database schema includes a free_queries_remaining column on the users table (default 3).

Current status: The column exists and is decremented in some internal paths, but hard enforcement in the inference proxy has not yet been implemented.

MVP Reality

  • You can currently make more than 3 inference calls with a new account.
  • We are logging usage and will add proper enforcement shortly after launch.
  • Marketing and onboarding copy should be careful not to over-promise the free tier until enforcement lands.

We will communicate the exact cutoff date once the proxy change is deployed.


Dedicated Endpoint Pricing

Dedicated endpoints are billed hourly for the underlying dedicated GPU capacity while the endpoint is running.

With scale-to-zero (set Min Nodes = 0 at provisioning), the endpoint can hibernate when idle, and compute charges stop accruing during hibernation. The UI shows a "SLEEPING" status while the endpoint is hibernating. The first request wakes it (with a brief "WAKING UP" state), and billing resumes once the endpoint is active again.

Add-ons (added to the hourly rate):

  • Logging/telemetry: +$0.15/hr
  • Custom domain: +$0.10/hr

The deployment form shows a live estimate before you click Deploy.
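
To sanity-check an estimate yourself, the arithmetic can be sketched as below. The add-on rates come from the list above; the base hourly rate is a made-up placeholder, and the function names are hypothetical. The sketch also assumes that add-ons accrue only while the endpoint is active, which this page does not state explicitly.

```python
from decimal import Decimal

# Add-on rates from this page, in USD per hour.
LOGGING_ADDON = Decimal("0.15")
CUSTOM_DOMAIN_ADDON = Decimal("0.10")


def hourly_rate(base_rate: Decimal, logging: bool = False,
                custom_domain: bool = False) -> Decimal:
    """Base GPU rate plus any selected add-ons, in USD per hour."""
    rate = base_rate
    if logging:
        rate += LOGGING_ADDON
    if custom_domain:
        rate += CUSTOM_DOMAIN_ADDON
    return rate


def estimate_cost(base_rate: Decimal, active_hours: Decimal,
                  logging: bool = False, custom_domain: bool = False) -> Decimal:
    """Estimated charge for the hours the endpoint is active (not hibernating)."""
    return hourly_rate(base_rate, logging, custom_domain) * active_hours


# Example with an assumed base rate of $2.00/hr (placeholder, not a real price).
print(estimate_cost(Decimal("2.00"), Decimal("10"), logging=True, custom_domain=True))
```

Decimal is used instead of float so that currency amounts do not pick up binary rounding error.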


Elastic Pricing

Elastic is billed per token on our Elastic network, with no hourly component.

Usage is reported in the final data: chunk of every streamed response.
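
A minimal sketch of pulling the usage object out of a stream on the client side is shown below. It assumes the stream is server-sent events with JSON payloads, that the stream may end with a "[DONE]" sentinel, and that the usage object uses prompt_tokens / completion_tokens / total_tokens field names. Those details are assumptions borrowed from common OpenAI-style APIs and are not confirmed on this page.

```python
import json
from typing import Iterable, Optional


def extract_usage(lines: Iterable[str]) -> Optional[dict]:
    """Return the last usage object found in the stream's data: chunks, if any."""
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # assumed end-of-stream sentinel, carries no JSON
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore malformed chunks rather than abort the stream
        if isinstance(chunk, dict) and chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Hand-written sample stream for illustration (not captured from the service).
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    '',
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}}',
    '',
    'data: [DONE]',
]
usage = extract_usage(sample_stream)
print(usage)
```

In a real client, the lines argument would be the line iterator of your HTTP library's streaming response.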


Rate Limiting (MVP)

There is currently no per-key or per-user rate limiting applied at the Hypervize proxy layer for the inference MVP.

Platform-level protections exist, but you should implement client-side throttling and cost controls for production workloads.
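
One common way to implement client-side throttling is a token bucket. The sketch below is a generic single-threaded version, not something provided by Hypervize; the class name and the chosen rate and capacity are placeholders you would tune for your workload.

```python
import time


class TokenBucket:
    """Generic client-side limiter: allows bursts up to `capacity`,
    then refills at `rate_per_sec` requests per second. Not thread-safe."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock          # injectable so the limiter can be tested
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if a request may be sent now, False if it should wait."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time elapsed, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Example: at most 2 requests per second on average, bursts of up to 5.
limiter = TokenBucket(rate_per_sec=2.0, capacity=5)
if limiter.try_acquire():
    pass  # send the inference request here
```

When try_acquire returns False, the caller can sleep briefly and retry, or queue the request.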

Rate limiting and quotas are high on the post-MVP roadmap.


What Will Change Post-MVP

  • Real Stripe customers and metered billing for both products
  • Hard free tier cutoff with clear upgrade path
  • Per-key rate limits and burst allowances
  • Usage dashboards with historical graphs and cost attribution
  • Invoicing and credit support for larger customers

Honest Advice for Launch

If you are building customer-facing products on Hypervize Inference today:

  • Assume you will be responsible for your own usage tracking and cost control in the short term.
  • Set conservative max_tokens values.
  • Monitor the logs we emit for the first few weeks.
  • Budget conservatively for Dedicated endpoints (they keep running, and accruing hourly charges, until scaled down).
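
The first two points above can be combined into a small client-side budget guard, sketched below. The class and exception names are hypothetical, and the guard is deliberately rough: max_tokens caps only output tokens, so input tokens can still push a request over the remaining budget.

```python
class BudgetExceeded(Exception):
    """Raised when the client-side token budget is used up."""


class TokenBudget:
    """Tracks tokens used against a self-imposed limit (client-side only)."""

    def __init__(self, limit_tokens: int):
        self.limit_tokens = limit_tokens
        self.used_tokens = 0

    @property
    def remaining(self) -> int:
        return max(0, self.limit_tokens - self.used_tokens)

    def clamp_max_tokens(self, requested: int) -> int:
        """Cap a request's max_tokens to what is left in the budget."""
        if self.remaining == 0:
            raise BudgetExceeded("token budget exhausted")
        return min(requested, self.remaining)

    def record(self, total_tokens: int) -> None:
        """Add the total token count reported for a finished response."""
        self.used_tokens += total_tokens


# Example: a budget of 10,000 tokens and a conservative per-request cap.
budget = TokenBudget(limit_tokens=10_000)
max_tokens = budget.clamp_max_tokens(256)
print(max_tokens)
```

After each response, pass the reported total token count to record so the next call is clamped against what is left.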

We are moving as fast as possible on the billing surface because we know it is table stakes for a real platform.


Questions?

Reach out via the enterprise contact form or your account representative. We are happy to give early customers visibility into upcoming billing changes.
