Rate limits
Direct Inference API request rate, balance, spend-cap, concurrency, and model availability limits.
The direct Inference API at https://api.llmbase.ai/v1 does not use a fixed
requests-per-second or requests-per-minute product quota in the API Worker.
Requests are gated by prepaid balance, optional API-key spend caps, model
limits, model backend availability, and an active-request concurrency guard.
Request rate
No fixed requests-per-second or requests-per-minute quota is enforced for direct Inference API keys. Instead, each account can keep up to 20 prepaid inference requests active at the same time.
That means your completed requests per minute depend on how long your model requests take:
| Average request duration | Approx. completed requests per minute |
|---|---|
| 1 second | 1,200 |
| 5 seconds | 240 |
| 10 seconds | 120 |
| 30 seconds | 40 |
Formula: 20 concurrent requests * 60 / average request duration in seconds.
This is an upper bound: it assumes all 20 slots stay occupied for the whole
minute. For example, at an average of 5 seconds per request, 20 concurrent
requests allow about 240 completed requests per minute; at 30 seconds, the same
limit allows about 40.
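The formula can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name `max_completed_per_minute` is made up for illustration and is not part of any LLMBase SDK.

```python
ACTIVE_REQUEST_LIMIT = 20  # account-level concurrency limit stated in this section


def max_completed_per_minute(avg_duration_seconds: float,
                             concurrency: int = ACTIVE_REQUEST_LIMIT) -> float:
    """Upper bound on completed requests per minute for a given average duration."""
    if avg_duration_seconds <= 0:
        raise ValueError("average duration must be positive")
    return concurrency * 60 / avg_duration_seconds


# Reproduce the rows of the table above.
for seconds in (1, 5, 10, 30):
    print(seconds, max_completed_per_minute(seconds))
```

Real throughput will usually be lower, because slots are rarely all busy for a full minute.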
Active request limit
Each account can have up to 20 active prepaid inference requests reserved at
the same time. The API Worker enforces this before proxying the request to a
model. If the account already has 20 active prepaid reservations, the API
returns 429 with too_many_concurrent_requests and the message
Too many active inference requests. Retry after current requests finish.
This is an account-level active-request guard, not a model-specific rate limit.
It is separate from the Chat and Agents API burst guards on llmbase.ai.
Active prepaid reservations expire after 10 minutes if they are not finalized or released. This protects accounts from stuck reservations after interrupted streams or client disconnects.
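One way to stay under the active-request limit is to cap concurrency on the client. The sketch below uses `asyncio.Semaphore` from the Python standard library; `run_bounded` and `fake_request` are hypothetical names, and the fake request stands in for a real API call. Because the limit is per account, other clients using the same account still count against it, so a 429 remains possible.

```python
import asyncio

ACTIVE_REQUEST_LIMIT = 20  # account-level limit described above


async def run_bounded(jobs, limit=ACTIVE_REQUEST_LIMIT):
    """Run coroutine factories with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:  # waits here while `limit` jobs are already running
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_request():
        # Hypothetical stand-in for one inference request.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "ok"

    results = await run_bounded([fake_request] * 50)
    return len(results), peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Setting `limit` somewhat below 20 leaves headroom for other processes that share the account.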
Balance and spend caps
Inference API keys spend prepaid credits. Before a request runs, LLMBase
estimates the maximum cost from the selected model, input size, requested output
tokens, and model output limit. If the account does not have enough available
prepaid balance for that reservation, the API returns 402 with
insufficient_balance.
If an API key has a monthly spend cap, reserved and already-spent cents for that
key are counted against the cap. When a new request would exceed the cap, the API
returns 402 with api_key_spend_cap_exceeded.
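The two 402 conditions above can be mirrored in a client-side pre-check. This is only a sketch of the rules as described here: the function name, its parameters, and the order of the two checks are assumptions, and the server-side cost estimate itself is not reproduced.

```python
from typing import Optional


def precheck(estimate_cents: int,
             available_balance_cents: int,
             key_spent_cents: int = 0,
             key_reserved_cents: int = 0,
             key_cap_cents: Optional[int] = None) -> Optional[str]:
    """Return the error a request would likely hit, or None if it should pass.

    All inputs are caller-supplied estimates; check order is an assumption.
    """
    # The reservation must fit in the available prepaid balance.
    if estimate_cents > available_balance_cents:
        return "insufficient_balance"
    # Reserved and already-spent cents both count against a key's monthly cap.
    if key_cap_cents is not None:
        if key_spent_cents + key_reserved_cents + estimate_cents > key_cap_cents:
            return "api_key_spend_cap_exceeded"
    return None


print(precheck(50, 40))                   # balance too low for the reservation
print(precheck(50, 100, 900, 60, 1000))   # 900 + 60 + 50 exceeds the 1000 cap
```

A pre-check like this cannot replace handling the 402 response, since the authoritative balance and cap state live on the server.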
Model limits
Model metadata can include context and output limits, but per_request_limits
is null in the public model-list schema. Use each model’s context_length and
max_output_length fields from GET /v1/models?metadata=true to size requests.
When a request exceeds a model's context or output limit, reduce the prompt, attachments, or requested output tokens, then retry.
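A request can be sized before it is sent by clamping the requested output tokens to the model's published limits. The sketch below assumes that each model entry is a dict carrying `context_length` and `max_output_length`, that prompt and output tokens share the context window, and that the caller supplies its own prompt token count. `fit_output_tokens` is a hypothetical helper name.

```python
def fit_output_tokens(model: dict, prompt_tokens: int, requested_output: int) -> int:
    """Clamp requested output tokens to the model's context and output limits."""
    room = model["context_length"] - prompt_tokens  # assumes a shared window
    if room <= 0:
        raise ValueError("prompt alone fills the model context; shorten it")
    limits = [requested_output, room]
    # Treat a missing or null output limit as "no separate output limit".
    if model.get("max_output_length") is not None:
        limits.append(model["max_output_length"])
    return min(limits)


# Example metadata with made-up numbers, shaped like one model-list entry.
example_model = {"context_length": 8000, "max_output_length": 2000}
print(fit_output_tokens(example_model, prompt_tokens=7000, requested_output=4000))
```

Clamping the output budget up front also tends to lower the cost reserved for the request, since requested output tokens feed into the estimate.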
Model availability
If a selected model is temporarily rate-limited or unavailable behind the
managed inference route, LLMBase either retries through eligible same-model
routing, where supported, or returns an OpenAI-compatible error such as
model_rate_limited. This is separate from the account-level concurrency,
balance, and spend-cap gates above.
For streaming requests, a temporary model backend failure can return a 502 response with this body:
{"error":{"message":"LLMBase model backend unavailable","type":"server_error"}}
This 502 response means the request reached LLMBase but the selected model
backend was unavailable while the request was running. It is
not the active-request concurrency guard and it is not an API-key spend-cap
rejection. Retry with exponential backoff and reduce parallelism if failures
cluster around long-running streams. For DeepSeek agent loops, use
deepseek/deepseek-v4-flash for normal high-throughput coding iterations and
reserve deepseek/deepseek-v4-pro for harder long-context reasoning steps.
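The retry advice above can be sketched as a small exponential-backoff helper. This is an illustration under stated assumptions: `send` is a hypothetical callable that performs one request and returns `(status, body)`, the delay values are arbitrary defaults, and only 429 and 502 are treated as retryable because those are the transient statuses described on this page. A 402 is returned immediately, since retrying does not fix balance or spend-cap rejections.

```python
import random
import time

RETRYABLE_STATUSES = {429, 502}  # concurrency guard and backend-unavailable


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or last_attempt:
            return status, body
        # Double the delay on each retry, capped, plus jitter to spread retries.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))


# Usage with a fake sender: two 502 responses, then success.
responses = iter([(502, "unavailable"), (502, "unavailable"), (200, "ok")])
print(call_with_backoff(lambda: next(responses), sleep=lambda seconds: None))
```

If failures cluster around long-running streams, lowering client-side concurrency alongside the backoff reduces how many requests are retried at once.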