Rate limits

Direct Inference API request rate, balance, spend-cap, concurrency, and model availability limits.

The direct Inference API at https://api.llmbase.ai/v1 does not use a fixed requests-per-second or requests-per-minute product quota in the API Worker. Requests are gated by prepaid balance, optional API-key spend caps, model limits, model backend availability, and an active-request concurrency guard.

Request rate

No fixed requests-per-second or requests-per-minute quota is enforced for direct Inference API keys. Instead, each account can keep up to 20 prepaid inference requests active at the same time.

That means your completed requests per minute depend on how long your model requests take:

Average request duration    Approx. completed requests per minute
1 second                    1,200
5 seconds                   240
10 seconds                  120
30 seconds                  40

Formula: 20 concurrent requests * 60 / average request duration in seconds. For example, if your average request takes 5 seconds, 20 concurrent requests allow about 240 completed requests per minute. If your average request takes 30 seconds, the same limit allows about 40 completed requests per minute.
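The formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the constant are illustrative, and the result is an upper bound that assumes all 20 slots stay busy for the whole minute.

```python
# Active-request limit stated on this page.
ACCOUNT_CONCURRENCY_LIMIT = 20


def completed_requests_per_minute(
    avg_duration_seconds: float,
    concurrency: int = ACCOUNT_CONCURRENCY_LIMIT,
) -> float:
    """Approximate throughput when every concurrency slot stays busy."""
    if avg_duration_seconds <= 0:
        raise ValueError("avg_duration_seconds must be positive")
    return concurrency * 60 / avg_duration_seconds


if __name__ == "__main__":
    # Reproduce the rows of the table above.
    for seconds in (1, 5, 10, 30):
        print(seconds, completed_requests_per_minute(seconds))
```

Real throughput is usually lower, because clients rarely keep every slot occupied without gaps.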

Active request limit

Each account can have up to 20 active prepaid inference requests reserved at the same time. The API Worker enforces this before proxying the request to a model. If the account already has 20 active prepaid reservations, the API returns 429 with too_many_concurrent_requests and the message "Too many active inference requests. Retry after current requests finish."

This is an account-level active-request guard, not a model-specific rate limit. It is separate from the Chat and Agents API burst guards on llmbase.ai.

Active prepaid reservations expire after 10 minutes if they are not finalized or released. This protects accounts from stuck reservations after interrupted streams or client disconnects.
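One way to stay under the active-request limit is to cap in-flight requests on the client side. The sketch below uses an asyncio semaphore; run_bounded and the fake_request stand-in are hypothetical names, and a real client would replace fake_request with its own HTTP call. If several processes share one account, each needs a smaller limit, since the guard is account-wide.

```python
import asyncio

# Account-level active-request limit stated above.
MAX_ACTIVE_REQUESTS = 20


async def run_bounded(tasks, limit=MAX_ACTIVE_REQUESTS):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await task()

    return await asyncio.gather(*(guarded(task) for task in tasks))


async def demo():
    """Run 50 fake requests and report the peak number running at once."""
    active = 0
    peak = 0

    async def fake_request():
        # Stand-in for a real inference call (hypothetical).
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return "ok"

    results = await run_bounded([fake_request] * 50)
    return peak, len(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A client-side cap does not replace handling the 429 response, because other clients on the same account, or reservations that have not yet expired, can still occupy slots.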

Balance and spend caps

Inference API keys spend prepaid credits. Before a request runs, LLMBase estimates the maximum cost from the selected model, input size, requested output tokens, and model output limit. If the account does not have enough available prepaid balance for that reservation, the API returns 402 with insufficient_balance.

If an API key has a monthly spend cap, reserved and already-spent cents for that key are counted against the cap. When a new request would exceed the cap, the API returns 402 with api_key_spend_cap_exceeded.
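The two checks described above can be mirrored in a client-side pre-check, for example to avoid sending requests that are likely to be rejected. This is a sketch under assumptions: the function and parameter names are hypothetical, the order of the checks is a guess, and the way LLMBase computes the cost estimate is not reproduced here.

```python
from typing import Optional


def reservation_rejection(
    estimate_cents: int,
    available_balance_cents: int,
    key_spent_cents: int = 0,
    key_reserved_cents: int = 0,
    key_monthly_cap_cents: Optional[int] = None,
) -> Optional[str]:
    """Return the expected error code, or None if the request would pass.

    Check order is an assumption; only the two conditions come from the docs.
    """
    # The reservation must fit in the available prepaid balance.
    if estimate_cents > available_balance_cents:
        return "insufficient_balance"
    # Reserved and already-spent cents both count against the key's cap.
    if key_monthly_cap_cents is not None:
        committed = key_spent_cents + key_reserved_cents
        if committed + estimate_cents > key_monthly_cap_cents:
            return "api_key_spend_cap_exceeded"
    return None
```

For instance, a key with a 1,000-cent cap that has spent 900 cents and holds 80 cents in reservations would be expected to reject a 30-cent estimate, even if the account balance is large. Neither 402 error is worth retrying unchanged: the balance or the cap has to change first.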

Model limits

Model metadata can include context and output limits, but per_request_limits is null in the public model-list schema. Use each model’s context_length and max_output_length fields from GET /v1/models?metadata=true to size requests.

When a request exceeds a model context or output limit, reduce the prompt, attachments, or requested output tokens and retry.
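As a sketch of how the two metadata fields could be used to size a request, the helper below clamps the requested output tokens. It assumes that context_length covers prompt and output tokens combined, which this page does not state, and that the caller already has a prompt token count; the function name is hypothetical.

```python
def fit_max_output_tokens(
    prompt_tokens: int,
    requested_output_tokens: int,
    context_length: int,
    max_output_length: int,
) -> int:
    """Clamp requested output tokens to the model's limits.

    Assumes context_length bounds prompt + output together (an assumption).
    """
    room = context_length - prompt_tokens
    if room <= 0:
        # Nothing left for output: the prompt itself must be shortened.
        raise ValueError("prompt does not fit in the model context")
    return min(requested_output_tokens, max_output_length, room)
```

With context_length 8192 and max_output_length 2048, a 1,000-token prompt asking for 4,096 output tokens would be clamped to 2,048, and an 8,000-token prompt would be clamped to 192.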

Model availability

If a selected model is temporarily rate-limited or unavailable behind the managed inference route, LLMBase either retries through eligible same-model routing, where supported, or returns an OpenAI-compatible error such as model_rate_limited. This is separate from the account-level concurrency, balance, and spend-cap gates above.

For streaming requests, a temporary model backend failure can return:

{"error":{"message":"LLMBase model backend unavailable","type":"server_error"}}

This 502 response means the request reached LLMBase but the selected model backend was unavailable while the request was running. It is not the active-request concurrency guard and it is not an API-key spend-cap rejection. Retry with exponential backoff and reduce parallelism if failures cluster around long-running streams.

For DeepSeek agent loops, use deepseek/deepseek-v4-flash for normal high-throughput coding iterations and reserve deepseek/deepseek-v4-pro for harder long-context reasoning steps.
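The exponential backoff advice for 502 responses can be sketched as follows. The helper name, the send callable returning a (status, body) pair, and the delay values are all illustrative assumptions rather than part of any LLMBase client. The sketch also retries 429, treats every other status (including 402) as final, and adds random jitter so parallel workers do not retry in lockstep.

```python
import random
import time

# Statuses treated as transient in this sketch (an assumption).
RETRYABLE_STATUSES = {429, 502}


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable returning a (status, body) tuple.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or last_attempt:
            return status, body
        # Full jitter: wait a random time up to the exponential ceiling.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    # Simulated sequence: two backend failures, then success.
    responses = iter([(502, {}), (502, {}), (200, {"ok": True})])
    print(call_with_backoff(lambda: next(responses), sleep=lambda _: None))
```

For streaming requests the failure may arrive after some output has already been received, so a retry generally has to restart the request rather than resume it.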