Agent usage and rate limits

Understand the included Pro Chat/Agent budget, approximate request coverage, and burst guards for chat-agent access.

Updated June 15, 2026

Agent requests are included in Pro, but they are not unlimited. They consume the same included subscription budget as normal LLMBase Chat requests. Heavy agent usage therefore leaves less included budget for the same subscription period. If the included budget is exhausted, the API returns a budget error before running the model request.

Agent requests and normal LLMBase Chat requests draw from the same included Pro Chat/Agent budget. The dashboard shows budget used and left as percentages for the current billing period; it does not expose the internal budget amount.

Approx. short text requests per month

The included budget is not a fixed request quota. For short text chats, the monthly budget usually covers approximately:

Model class	Free	Starter	Pro
Small models	500-1,500	3,500-10,000	7,500-20,000
Medium models	Not included	700-2,000	1,500-4,000
Large models	Not included	Not included	150-800
Premium models	Not included	Not included	30-150

Estimates are for short text chats. They are not fixed request quotas; long context, tools, images, research, and high-output answers use more of the included budget. Agent keys require Pro, so the Free and Starter columns are shown for browser Chat planning.

Burst guards

Server-side burst controls still protect the service from short spikes and automation loops. These are hidden burst guards, not message quotas, and they do not replace the included subscription budget. Browser chat allows up to 20 requests per minute per signed-in account. Chat-agent access allows up to 60 requests per minute per account and 30 requests per minute per chat-agent key. They protect your account from accidental retry loops, leaked keys, and automation mistakes before they spend your included budget unexpectedly. They also protect LLMBase infrastructure and model services from abuse spikes, so reliable users are not affected by one runaway client. If you need higher per-user burst limits for a legitimate workflow, contact support; we can review the account and increase the limit per user without changing the public default.

If a burst guard rejects a request, the API returns 429 with rate_limit_exceeded and asks the client to wait one minute before retrying. Cloudflare WAF rules are still useful for coarse IP-level abuse filtering and dashboard visibility, while LLMBase keeps Workers rate limits for authenticated user and chat-agent key checks.

Production workloads

For deterministic production jobs, background workers, or usage that must scale with exact cost accounting, use the direct Inference API instead of the Pro agent subscription.

Agent traffic uses the same OpenAI-compatible behavior, usage telemetry, and prompt-cache support as direct inference. LLMBase automatically adds a stable prompt_cache_key for subscription-agent calls when the client does not provide one.

Clients may still send their own prompt_cache_key if they need a stable cache namespace per workspace, repository, or long-running agent thread.

Starter and Free users still cannot use chat-agent keys, even if they have prepaid inference credits. For production jobs, unattended background workers, or customer-facing API products, use the direct Inference API with a llmbase_... inference key.