Prompt caching

Use prompt_cache_key and cache-read pricing for repeated long prompts.

LLMBase applies prompt_cache_key on models that support OpenAI-compatible prompt caching. Use a stable key per application, user, workspace, or long-lived agent conversation when the beginning of the prompt stays mostly the same.

On supported models, prompt caching is automatic once the cache is warm. The first request normally writes (warms) the cache and is billed as regular input. Later requests that start with the same stable prompt prefix can report usage.prompt_tokens_details.cached_tokens, and those cached prompt tokens are billed at the model’s input_cache_read price when that price is configured.

Subscription-agent requests use the same prompt-cache behavior. LLMBase adds a stable prompt_cache_key for agent calls when the client does not provide one, so compatible models can reuse repeated long prompt prefixes.

Basic request

{
  "model": "deepseek/deepseek-v4-flash",
  "prompt_cache_key": "workspace-123-agent",
  "messages": [
    { "role": "system", "content": "You are a coding agent for this repository..." },
    { "role": "user", "content": "Continue from the last task." }
  ]
}

Keep the cacheable prefix stable. Put changing instructions, timestamps, temporary files, and user questions later in the message list so repeated requests can reuse the cached prefix. Short prompts may not be cached by the selected model.
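
As a minimal sketch, the same request can be assembled and sent from Python using only the standard library. The chat endpoint path below is an assumption (the OpenAI-compatible /v1/chat/completions path under the base URL used elsewhere on this page), and build_payload and send are illustrative helper names, not part of any LLMBase SDK.

```python
import json
import urllib.request

# Assumption: the chat endpoint follows the OpenAI-compatible path under the
# same base URL that this page uses for /v1/models.
CHAT_URL = "https://api.llmbase.ai/v1/chat/completions"


def build_payload(cache_key: str, system_prompt: str, user_message: str) -> dict:
    """Build a request body with a stable prompt_cache_key and the static prefix first."""
    return {
        "model": "deepseek/deepseek-v4-flash",
        "prompt_cache_key": cache_key,
        "messages": [
            # Long, unchanging instructions come first so they can be cached.
            {"role": "system", "content": system_prompt},
            # The part that changes per request comes last.
            {"role": "user", "content": user_message},
        ],
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling send(build_payload("workspace-123-agent", system_prompt, question), api_key) twice with the same system_prompt reuses the same key and prefix, which is the condition under which a supported model may report cached tokens on the second call.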

Prompt layout for cache hits

Good cache candidates are long, repeated prefixes:

[
  { "role": "system", "content": "Long product policy, coding rules, or workspace instructions..." },
  { "role": "user", "content": "Current short task goes here." }
]

Avoid putting timestamps, request IDs, random examples, or the latest user message before a large static prompt. If the first part of the prompt changes on every request, the selected model cannot reuse the cached prefix.
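
The layout rule can be sketched as a small message builder that always emits the static instructions first and appends per-request values afterwards. The names build_messages, shared_prefix_length, and STATIC_INSTRUCTIONS are hypothetical, and the prefix check compares whole messages only; real caches match on the tokenized prefix, so treat it as a rough sanity check rather than a guarantee of a cache hit.

```python
from datetime import datetime, timezone

# Hypothetical static prefix; in practice this would be the long policy or rules text.
STATIC_INSTRUCTIONS = "Long product policy, coding rules, or workspace instructions..."


def build_messages(task: str, now: datetime) -> list[dict]:
    """Place the static prefix first and everything that changes per request last."""
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        # The timestamp and the task differ on every call, so they follow the prefix.
        {"role": "user", "content": f"Current time: {now.isoformat()}\n{task}"},
    ]


def shared_prefix_length(first: list[dict], second: list[dict]) -> int:
    """Count how many leading messages are identical between two requests."""
    count = 0
    for a, b in zip(first, second):
        if a != b:
            break
        count += 1
    return count
```

If the timestamp were placed inside the system message instead, shared_prefix_length would return 0 for any two requests made at different times, which mirrors the case where the model cannot reuse the cached prefix.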

Cache usage and billing

The response usage object reports cache hits:

{
  "usage": {
    "prompt_tokens": 4096,
    "completion_tokens": 256,
    "total_tokens": 4352,
    "prompt_tokens_details": {
      "cached_tokens": 3072
    }
  }
}

If cached_tokens is absent or 0, the request did not receive a cache hit. If cached tokens are reported but the selected model has no input_cache_read price, LLMBase bills those tokens at the normal input-token price.
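
A short sketch of reading the cache fields from a response, treating an absent cached_tokens field the same as 0 as described above. The helper names are illustrative, and the ratio assumes that cached_tokens is counted as part of prompt_tokens, as in the sample usage object.

```python
def cached_tokens(usage: dict) -> int:
    """Return the reported cached prompt tokens, treating a missing field as 0."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens reported as cached (0.0 when there was no hit)."""
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens(usage) / prompt_tokens
```

Logging cache_hit_ratio(response["usage"]) per request is one way to notice when a prompt change has started to break the cached prefix.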

Use the model metadata endpoint to read token prices before you run a workload:

curl "https://api.llmbase.ai/v1/models?metadata=true" \
  -H "Authorization: Bearer $LLMBASE_API_KEY"

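For a rough estimate, where input_tokens counts only the uncached part of the prompt (prompt_tokens minus cached_tokens in the usage object):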

cost = input_tokens * pricing.prompt
     + cached_input_tokens * pricing.input_cache_read
     + output_tokens * pricing.completion

When pricing.input_cache_read is missing, cached input tokens use the normal input-token price.
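
The estimate can be sketched as a function over a usage object and a pricing object. Two assumptions are baked in and should be checked against real metadata: cached_tokens is a subset of prompt_tokens (as the sample usage object suggests), and all prices are expressed per token in the same unit. The function name estimate_cost is illustrative.

```python
def estimate_cost(usage: dict, pricing: dict) -> float:
    """Rough request cost from token counts and per-token prices."""
    details = usage.get("prompt_tokens_details") or {}
    cached = int(details.get("cached_tokens") or 0)
    # Assumption: cached tokens are included in prompt_tokens, so subtract them
    # to get the tokens billed at the normal input price.
    uncached = int(usage["prompt_tokens"]) - cached

    # float() accepts both numeric and string prices.
    prompt_price = float(pricing["prompt"])
    completion_price = float(pricing["completion"])

    # Fall back to the normal input price when no cache-read price is configured.
    cache_price = pricing.get("input_cache_read")
    cache_price = prompt_price if cache_price is None else float(cache_price)

    return (
        uncached * prompt_price
        + cached * cache_price
        + int(usage["completion_tokens"]) * completion_price
    )
```

With the fallback in place, a model without input_cache_read produces the same figure as billing every prompt token at the normal input price.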

Model table

Models with cache-read pricing include pricing.input_cache_read in metadata and show a cache-read price in the live Models table.
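
To pick out those models programmatically, a filter over an already-fetched metadata response might look like the sketch below. The response shape is an assumption: a top-level "data" list whose entries carry "id" and "pricing" fields, in the style of OpenAI-compatible model listings. Adjust the field names to match what the endpoint actually returns.

```python
def models_with_cache_read(metadata: dict) -> dict:
    """Map model id to its input_cache_read price for models that publish one."""
    result = {}
    # Assumption: models are listed under a top-level "data" key.
    for model in metadata.get("data", []):
        price = (model.get("pricing") or {}).get("input_cache_read")
        if price is not None:
            result[model["id"]] = price
    return result
```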