Models
Available models and how to list them via the API.
Available models
| Model ID | Input / 1M tokens | Output / 1M tokens | Context window |
|---|---|---|---|
| deepseek/deepseek-v3.2 | $0.30 | $0.50 | 163,840 tokens |
| deepseek/deepseek-v4-flash | $0.20 | $0.30 | 1,048,576 tokens |
| deepseek/deepseek-v4-pro | $1.80 | $3.60 | 1,048,576 tokens |
| google/gemma-4-26b-a4b-it | $0.13 | $0.40 | 262,144 tokens |
| minimax/minimax-m2.5 | $0.30 | $1.20 | 196,608 tokens |
| moonshotai/kimi-k2.5 | $0.60 | $2.80 | 262,144 tokens |
| moonshotai/kimi-k2.6 | $0.90 | $4.00 | 262,144 tokens |
| nvidia/nemotron-3-nano-omni-30b-a3b-reasoning | $0.30 | $1.00 | 262,144 tokens |
| openai/gpt-oss-120b | $0.09 | $0.36 | 131,072 tokens |
| paddlepaddle/paddleocr-vl-0.9b | $0.20 | $1.00 | 32,768 tokens |
| qwen/qwen3-coder | $0.40 | $1.60 | 262,144 tokens |
| qwen/qwen3.5-122b-a10b | $0.40 | $3.00 | 262,144 tokens |
| qwen/qwen3.5-35b-a3b | $0.30 | $2.50 | 262,144 tokens |
| qwen/qwen3.5-397b-a17b | $0.60 | $3.60 | 262,144 tokens |
| qwen/qwen3.5-9b | $0.80 | $1.25 | 256,000 tokens |
| qwen/qwen3.6-35b-a3b | $0.25 | $1.20 | 262,144 tokens |
| z-ai/glm-5 | $1.00 | $3.20 | 80,000 tokens |
| z-ai/glm-5.1 | $1.40 | $4.40 | 198,000 tokens |
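To make the per-1M-token prices concrete, here is a minimal sketch that estimates the cost of a single request. The helper `estimateCostUsd` and the `PRICES_PER_MILLION` map are hypothetical names introduced for illustration; the two price entries are copied from the table above, and the token counts are made up.

```javascript
// Per-1M-token prices (USD) copied from the table above.
const PRICES_PER_MILLION = {
  "deepseek/deepseek-v4-flash": { input: 0.2, output: 0.3 },
  "deepseek/deepseek-v4-pro": { input: 1.8, output: 3.6 },
};

// Hypothetical helper: estimate the cost of one request in USD.
function estimateCostUsd(modelId, inputTokens, outputTokens) {
  const price = PRICES_PER_MILLION[modelId];
  if (!price) throw new Error(`No price entry for ${modelId}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Illustrative request: 200,000 input tokens and 10,000 output tokens.
console.log(estimateCostUsd("deepseek/deepseek-v4-flash", 200_000, 10_000).toFixed(4)); // 0.0430
console.log(estimateCostUsd("deepseek/deepseek-v4-pro", 200_000, 10_000).toFixed(4)); // 0.3960
```

In a real application, prefer reading prices from the API at runtime over copying them into code, since list prices can change.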
LLMBase routes your request automatically through our managed inference network. You reference models by their unified ID and do not need to configure routing yourself.
The inference model list is intentionally curated for direct API usage. The agent model list is derived from this same registry, but it only returns the chat/tool-capable models that are safe for subscription-backed agents. Use the direct inference API when you need OpenAI-compatible inference billing, prompt-cache pricing, and predictable API costs. Use Agent integrations when you want an OpenAI-compatible agent to consume a Pro chat subscription.
Choosing a model
- Coding agents and tool calling: Start with deepseek/deepseek-v4-flash, z-ai/glm-5.1, or qwen/qwen3-coder.
- Long-context reasoning: Use deepseek/deepseek-v4-pro, deepseek/deepseek-v3.2, or moonshotai/kimi-k2.5.
- Fast everyday inference: Use qwen/qwen3.5-9b, qwen/qwen3.5-35b-a3b, or minimax/minimax-m2.5.
For production systems, choose by capability first and price second:
| Workload | What to inspect |
|---|---|
| Agent loop | supported_features includes tools; check cache-read pricing |
| JSON extraction | supported_features includes json_mode or structured_outputs |
| Reasoning traces | supported_features includes reasoning; supported_parameters includes reasoning_effort |
| Ranking or confidence | supported_features includes logprobs |
| Vision or OCR | input_modalities includes image or file |
| Long documents | context_length, max_output_length, and token price |
If your client sends a capability the selected model does not advertise,
LLMBase returns a 400 error instead of running an incompatible request.
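The "capability first, price second" rule can be sketched as a filter followed by a sort. This is an illustration only: `pickModels` is a hypothetical helper, the sample entries use made-up IDs and prices, and the code assumes that `pricing.prompt` may arrive as either a number or a numeric string.

```javascript
// Illustrative entries shaped like the rich metadata format; IDs and prices are made up.
const sampleModels = [
  { id: "example/small", supported_features: ["json_mode"], pricing: { prompt: "0.0000001" } },
  { id: "example/agent-large", supported_features: ["tools", "json_mode", "structured_outputs"], pricing: { prompt: "0.0000009" } },
  { id: "example/agent-fast", supported_features: ["tools", "json_mode"], pricing: { prompt: "0.0000002" } },
];

// Treat a missing price as infinitely expensive so it sorts last.
function promptPrice(model) {
  const value = Number(model.pricing?.prompt);
  return Number.isFinite(value) ? value : Infinity;
}

// Hypothetical helper: keep models that advertise every required feature,
// then order the survivors by input-token price, cheapest first.
function pickModels(models, requiredFeatures) {
  return models
    .filter((m) => requiredFeatures.every((f) => m.supported_features?.includes(f)))
    .sort((a, b) => {
      const pa = promptPrice(a);
      const pb = promptPrice(b);
      return pa === pb ? 0 : pa < pb ? -1 : 1;
    });
}

console.log(pickModels(sampleModels, ["tools"]).map((m) => m.id));
// [ 'example/agent-fast', 'example/agent-large' ]
```

Sorting by input price alone is a simplification; output-heavy workloads may want to weight `pricing.completion` as well.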
List models — GET /v1/models
Returns all available models in the OpenAI models format. The Worker does not require account authorization for this endpoint, but send your normal Bearer header anyway. The header lets Cloudflare recognize the request as API traffic and skip browser-style challenge handling.
curl https://api.llmbase.ai/v1/models \
  -H "Authorization: Bearer $LLMBASE_API_KEY"
Response
{
  "object": "list",
  "data": [
    {
      "id": "deepseek/deepseek-v4-flash",
      "object": "model",
      "created": 1700000000,
      "owned_by": "deepseek",
      "name": "DeepSeek: DeepSeek V4 Flash",
      "description": "Efficiency-focused DeepSeek V4 MoE model for high-throughput coding, reasoning, and agent workflows.",
      "context_length": 1048576
    },
    {
      "id": "deepseek/deepseek-v4-pro",
      "object": "model",
      "created": 1700000000,
      "owned_by": "deepseek",
      "name": "DeepSeek: DeepSeek V4 Pro",
      "description": "Flagship DeepSeek V4 MoE model for long-context reasoning, coding, and autonomous agent tasks.",
      "context_length": 1048576
    }
  ]
}
With the OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmbase.ai/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id);
}
Agent-compatible models
OpenAI-compatible agents that use an llmbase_chat_... key should list models
from the agent API, not from the direct inference API:
curl https://llmbase.ai/api/v1/agents/models \
  -H "Authorization: Bearer $LLMBASE_CHAT_AGENT_KEY"
This endpoint is backed by the same model registry as GET /v1/models, but it
returns only chat/tool-capable models available for subscription-backed agents.
The list is generated dynamically from the inference registry, so agents inherit
new eligible chat/tool-capable models without LLMBase maintaining a
separate static OpenClaw or Hermes allowlist.
Models that are not returned by /api/v1/agents/models cannot be used with
llmbase_chat_... keys. Use an llmbase_... key and
https://api.llmbase.ai/v1 when you need direct inference billing or a model
outside the agent-compatible list.
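A client that accepts either key type could pick the listing endpoint from the key prefix. The sketch below is an assumption-laden illustration: `modelsUrlForKey` and `listModelIds` are hypothetical helpers, and `listModelIds` assumes the agent endpoint returns the same `{ data: [{ id }] }` shape as GET /v1/models, which this page does not state explicitly.

```javascript
const DIRECT_MODELS_URL = "https://api.llmbase.ai/v1/models";
const AGENT_MODELS_URL = "https://llmbase.ai/api/v1/agents/models";

// Hypothetical helper: choose the model-listing URL that matches the key type.
function modelsUrlForKey(apiKey) {
  return apiKey.startsWith("llmbase_chat_") ? AGENT_MODELS_URL : DIRECT_MODELS_URL;
}

// Hypothetical helper: fetch the model IDs visible to this key.
// Assumes both endpoints respond with an OpenAI-style list object.
async function listModelIds(apiKey) {
  const res = await fetch(modelsUrlForKey(apiKey), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`Model listing failed with HTTP ${res.status}`);
  const { data } = await res.json();
  return data.map((model) => model.id);
}
```

An agent can call `listModelIds` at startup and reject any configured model ID that is not in the returned list, rather than discovering the mismatch on the first request.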
Smart fallback
POST /v1/chat/completions uses smart fallback by default. You always send one
unified model ID; if the requested model is temporarily unavailable, LLMBase may
serve a capability-safe fallback model instead.
Fallback candidates must satisfy the request’s required modalities and features.
Requests with images, tools, or structured output only fall back to matching
models. Fallback selection is price-aware, and when fallback applies, LLMBase
bills the cheaper of the requested model and the served model for the actual
token usage.
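One possible reading of the billing rule is: price the actual token usage at both models' rates and charge the lower total. The sketch below follows that reading; it is an assumption, not a statement of how LLMBase computes invoices. `usageCost` and `fallbackBilledCost` are hypothetical helpers, the per-token prices are the DeepSeek V4 Pro and Flash list prices from the table divided by one million, and the usage numbers are made up.

```javascript
// Cost of a usage record at one model's per-token prices (USD).
function usageCost(pricing, usage) {
  return (
    usage.prompt_tokens * Number(pricing.prompt) +
    usage.completion_tokens * Number(pricing.completion)
  );
}

// Assumed interpretation: bill whichever model makes this usage cheaper.
function fallbackBilledCost(requestedPricing, servedPricing, usage) {
  return Math.min(usageCost(requestedPricing, usage), usageCost(servedPricing, usage));
}

const requested = { prompt: 1.8e-6, completion: 3.6e-6 }; // deepseek/deepseek-v4-pro
const served = { prompt: 2e-7, completion: 3e-7 }; // deepseek/deepseek-v4-flash
const exampleUsage = { prompt_tokens: 1000, completion_tokens: 500 };

console.log(fallbackBilledCost(requested, served, exampleUsage).toFixed(5)); // 0.00035
```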
Response headers report x-llmbase-requested-model,
x-llmbase-served-model, and x-llmbase-fallback-applied, with fallback reason
and chain headers when fallback is used. Send X-LLMBase-Fallback: off to
hard-fail instead. The initial rollout covers /v1/chat/completions.
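A client might opt out of fallback per request and log which model actually served a response. The following is a minimal sketch: `chatRequestInit` and `readFallbackInfo` are hypothetical helpers, and because this page does not specify the value format of `x-llmbase-fallback-applied`, the code returns the raw header string instead of parsing it.

```javascript
// Hypothetical helper: build fetch() options for POST /v1/chat/completions.
// Pass { allowFallback: false } to send X-LLMBase-Fallback: off.
function chatRequestInit(apiKey, body, { allowFallback = true } = {}) {
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  if (!allowFallback) headers["X-LLMBase-Fallback"] = "off";
  return { method: "POST", headers, body: JSON.stringify(body) };
}

// Hypothetical helper: collect the fallback headers from a response.
// `headers` is anything with a get(name) method, such as fetch's Response.headers.
function readFallbackInfo(headers) {
  return {
    requestedModel: headers.get("x-llmbase-requested-model"),
    servedModel: headers.get("x-llmbase-served-model"),
    // Raw value; the exact format is not documented on this page.
    fallbackApplied: headers.get("x-llmbase-fallback-applied"),
  };
}

// Usage (inside an async function):
//   const res = await fetch("https://api.llmbase.ai/v1/chat/completions",
//     chatRequestInit(process.env.LLMBASE_API_KEY, { model, messages }));
//   console.log(readFallbackInfo(res.headers));
```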
Model metadata and capabilities
Use the OpenAI-compatible GET /v1/models endpoint for SDK model discovery. Use
the model metadata endpoint when your application needs richer production
metadata such as context length, maximum output length, pricing, modalities,
supported sampling parameters, and certified features:
curl "https://api.llmbase.ai/v1/models?metadata=true" \
  -H "Authorization: Bearer $LLMBASE_API_KEY"
GET /v1/model-metadata is kept as a compatibility alias for the same rich
metadata format. New clients should prefer /v1/models?metadata=true so model
discovery and metadata use one endpoint family.
Each entry includes:
| Field | Description |
|---|---|
| id | Stable LLMBase model ID used in API requests |
| context_length | Maximum input + output context window |
| max_output_length | Maximum generated tokens for one response |
| input_modalities / output_modalities | Supported text/image input and output modes |
| pricing.prompt / pricing.completion | USD per input or output token |
| pricing.input_cache_read | Cached-input token price when prompt-cache reads are supported |
| supported_parameters | Parameters such as temperature, top_p, max_tokens, logprobs, top_logprobs, or reasoning_effort |
| supported_features | Higher-level features such as tools, json_mode, structured_outputs, reasoning, or logprobs |
Before sending advanced options like response_format, tools, logprobs, or
top_logprobs, choose a model that advertises the matching capability. If a
request asks for a feature that the selected model does not support, LLMBase
returns an OpenAI-style 400 error instead of running an incompatible request.
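A client can run the same check locally before sending a request and so avoid the 400 round trip. The sketch below is illustrative: `missingCapabilities` is a hypothetical helper, and the mapping from request fields to capability names (for example `response_format.type` of `json_object` to `json_mode`, and `json_schema` to `structured_outputs`) follows the usual OpenAI conventions and is an assumption about how LLMBase matches them.

```javascript
// Hypothetical helper: list capabilities the request needs but the model
// (one entry from /v1/models?metadata=true) does not advertise.
function missingCapabilities(model, request) {
  const features = model.supported_features ?? [];
  const params = model.supported_parameters ?? [];
  const missing = [];

  if (request.tools?.length && !features.includes("tools")) missing.push("tools");

  // Assumed mapping from response_format.type to feature names.
  const format = request.response_format?.type;
  if (format === "json_object" && !features.includes("json_mode")) missing.push("json_mode");
  if (format === "json_schema" && !features.includes("structured_outputs")) missing.push("structured_outputs");

  const wantsLogprobs = Boolean(request.logprobs) || request.top_logprobs != null;
  if (wantsLogprobs && !features.includes("logprobs")) missing.push("logprobs");

  if (request.reasoning_effort != null && !params.includes("reasoning_effort")) {
    missing.push("reasoning_effort");
  }
  return missing;
}

// Illustrative metadata entry and request; the ID is made up.
const exampleModel = {
  id: "example/agent-fast",
  supported_features: ["tools", "json_mode"],
  supported_parameters: ["temperature", "top_p", "max_tokens"],
};
const exampleRequest = {
  tools: [{ type: "function", function: { name: "lookup" } }],
  response_format: { type: "json_schema" },
  reasoning_effort: "high",
};

console.log(missingCapabilities(exampleModel, exampleRequest));
// [ 'structured_outputs', 'reasoning_effort' ]
```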
Current structured-output and reasoning models
The following production models currently advertise both JSON modes
(json_mode and structured_outputs) and reasoning support. They can be used
for schema-validated extraction with response_format, and for reasoning
workflows with reasoning_effort or the supported thinking template flags:
| Model ID | JSON mode | JSON Schema | Reasoning | Notes |
|---|---|---|---|---|
| deepseek/deepseek-v4-flash | Yes | Yes | Yes | Fast DeepSeek V4 route for high-throughput agents and extraction |
| deepseek/deepseek-v4-pro | Yes | Yes | Yes | Flagship DeepSeek V4 route for long-context reasoning and coding |
| deepseek/deepseek-v3.2 | Yes | Yes | Yes | Stable DeepSeek structured-output baseline |
| google/gemma-4-26b-a4b-it | Yes | Yes | Yes | Multimodal Gemma route with text, image, and file input |
| z-ai/glm-5.1 | Yes | Yes | Yes | Current GLM flagship route for agentic engineering |
| z-ai/glm-5 | Yes | Yes | Yes | GLM 5 also advertises logprobs on certified paths |
| moonshotai/kimi-k2.6 | Yes | Yes | Yes | Long-horizon Kimi route with multimodal input and cache-read pricing |
| moonshotai/kimi-k2.5 | Yes | Yes | Yes | Kimi route for visual reasoning, coding, and tool calling |
Additional models may advertise a subset of these capabilities. For example,
qwen/qwen3-coder supports tools and structured outputs, while
paddlepaddle/paddleocr-vl-0.9b supports JSON modes for OCR-style extraction.
Always rely on /v1/models?metadata=true at runtime instead of hard-coding a
static capability list.
Filter models programmatically
const res = await fetch("https://api.llmbase.ai/v1/models?metadata=true", {
  headers: { Authorization: `Bearer ${process.env.LLMBASE_API_KEY}` },
});
const { data } = await res.json();

// Keep only models that support tools and advertise cache-read pricing.
const toolModels = data.filter((model) =>
  model.supported_features?.includes("tools") &&
  model.pricing?.input_cache_read
);
console.log(toolModels.map((model) => model.id));
For agent-capable models only, add the agents filter:
curl "https://api.llmbase.ai/v1/models?filter=agents&metadata=true" \
  -H "Authorization: Bearer $LLMBASE_API_KEY"
This is useful for agents and SaaS products where model lists should update automatically as LLMBase adds new eligible models.
Prompt-cache pricing
Some inference models support prompt-cache reads. When a response reports cached
prompt tokens in usage.prompt_tokens_details.cached_tokens, LLMBase bills those
cached input tokens at that model’s cache-read price instead of the normal
input-token price.
Models that do not report cached tokens, or do not have cache-read pricing, are billed at the normal input-token price. This prevents unsupported models from receiving an accidental discount.
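The two billing cases above can be sketched as a single function. This is an illustration under stated assumptions: `promptCost` is a hypothetical helper, the prices are made up, and the code assumes that `cached_tokens` is counted inside `prompt_tokens` (the OpenAI convention), so cached tokens are subtracted before the normal input price is applied.

```javascript
// Hypothetical helper: input-side cost (USD) for one response.
// `pricing` comes from model metadata; `usage` from the chat completion response.
function promptCost(pricing, usage) {
  const hasCachePrice = pricing.input_cache_read != null;
  // Without cache-read pricing, every prompt token is billed at the normal price.
  const cached = hasCachePrice ? usage.prompt_tokens_details?.cached_tokens ?? 0 : 0;
  const uncached = usage.prompt_tokens - cached;
  const cacheRate = hasCachePrice ? Number(pricing.input_cache_read) : 0;
  return uncached * Number(pricing.prompt) + cached * cacheRate;
}

// Illustrative numbers only.
const cachedUsage = { prompt_tokens: 10000, prompt_tokens_details: { cached_tokens: 8000 } };

console.log(promptCost({ prompt: 2e-7, input_cache_read: 5e-8 }, cachedUsage).toFixed(4)); // 0.0008
console.log(promptCost({ prompt: 2e-7 }, cachedUsage).toFixed(4)); // 0.0020
```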
You can inspect cache capability in the model metadata endpoint:
curl "https://api.llmbase.ai/v1/models?metadata=true" \
  -H "Authorization: Bearer $LLMBASE_API_KEY"
Models with cache-read pricing include pricing.input_cache_read in the
response.
Unsupported model families
The direct inference API is intentionally curated for chat and multimodal
inference. Model families such as embeddings, rerankers, native image
generation, audio generation, and native classifiers are not exposed
through POST /v1/chat/completions unless they are represented as a supported
chat model in /v1/models.
For those workloads, use the matching LLMBase product surface when available or a native API designed for that model family.