Inference
Chat completions
Full reference for POST /v1/chat/completions — the core inference endpoint.
POST https://api.llmbase.ai/v1/chat/completions
The chat completions endpoint follows the OpenAI Chat Completions API. Clients built on the OpenAI SDK can use it by changing the base URL, API key, and model ID.
This page documents the direct inference API. If you are configuring OpenClaw,
Hermes, or another external agent to use a Pro chat subscription, use
https://llmbase.ai/api/v1/agents/chat/completions with a
llmbase_chat_... key instead. See Agent integrations.
Request body
Required fields
| Field | Type | Description |
|---|---|---|
| model | string | Model ID, e.g. "deepseek/deepseek-v4-flash". See Models. |
| messages | array | Conversation history. At least one message is required. |
Messages
Each message is an object with a role and content. Assistant messages can
also include tool_calls, and tool results use role: "tool" with the matching
tool_call_id.
[
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Summarise this article: ..." },
{ "role": "assistant", "content": "Here is a summary: ..." },
{ "role": "user", "content": "Make it shorter." }
]
Roles:
| Role | Description |
|---|---|
| system | Sets the behaviour and persona of the assistant |
| user | A message from the end user |
| assistant | A previous response from the model (for multi-turn conversations) |
| tool | Result returned by your application after executing a model-requested tool |
User messages support multimodal content (text + images) as an array of parts:
{
"role": "user",
"content": [
{ "type": "text", "text": "What is in this image?" },
{ "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } }
]
}
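As a rough sketch of how an application might assemble the multimodal shape shown above, the helper below builds a user message from one text string and a list of image URLs. The type names and the `userMessageWithImages` function are hypothetical and exist only for illustration; the part shapes are copied from the example above.

```typescript
// Content part shapes as shown in the multimodal example above.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ContentPart = TextPart | ImagePart;

interface UserMessage {
  role: "user";
  content: string | ContentPart[];
}

// Hypothetical helper: one text part first, then one image_url part per URL.
function userMessageWithImages(text: string, imageUrls: string[]): UserMessage {
  const parts: ContentPart[] = [{ type: "text", text }];
  for (const url of imageUrls) {
    parts.push({ type: "image_url", image_url: { url } });
  }
  return { role: "user", content: parts };
}

const imageQuestion = userMessageWithImages("What is in this image?", [
  "https://example.com/image.jpg",
]);
console.log(JSON.stringify(imageQuestion, null, 2));
```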
Optional parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| stream | boolean | false | Stream tokens as SSE. See Streaming. |
| temperature | number | model default | Sampling temperature, 0 to 2. Lower values are more deterministic. |
| top_p | number | model default | Nucleus sampling probability mass, 0 to 1. |
| top_k | number | model default | Limit sampling to the top K tokens when supported by the selected model. |
| min_p | number | model default | Minimum probability sampling cutoff when supported. |
| max_tokens | integer | model max | Maximum tokens to generate. |
| frequency_penalty | number | 0 | Penalises tokens in proportion to how often they have already appeared. Range -2.0 to 2.0. |
| presence_penalty | number | 0 | Penalises tokens that have appeared at all. Range -2.0 to 2.0. |
| repetition_penalty | number | model default | Provider-supported repetition penalty. |
| stop | string or string[] | - | Up to 4 sequences where generation stops. |
| seed | integer | - | Fixed seed for deterministic outputs (best-effort). |
| logprobs | boolean | - | Return output-token log probabilities on models that expose them. |
| top_logprobs | integer | - | Return the most likely token alternatives for each generated token. Requires logprob support. |
| response_format | object | - | Request JSON output. See Structured outputs. |
| reasoning_effort | string | - | Portable reasoning-effort hint on models that advertise reasoning. Common values are low, medium, and high. |
| prompt_cache_key | string | - | LLMBase prompt-cache namespace. Cached prompt tokens are billed at the model’s cache-read price when supported and reported. |
| tools | array | - | OpenAI-compatible function tool definitions. |
| tool_choice | string or object | auto | auto, none, required, or a specific function tool. |
LLMBase documents the portable OpenAI-compatible fields it supports directly.
Vendor-specific routing fields such as provider, route, models,
plugins, debug, service_tier, and top-level cache_control are not part
of the LLMBase direct inference contract. Route by choosing an LLMBase model ID;
use prompt_cache_key for prompt-cache grouping.
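When migrating request bodies written for another gateway, one option is to drop the vendor-specific fields named above before sending. The sketch below is a hypothetical client-side helper, not part of any LLMBase SDK. The text introduces its list with "such as", so the field list here may be incomplete, and how the API reacts when these fields are present is not stated on this page.

```typescript
// Top-level fields the paragraph above names as outside the direct inference contract.
// Assumption: this list may not be exhaustive.
const VENDOR_FIELDS = [
  "provider",
  "route",
  "models",
  "plugins",
  "debug",
  "service_tier",
  "cache_control",
] as const;

// Hypothetical helper: returns a shallow copy without those fields,
// plus the names that were dropped so the caller can log them.
function stripVendorFields(body: Record<string, unknown>): {
  cleaned: Record<string, unknown>;
  dropped: string[];
} {
  const cleaned: Record<string, unknown> = { ...body };
  const dropped: string[] = [];
  for (const field of VENDOR_FIELDS) {
    if (field in cleaned) {
      delete cleaned[field];
      dropped.push(field);
    }
  }
  return { cleaned, dropped };
}

const migrated = stripVendorFields({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Hello" }],
  provider: { order: ["example"] },
  service_tier: "default",
});
console.log(migrated.dropped);
```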
Choosing request options
Start with the smallest request surface that solves the job:
| Goal | Recommended fields |
|---|---|
| Chat UI | messages, stream: true, max_tokens |
| Backend extraction | response_format, low temperature, max_tokens |
| Tool-using agent | tools, tool_choice, prompt_cache_key |
| Confidence scoring | logprobs, top_logprobs on a model that supports logprobs |
| Cost control | max_tokens, prompt_cache_key, model pricing from /v1/model-metadata |
If a request needs a model-specific capability, choose the model from
/v1/model-metadata first. LLMBase rejects unsupported combinations instead
of silently ignoring required features.
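A minimal sketch of choosing a model from metadata before sending a request. The field names supported_features and supported_parameters come from this page; the surrounding record shape (an array of objects with an `id` field) and the model IDs in the sample data are assumptions made for illustration, so check the real /v1/model-metadata response before relying on them.

```typescript
// Assumed shape of one metadata entry; only supported_features and
// supported_parameters are named in this document.
interface ModelMetadata {
  id: string;
  supported_features: string[];
  supported_parameters: string[];
}

// Keep only models that advertise every required feature and parameter.
function findCompatibleModels(
  models: ModelMetadata[],
  requiredFeatures: string[],
  requiredParameters: string[] = [],
): ModelMetadata[] {
  return models.filter(
    (m) =>
      requiredFeatures.every((f) => m.supported_features.includes(f)) &&
      requiredParameters.every((p) => m.supported_parameters.includes(p)),
  );
}

// Illustrative data only; these are not real model IDs or capabilities.
const sampleMetadata: ModelMetadata[] = [
  { id: "example/model-a", supported_features: ["json_mode"], supported_parameters: ["temperature"] },
  {
    id: "example/model-b",
    supported_features: ["json_mode", "reasoning"],
    supported_parameters: ["temperature", "reasoning_effort"],
  },
];

const reasoningModels = findCompatibleModels(sampleMetadata, ["reasoning"], ["reasoning_effort"]);
console.log(reasoningModels.map((m) => m.id));
```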
Prompt caching
LLMBase applies prompt_cache_key on models that support OpenAI-compatible
prompt caching. Use a stable key per application, user, workspace, or long-lived
agent conversation when the beginning of the prompt stays mostly the same.
Prompt caching is automatic on supported models after the cache is warm. The
first request normally writes or warms the cache and is billed as regular input.
Later requests with the same stable prompt prefix can return
usage.prompt_tokens_details.cached_tokens. Those cached prompt tokens are then
billed at the model’s input_cache_read price when that price is configured.
Subscription-agent requests use the same prompt-cache behavior. LLMBase adds a
stable prompt_cache_key for agent calls when the client does not provide one,
so compatible models can reuse repeated long prompt prefixes.
{
"model": "deepseek/deepseek-v4-flash",
"prompt_cache_key": "workspace-123-agent",
"messages": [
{ "role": "system", "content": "You are a coding agent for this repository..." },
{ "role": "user", "content": "Continue from the last task." }
]
}
Keep the cacheable prefix stable. Put changing instructions, timestamps, temporary files, and user questions later in the message list so repeated requests can reuse the cached prefix. Short prompts may not be cached by the selected model.
Prompt layout for cache hits
Good cache candidates are long, repeated prefixes:
[
{ "role": "system", "content": "Long product policy, coding rules, or workspace instructions..." },
{ "role": "user", "content": "Current short task goes here." }
]
Avoid putting timestamps, request IDs, random examples, or the latest user message before a large static prompt. If the first part of the prompt changes on every request, the selected model cannot reuse the cached prefix.
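The layout advice above can be sketched as a small message builder that always places the static instructions first and pushes per-request details to the end. The function and parameter names are hypothetical; whether a given prompt is actually cached still depends on the selected model.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical helper: static instructions lead, prior turns follow,
// and anything that changes per request (timestamps, IDs, the new task) goes last.
function buildCacheFriendlyMessages(
  staticInstructions: string,
  history: ChatMessage[],
  volatileContext: string,
  task: string,
): ChatMessage[] {
  return [
    { role: "system", content: staticInstructions },
    ...history,
    { role: "user", content: `${volatileContext}\n\n${task}` },
  ];
}

const rules = "Long product policy, coding rules, or workspace instructions...";
const firstRequest = buildCacheFriendlyMessages(rules, [], "Request time: 09:00", "Summarise open tickets.");
const secondRequest = buildCacheFriendlyMessages(rules, [], "Request time: 09:05", "List blocked tickets.");

// The leading system message is identical across both requests.
console.log(JSON.stringify(firstRequest[0]) === JSON.stringify(secondRequest[0]));
```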
Cache usage and billing
The response usage object reports cache hits:
{
"usage": {
"prompt_tokens": 4096,
"completion_tokens": 256,
"total_tokens": 4352,
"prompt_tokens_details": {
"cached_tokens": 3072
}
}
}
If cached_tokens is absent or 0, the request did not receive a cache hit. If
cached tokens are reported but the selected model has no
input_cache_read price, LLMBase bills those tokens at the normal input-token
price.
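To track how often requests hit the cache, an application could compute the cached share of prompt tokens from the usage object. This is a sketch under the assumption that cached_tokens is counted within prompt_tokens, as in the example above where 3072 of 4096 prompt tokens are cached; the `cacheHitRatio` helper is hypothetical.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fraction of prompt tokens served from cache; 0 when the field is absent.
function cacheHitRatio(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
}

const cachedUsage: Usage = {
  prompt_tokens: 4096,
  completion_tokens: 256,
  total_tokens: 4352,
  prompt_tokens_details: { cached_tokens: 3072 },
};
console.log(cacheHitRatio(cachedUsage)); // 0.75
```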
Estimating request cost
Use the model metadata endpoint to read token prices before you run a workload:
curl https://api.llmbase.ai/v1/model-metadata \
-H "Authorization: Bearer $LLMBASE_API_KEY"
For a rough estimate, where input_tokens counts only the prompt tokens that were not served from cache:
cost = input_tokens * pricing.prompt
+ cached_input_tokens * pricing.input_cache_read
+ output_tokens * pricing.completion
When pricing.input_cache_read is missing, cached input tokens use the normal
input-token price.
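The estimate can be sketched in code as follows. The pricing field names (prompt, completion, input_cache_read) are taken from this page; the assumptions are that prices are expressed per token, that prompt_tokens includes any cached tokens, and that the `estimateCost` helper and the sample prices are purely illustrative.

```typescript
interface Pricing {
  prompt: number;
  completion: number;
  input_cache_read?: number;
}

interface BilledUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Hypothetical helper implementing the rough estimate above.
function estimateCost(usage: BilledUsage, pricing: Pricing): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  // Assumption: cached tokens are a subset of prompt_tokens.
  const uncached = Math.max(usage.prompt_tokens - cached, 0);
  // Fall back to the normal input price when no cache-read price is configured.
  const cacheReadPrice = pricing.input_cache_read ?? pricing.prompt;
  return (
    uncached * pricing.prompt +
    cached * cacheReadPrice +
    usage.completion_tokens * pricing.completion
  );
}

// Made-up integer prices so the arithmetic is easy to follow.
const exampleUsage: BilledUsage = {
  prompt_tokens: 4096,
  completion_tokens: 256,
  prompt_tokens_details: { cached_tokens: 3072 },
};
console.log(estimateCost(exampleUsage, { prompt: 2, completion: 4, input_cache_read: 1 })); // 6144
console.log(estimateCost(exampleUsage, { prompt: 2, completion: 4 })); // 9216
```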
Non-streaming response
{
"id": "chatcmpl-a1b2c3d4e5f6a1b2c3d4e5f6",
"object": "chat.completion",
"created": 1741000000,
"model": "deepseek/deepseek-v4-flash",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! I can help you with a wide range of tasks..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 48,
"total_tokens": 60,
"prompt_tokens_details": {
"cached_tokens": 8
}
}
}
finish_reason values
| Value | Meaning |
|---|---|
| stop | Model finished naturally |
| length | max_tokens limit reached |
| content_filter | Response was filtered |
| tool_calls | Model called one or more tools |
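One way an application might branch on these values is sketched below. The action names returned by `nextAction` are hypothetical labels for illustration, not part of the API.

```typescript
type FinishReason = "stop" | "length" | "content_filter" | "tool_calls";

// Hypothetical mapping from finish_reason to what the caller does next.
function nextAction(reason: FinishReason): "done" | "truncated" | "filtered" | "run_tools" {
  switch (reason) {
    case "stop":
      return "done"; // complete answer, show it as-is
    case "length":
      return "truncated"; // hit max_tokens; consider raising the limit or continuing
    case "content_filter":
      return "filtered"; // response was filtered; surface a fallback message
    case "tool_calls":
      return "run_tools"; // execute the requested tools and send results back
  }
}

console.log(nextAction("length")); // "truncated"
```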
Tool calling
Tool calling follows the OpenAI chat completions format. Send function
definitions in tools. When the model returns finish_reason: "tool_calls",
execute the requested tool in your application, then send the result back as a
tool message with the same tool_call_id.
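import OpenAI from "openai";

// Client configured as in the full example on this page.
const client = new OpenAI({
  baseURL: "https://api.llmbase.ai/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

// Function tool definition in the OpenAI-compatible format.
const weatherTool = {
  type: "function",
  function: {
    name: "get_current_weather",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["city", "unit"],
    },
  },
} as const;

// First call: the model may answer directly or request a tool.
const first = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "What is the weather in Berlin?" }],
  tools: [weatherTool],
});

const toolCall = first.choices[0]?.message.tool_calls?.[0];
if (toolCall) {
  // Second call: replay the conversation, the assistant's tool request,
  // and the tool result with the matching tool_call_id.
  const final = await client.chat.completions.create({
    model: "deepseek/deepseek-v4-flash",
    messages: [
      { role: "user", content: "What is the weather in Berlin?" },
      first.choices[0].message,
      {
        role: "tool",
        tool_call_id: toolCall.id,
        // Hard-coded result for illustration; a real application runs the tool here.
        content: JSON.stringify({ city: "Berlin", temperature: 14, unit: "celsius" }),
      },
    ],
    tools: [weatherTool],
  });
  console.log(final.choices[0]?.message.content);
}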
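The step "execute the requested tool in your application" can be sketched as a small dispatcher that turns one tool call into the matching tool message. The tool-call shape follows the OpenAI chat completions format, where function.arguments is a JSON string. `runToolCall`, the handler map, and the stubbed weather result are hypothetical.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Hypothetical dispatcher: looks up the handler, parses the JSON arguments,
// and always returns a tool message carrying the original tool_call_id.
function runToolCall(call: ToolCall, handlers: Map<string, ToolHandler>): ToolMessage {
  const reply = (payload: unknown): ToolMessage => ({
    role: "tool",
    tool_call_id: call.id,
    content: JSON.stringify(payload),
  });

  const handler = handlers.get(call.function.name);
  if (!handler) return reply({ error: `unknown tool: ${call.function.name}` });

  let args: Record<string, unknown>;
  try {
    args = JSON.parse(call.function.arguments) as Record<string, unknown>;
  } catch {
    return reply({ error: "arguments were not valid JSON" });
  }
  return reply(handler(args));
}

// Stub handler with a fixed temperature, for illustration only.
const handlers = new Map<string, ToolHandler>([
  ["get_current_weather", (args) => ({ city: args.city, temperature: 14, unit: args.unit })],
]);

const toolMessage = runToolCall(
  {
    id: "call_1",
    type: "function",
    function: { name: "get_current_weather", arguments: '{"city":"Berlin","unit":"celsius"}' },
  },
  handlers,
);
console.log(toolMessage.content);
```

Returning an error payload as the tool result, rather than throwing, lets the model see what went wrong and respond to it; whether that suits a given application is a design choice.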
Structured outputs
Use response_format when downstream code needs JSON output. LLMBase supports
the OpenAI-compatible json_object and json_schema request shapes on models
that advertise that feature.
{
"model": "deepseek/deepseek-v3.2",
"messages": [
{ "role": "user", "content": "Return three product risks as JSON." }
],
"response_format": { "type": "json_object" }
}
For stricter validation, send a JSON Schema:
{
"model": "deepseek/deepseek-v3.2",
"messages": [
{ "role": "user", "content": "Extract the company name and priority." }
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "ticket",
"schema": {
"type": "object",
"properties": {
"company": { "type": "string" },
"priority": { "type": "string", "enum": ["low", "medium", "high"] }
},
"required": ["company", "priority"],
"additionalProperties": false
}
}
}
}
Current production models that advertise both json_mode and
structured_outputs include deepseek/deepseek-v4-flash,
deepseek/deepseek-v4-pro, deepseek/deepseek-v3.2,
google/gemma-4-26b-a4b-it, z-ai/glm-5.1, z-ai/glm-5,
moonshotai/kimi-k2.6, and moonshotai/kimi-k2.5. Check
/v1/models?metadata=true for the live list before routing production traffic.
If the selected model cannot satisfy the requested feature set, LLMBase returns
400 with invalid_request_error instead of running an incompatible request.
Inspect supported_features in /v1/model-metadata when you need to choose
models programmatically.
Picking JSON mode vs JSON Schema
Use json_object when your application can validate or repair the final shape
itself. Use json_schema when missing fields or wrong enum values should be
treated as model errors.
Structured output is still model generation, not a database constraint. Keep schemas compact, include all required fields, and validate the returned JSON in your application before writing to production systems.
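A minimal sketch of application-side validation for the ticket schema shown earlier, written without a schema library. `parseTicket` is a hypothetical helper; a production system might prefer a JSON Schema validator.

```typescript
interface Ticket {
  company: string;
  priority: "low" | "medium" | "high";
}

const PRIORITIES = ["low", "medium", "high"] as const;

// Parses the model's JSON string and checks it against the ticket schema:
// required fields, enum values, and no additional properties.
function parseTicket(raw: string): Ticket {
  const value: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("ticket must be a JSON object");
  }
  const record = value as Record<string, unknown>;

  const extra = Object.keys(record).filter((k) => k !== "company" && k !== "priority");
  if (extra.length > 0) throw new Error(`unexpected fields: ${extra.join(", ")}`);

  const { company, priority } = record;
  if (typeof company !== "string") throw new Error("company must be a string");

  const matched = PRIORITIES.find((p) => p === priority);
  if (matched === undefined) throw new Error("priority must be low, medium, or high");

  return { company, priority: matched };
}

console.log(parseTicket('{"company":"Acme","priority":"high"}'));
```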
Reasoning and thinking control
Models that advertise reasoning can expose the model’s thinking trace as
reasoning_content on the assistant message. LLMBase also includes a
compatibility alias named reasoning when upstream providers return that field.
Use reasoning_effort for portable effort control when the selected model lists
that parameter in supported_parameters:
{
"model": "deepseek/deepseek-v4-pro",
"messages": [
{ "role": "user", "content": "Compare these two migration plans." }
],
"reasoning_effort": "high"
}
Some model families also expose thinking through chat template flags. LLMBase
forwards the safe, portable flags below from extra_body.chat_template_kwargs
to supported model routes:
{
"model": "google/gemma-4-26b-a4b-it",
"messages": [
{ "role": "user", "content": "Solve 17 * 23 and show the final answer." }
],
"extra_body": {
"chat_template_kwargs": {
"enable_thinking": true,
"thinking": true,
"preserve_thinking": true
}
}
}
Typical non-streaming responses include the final answer in content and the
reasoning trace separately:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "17 * 23 = 391.",
"reasoning_content": "Compute 17 * 20 = 340 and 17 * 3 = 51, then add them.",
"reasoning": "Compute 17 * 20 = 340 and 17 * 3 = 51, then add them."
}
}
]
}
Reasoning support is model-dependent. If your request depends on reasoning,
choose a model whose metadata includes supported_features: ["reasoning"] and
supported_parameters containing reasoning_effort.
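Because the trace may arrive as reasoning_content, as the reasoning alias, or not at all, client code can read it defensively. The sketch below assumes the message shape from the example response above; `splitReasoning` is a hypothetical helper.

```typescript
interface AssistantMessage {
  role: "assistant";
  content: string | null;
  reasoning_content?: string;
  reasoning?: string;
}

// Prefer reasoning_content, fall back to the reasoning alias, else null.
function splitReasoning(message: AssistantMessage): { answer: string; reasoning: string | null } {
  return {
    answer: message.content ?? "",
    reasoning: message.reasoning_content ?? message.reasoning ?? null,
  };
}

const withTrace = splitReasoning({
  role: "assistant",
  content: "17 * 23 = 391.",
  reasoning_content: "Compute 17 * 20 = 340 and 17 * 3 = 51, then add them.",
});
console.log(withTrace.answer);
console.log(withTrace.reasoning);
```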
Streaming
Set "stream": true to receive a stream of Server-Sent Events. Each event before the terminating [DONE] marker carries a JSON-encoded chat.completion.chunk.
curl https://api.llmbase.ai/v1/chat/completions \
-H "Authorization: Bearer $LLMBASE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek/deepseek-v4-flash",
"messages": [{ "role": "user", "content": "Count to 5." }],
"stream": true
}'
The stream is a sequence of data: {...} lines, terminated by data: [DONE]:
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"deepseek/deepseek-v4-flash","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"deepseek/deepseek-v4-flash","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"deepseek/deepseek-v4-flash","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":9,"total_tokens":19}}
data: [DONE]
Token usage is included in the final chunk of every stream.
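For clients that do not use an SDK, the wire format above can be parsed by hand. The sketch below processes a complete response body held in a string. `parseSseBody` and the chunk interface are hypothetical, and a real client reading from the network would also need to buffer partial lines, since network reads do not align with line boundaries.

```typescript
interface StreamChunk {
  choices?: { delta?: { content?: string }; finish_reason?: string | null }[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

interface StreamResult {
  text: string;
  finishReason: string | null;
  usage: StreamChunk["usage"];
}

// Accumulates delta content from data: lines until the [DONE] marker.
function parseSseBody(body: string): StreamResult {
  const result: StreamResult = { text: "", finishReason: null, usage: undefined };
  for (const rawLine of body.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and other SSE fields
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload) as StreamChunk;
    const choice = chunk.choices?.[0];
    result.text += choice?.delta?.content ?? "";
    if (choice?.finish_reason) result.finishReason = choice.finish_reason;
    if (chunk.usage) result.usage = chunk.usage; // reported on the final chunk
  }
  return result;
}

const sampleBody = [
  'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
  "",
  'data: {"choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}',
  "",
  'data: {"choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}]}',
  "",
  'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":9,"total_tokens":19}}',
  "",
  "data: [DONE]",
].join("\n");

console.log(parseSseBody(sampleBody));
```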
Streaming with the OpenAI SDK
// Assumes `client` is the OpenAI SDK client configured in the full example.
const stream = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Write a short poem." }],
  stream: true,
});

// Print each content delta as it arrives.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
Full example
import OpenAI from "openai";

// Point the OpenAI SDK at LLMBase by overriding the base URL and API key.
const client = new OpenAI({
  baseURL: "https://api.llmbase.ai/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "Explain recursion in one sentence." },
  ],
  temperature: 0.7,
  max_tokens: 100,
});

console.log(response.choices[0].message.content);
// → "Recursion is when a function calls itself until a base condition is met."
Error responses
Errors are returned as JSON with an error object:
{
"error": {
"message": "Model not found: unknown/model",
"type": "invalid_request_error"
}
}
| HTTP status | Meaning |
|---|---|
| 400 | Bad request: missing or invalid fields |
| 401 | Authentication failed: check your API key |
| 404 | Model not found |
| 502 | LLMBase model backend unavailable: retry with backoff |
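The "retry with backoff" advice for 502 can be sketched as below. The delay values, attempt count, and helper names are assumptions chosen for illustration; this page does not prescribe a retry schedule. The `send` callback stands in for whatever function performs the HTTP request.

```typescript
// Per the table above, only 502 is described as retryable.
function isRetryable(status: number): boolean {
  return status === 502;
}

// Exponential backoff: base * 2^attempt, capped. Values are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Hypothetical wrapper: calls send() up to maxAttempts times,
// sleeping between attempts while the status is retryable.
async function withRetry<T>(
  send: () => Promise<{ status: number; body: T }>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<{ status: number; body: T }> {
  let response = await send();
  for (let attempt = 0; attempt < maxAttempts - 1 && isRetryable(response.status); attempt++) {
    await sleep(backoffDelayMs(attempt));
    response = await send();
  }
  return response;
}

console.log([0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n)));
// [500, 1000, 2000, 4000, 8000, 8000]
```

Errors such as 400, 401, and 404 are returned to the caller immediately, since repeating the same request would not change the outcome.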