LLMBase | Docs

Inference

Chat completions

Full reference for POST /v1/chat/completions — the core inference endpoint.


POST https://api.llmbase.ai/v1/chat/completions

The chat completions endpoint follows the OpenAI Chat API. OpenAI SDK chat clients can use it by changing the base URL, API key, and model ID.

This page documents the direct inference API. To configure OpenClaw, Hermes, or another external agent to use a Pro chat subscription, use https://llmbase.ai/api/v1/agents/chat/completions with an llmbase_chat_... key instead. See Agent integrations.

Request body

Required fields

| Field | Type | Description |
| --- | --- | --- |
| model | string | Model ID, e.g. "deepseek/deepseek-v4-flash". See Models. |
| messages | array | Conversation history. At least one message is required. |

Messages

Each message is an object with a role and content. Assistant messages can also include tool_calls, and tool results use role: "tool" with the matching tool_call_id.

[
  { "role": "system",    "content": "You are a helpful assistant." },
  { "role": "user",      "content": "Summarise this article: ..." },
  { "role": "assistant", "content": "Here is a summary: ..." },
  { "role": "user",      "content": "Make it shorter." }
]

Roles:

| Role | Description |
| --- | --- |
| system | Sets the behaviour and persona of the assistant |
| user | A message from the end user |
| assistant | A previous response from the model (for multi-turn conversations) |
| tool | Result returned by your application after executing a model-requested tool |

User messages support multimodal content (text + images) as an array of parts:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What is in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } }
  ]
}
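A small helper can make the array form harder to get wrong. The sketch below builds a user message from a prompt and a list of image URLs. The function name userMessageWithImages and the type names are illustrative assumptions, not part of LLMBase or any SDK; only the part shapes come from the example above.

```typescript
// Content part shapes, copied from the multimodal example above.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: one text part followed by one image part per URL.
function userMessageWithImages(text: string, imageUrls: string[]): UserMessage {
  const images = imageUrls.map(
    (url): ImagePart => ({ type: "image_url", image_url: { url } }),
  );
  return { role: "user", content: [{ type: "text", text }, ...images] };
}

const message = userMessageWithImages("What is in this image?", [
  "https://example.com/image.jpg",
]);
console.log(JSON.stringify(message, null, 2));
```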

Optional parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| stream | boolean | false | Stream tokens as SSE. See Streaming. |
| temperature | number | model default | Sampling temperature, 0 to 2. Lower values are more deterministic. |
| top_p | number | model default | Nucleus sampling probability mass, 0 to 1. |
| top_k | number | model default | Limit sampling to the top K tokens when supported by the selected model. |
| min_p | number | model default | Minimum probability sampling cutoff when supported. |
| max_tokens | integer | model max | Maximum tokens to generate. |
| frequency_penalty | number | 0 | Penalises tokens in proportion to how often they have already appeared. Range -2.0 to 2.0. |
| presence_penalty | number | 0 | Penalises tokens that have appeared at all. Range -2.0 to 2.0. |
| repetition_penalty | number | model default | Provider-supported repetition penalty. |
| stop | string or string[] | | Up to 4 sequences where generation stops. |
| seed | integer | | Fixed seed for deterministic outputs (best-effort). |
| logprobs | boolean | | Return output-token log probabilities on models that expose them. |
| top_logprobs | integer | | Return the most likely token alternatives for each generated token. Requires logprob support. |
| response_format | object | | Request JSON output. See Structured outputs. |
| reasoning_effort | string | | Portable reasoning-effort hint on models that advertise reasoning. Common values are low, medium, and high. |
| prompt_cache_key | string | | LLMBase prompt-cache namespace. Cached prompt tokens are billed at the model’s cache-read price when supported and reported. |
| tools | array | | OpenAI-compatible function tool definitions. |
| tool_choice | string or object | auto | auto, none, required, or a specific function tool. |

LLMBase documents the portable OpenAI-compatible fields it supports directly. Vendor-specific routing fields such as provider, route, models, plugins, debug, service_tier, and top-level cache_control are not part of the LLMBase direct inference contract. Route by choosing an LLMBase model ID; use prompt_cache_key for prompt-cache grouping.
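Checking the documented ranges before sending a request can turn a 400 into a local error message. The following is a minimal sketch of such a check for a few of the parameters above. The function name checkSamplingOptions is hypothetical, and whether the server enforces exactly these bounds is an assumption; the ranges themselves are taken from the table.

```typescript
// Subset of the optional parameters documented above.
interface SamplingOptions {
  temperature?: number;
  top_p?: number;
  frequency_penalty?: number;
  presence_penalty?: number;
  stop?: string | string[];
}

// Returns human-readable problems; an empty array means every
// provided value is inside its documented range.
function checkSamplingOptions(opts: SamplingOptions): string[] {
  const problems: string[] = [];
  const inRange = (name: string, value: number | undefined, min: number, max: number): void => {
    if (value !== undefined && (Number.isNaN(value) || value < min || value > max)) {
      problems.push(`${name} must be between ${min} and ${max}`);
    }
  };
  inRange("temperature", opts.temperature, 0, 2);
  inRange("top_p", opts.top_p, 0, 1);
  inRange("frequency_penalty", opts.frequency_penalty, -2, 2);
  inRange("presence_penalty", opts.presence_penalty, -2, 2);

  // stop may be a single string or an array of at most 4 strings.
  const stop = opts.stop;
  const stops = stop === undefined ? [] : Array.isArray(stop) ? stop : [stop];
  if (stops.length > 4) problems.push("stop accepts at most 4 sequences");
  return problems;
}

const okProblems = checkSamplingOptions({ temperature: 0.2, stop: ["\n\n"] });
const badProblems = checkSamplingOptions({ temperature: 3, top_p: 1.5 });
console.log(okProblems, badProblems);
```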

Choosing request options

Start with the smallest request surface that solves the job:

| Goal | Recommended fields |
| --- | --- |
| Chat UI | messages, stream: true, max_tokens |
| Backend extraction | response_format, low temperature, max_tokens |
| Tool-using agent | tools, tool_choice, prompt_cache_key |
| Confidence scoring | logprobs, top_logprobs on a model that supports logprobs |
| Cost control | max_tokens, prompt_cache_key, model pricing from /v1/model-metadata |

If a request needs a model-specific capability, choose the model from /v1/model-metadata first. LLMBase rejects unsupported combinations instead of silently ignoring required features.
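Choosing a model by capability can be sketched as a filter over the metadata entries. The entry shape below is an assumption: this page names supported_features and supported_parameters, but does not show the full response envelope of /v1/model-metadata, so adapt the types to the real payload. The model IDs in the sample are invented.

```typescript
// Assumed shape of one metadata entry. Only the field names
// supported_features and supported_parameters appear on this page.
interface ModelMetadata {
  id: string;
  supported_features: string[];
  supported_parameters: string[];
}

// Returns the IDs of models that advertise every required feature
// and accept every required parameter.
function modelsSupporting(
  models: ModelMetadata[],
  features: string[],
  parameters: string[] = [],
): string[] {
  return models
    .filter(
      (m) =>
        features.every((f) => m.supported_features.includes(f)) &&
        parameters.every((p) => m.supported_parameters.includes(p)),
    )
    .map((m) => m.id);
}

// Made-up sample data; fetch the live list in practice.
const sampleModels: ModelMetadata[] = [
  {
    id: "example/model-a",
    supported_features: ["json_mode", "structured_outputs", "reasoning"],
    supported_parameters: ["temperature", "reasoning_effort"],
  },
  {
    id: "example/model-b",
    supported_features: ["json_mode"],
    supported_parameters: ["temperature"],
  },
];

const reasoningModels = modelsSupporting(sampleModels, ["reasoning"], ["reasoning_effort"]);
console.log(reasoningModels);
```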

Smart fallback

POST /v1/chat/completions uses smart fallback by default. If the requested model is temporarily unavailable, LLMBase may serve the request with an eligible fallback model instead of returning a backend error.

Fallback is capability-safe and price-aware. Requests with images, tools, or response_format fall back only to models that support the same required modalities and features. When fallback applies, LLMBase bills the actual token usage at the price of whichever model is cheaper: the requested model or the served fallback model.

Responses include fallback metadata in HTTP headers:

| Header | Description |
| --- | --- |
| x-llmbase-requested-model | Model ID sent in the request |
| x-llmbase-served-model | Model ID that generated the response |
| x-llmbase-fallback-applied | true when fallback served the request, otherwise false |
| x-llmbase-fallback-reason | Reason fallback was used. Present only when fallback applies |
| x-llmbase-fallback-chain | Comma-separated model chain. Present only when fallback applies |

To require the requested model and hard-fail instead of falling back, send:

X-LLMBase-Fallback: off

The initial smart fallback rollout covers /v1/chat/completions.
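Reading the fallback headers might look like the sketch below. It accepts anything with a get(name) method, so a fetch response's headers object should fit; the helper name readFallbackInfo, the result field names, and the sample header values are assumptions made for illustration.

```typescript
// Minimal structural type: a fetch Headers object satisfies it,
// and so does a simple test double.
interface HeaderReader {
  get(name: string): string | null;
}

interface FallbackInfo {
  requestedModel: string | null;
  servedModel: string | null;
  applied: boolean;
  reason: string | null;
  chain: string[];
}

function readFallbackInfo(headers: HeaderReader): FallbackInfo {
  const chain = headers.get("x-llmbase-fallback-chain");
  return {
    requestedModel: headers.get("x-llmbase-requested-model"),
    servedModel: headers.get("x-llmbase-served-model"),
    applied: headers.get("x-llmbase-fallback-applied") === "true",
    reason: headers.get("x-llmbase-fallback-reason"),
    // The chain is documented as comma-separated; absent means no fallback.
    chain: chain ? chain.split(",").map((id) => id.trim()).filter((id) => id.length > 0) : [],
  };
}

// Test double standing in for response.headers. All values are invented.
const sampleHeaderValues = new Map<string, string>([
  ["x-llmbase-requested-model", "example/model-a"],
  ["x-llmbase-served-model", "example/model-b"],
  ["x-llmbase-fallback-applied", "true"],
  ["x-llmbase-fallback-reason", "example-reason"],
  ["x-llmbase-fallback-chain", "example/model-a, example/model-b"],
]);
const fakeHeaders: HeaderReader = {
  get: (name) => sampleHeaderValues.get(name.toLowerCase()) ?? null,
};

const fallbackInfo = readFallbackInfo(fakeHeaders);
console.log(fallbackInfo);
```

Logging servedModel alongside each response is a cheap way to notice when evaluation results come from a different model than the one requested.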

Prompt caching

LLMBase applies prompt_cache_key on models that support OpenAI-compatible prompt caching. Use a stable key per application, user, workspace, or long-lived agent conversation when the beginning of the prompt stays mostly the same.

Prompt caching is automatic on supported models after the cache is warm. The first request normally writes or warms the cache and is billed as regular input. Later requests with the same stable prompt prefix can return usage.prompt_tokens_details.cached_tokens. Those cached prompt tokens are then billed at the model’s input_cache_read price when that price is configured.

Subscription-agent requests use the same prompt-cache behavior. LLMBase adds a stable prompt_cache_key for agent calls when the client does not provide one, so compatible models can reuse repeated long prompt prefixes.

{
  "model": "deepseek/deepseek-v4-flash",
  "prompt_cache_key": "workspace-123-agent",
  "messages": [
    { "role": "system", "content": "You are a coding agent for this repository..." },
    { "role": "user", "content": "Continue from the last task." }
  ]
}

Keep the cacheable prefix stable. Put changing instructions, timestamps, temporary files, and user questions later in the message list so repeated requests can reuse the cached prefix. Short prompts may not be cached by the selected model.

Prompt layout for cache hits

Good cache candidates are long, repeated prefixes:

[
  { "role": "system", "content": "Long product policy, coding rules, or workspace instructions..." },
  { "role": "user", "content": "Current short task goes here." }
]

Avoid putting timestamps, request IDs, random examples, or the latest user message before a large static prompt. If the first part of the prompt changes on every request, the selected model cannot reuse the cached prefix.
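One way to keep the layout consistent is to assemble messages through a single function that always places static instructions first and per-request details last. This is a sketch under that assumption; buildCacheFriendlyMessages and its parameters are hypothetical names, and whether a given prompt is actually cached still depends on the selected model.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Static instructions go first so the prefix is identical across requests.
// Anything that changes per request (task, timestamp) is appended afterwards.
function buildCacheFriendlyMessages(
  staticInstructions: string,
  history: ChatMessage[],
  currentTask: string,
  now: Date,
): ChatMessage[] {
  return [
    { role: "system", content: staticInstructions },
    ...history,
    { role: "user", content: `${currentTask}\n\n(Request time: ${now.toISOString()})` },
  ];
}

const policy = "Long product policy, coding rules, or workspace instructions...";
const firstRequest = buildCacheFriendlyMessages(policy, [], "Fix the failing test.", new Date(0));
const secondRequest = buildCacheFriendlyMessages(policy, [], "Now update the docs.", new Date(60000));

// The leading system message is the same in both requests.
console.log(firstRequest[0].content === secondRequest[0].content);
```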

Cache usage and billing

The response usage object reports cache hits:

{
  "usage": {
    "prompt_tokens": 4096,
    "completion_tokens": 256,
    "total_tokens": 4352,
    "prompt_tokens_details": {
      "cached_tokens": 3072
    }
  }
}

If cached_tokens is absent or 0, the request did not receive a cache hit. If cached tokens are reported but the selected model has no input_cache_read price, LLMBase bills those tokens at the normal input-token price.

Estimating request cost

Use the model metadata endpoint to read token prices before you run a workload:

curl https://api.llmbase.ai/v1/model-metadata \
  -H "Authorization: Bearer $LLMBASE_API_KEY"

For a rough estimate, with input_tokens counting only the prompt tokens that were not read from cache:

cost = input_tokens * pricing.prompt
     + cached_input_tokens * pricing.input_cache_read
     + output_tokens * pricing.completion

When pricing.input_cache_read is missing, cached input tokens use the normal input-token price.
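The estimate can be sketched as a function over a response usage object. Two assumptions are baked in and should be checked against real metadata: cached_tokens is treated as a subset of prompt_tokens (as in the OpenAI usage format), and prices are treated as per-token values in a single unit. The function name estimateCost and the sample prices are invented.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Assumed to be per-token prices; the real unit comes from /v1/model-metadata.
interface Pricing {
  prompt: number;
  completion: number;
  input_cache_read?: number;
}

function estimateCost(usage: Usage, pricing: Pricing): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  // Assumption: cached tokens are already counted inside prompt_tokens.
  const uncached = usage.prompt_tokens - cached;
  // Without a cache-read price, cached tokens cost the normal input price.
  const cacheReadPrice = pricing.input_cache_read ?? pricing.prompt;
  return (
    uncached * pricing.prompt +
    cached * cacheReadPrice +
    usage.completion_tokens * pricing.completion
  );
}

// Usage numbers from the cache example above; prices are made-up integers.
const exampleUsage: Usage = {
  prompt_tokens: 4096,
  completion_tokens: 256,
  prompt_tokens_details: { cached_tokens: 3072 },
};
const withCachePrice = estimateCost(exampleUsage, { prompt: 2, completion: 4, input_cache_read: 1 });
const withoutCachePrice = estimateCost(exampleUsage, { prompt: 2, completion: 4 });
console.log(withCachePrice, withoutCachePrice); // 6144 9216
```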

Non-streaming response

{
  "id": "chatcmpl-a1b2c3d4e5f6a1b2c3d4e5f6",
  "object": "chat.completion",
  "created": 1741000000,
  "model": "deepseek/deepseek-v4-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I can help you with a wide range of tasks..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 48,
    "total_tokens": 60,
    "prompt_tokens_details": {
      "cached_tokens": 8
    }
  }
}

finish_reason values

| Value | Meaning |
| --- | --- |
| stop | Model finished naturally |
| length | max_tokens limit reached |
| content_filter | Response was filtered |
| tool_calls | Model called one or more tools |
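Application code usually branches on finish_reason. The sketch below maps each documented value to a next step; the step names are illustrative choices, not API values.

```typescript
type FinishReason = "stop" | "length" | "content_filter" | "tool_calls";
type NextStep = "done" | "continue_or_raise_max_tokens" | "show_filtered_notice" | "run_tools";

// Exhaustive switch: adding a new FinishReason without handling it
// becomes a compile-time error because a code path would return nothing.
function nextStep(reason: FinishReason): NextStep {
  switch (reason) {
    case "stop":
      return "done";
    case "length":
      return "continue_or_raise_max_tokens";
    case "content_filter":
      return "show_filtered_notice";
    case "tool_calls":
      return "run_tools";
  }
}

console.log(nextStep("length"));
```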

Tool calling

Tool calling follows the OpenAI chat completions format. Send function definitions in tools. When the model returns finish_reason: "tool_calls", execute the requested tool in your application, then send the result back as a tool message with the same tool_call_id.

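import OpenAI from "openai";

// Same client setup as in the full example on this page.
const client = new OpenAI({
  baseURL: "https://api.llmbase.ai/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

// Function tool definition in the OpenAI-compatible format.
const weatherTool = {
  type: "function",
  function: {
    name: "get_current_weather",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["city", "unit"],
    },
  },
} as const;

// First call: the model decides whether to request the tool.
const first = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "What is the weather in Berlin?" }],
  tools: [weatherTool],
});

const toolCall = first.choices[0]?.message.tool_calls?.[0];
if (toolCall) {
  // Second call: replay the conversation, the assistant message that
  // requested the tool, and the tool result. The result is hardcoded here;
  // a real application would run the tool with the arguments the model supplied.
  const final = await client.chat.completions.create({
    model: "deepseek/deepseek-v4-flash",
    messages: [
      { role: "user", content: "What is the weather in Berlin?" },
      first.choices[0].message,
      {
        role: "tool",
        tool_call_id: toolCall.id, // must match the id of the tool call being answered
        content: JSON.stringify({ city: "Berlin", temperature: 14, unit: "celsius" }),
      },
    ],
    tools: [weatherTool],
  });

  console.log(final.choices[0]?.message.content);
}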
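The step of executing the requested tool can be sketched as a small dispatcher that parses the JSON arguments, runs a local handler, and returns a tool message. The ToolCall shape follows the OpenAI chat format; the handler registry, the runToolCall name, and the canned weather result are assumptions for illustration.

```typescript
// One entry of message.tool_calls in the OpenAI chat format.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

type ToolHandler = (args: Record<string, unknown>) => unknown;
type ToolMessage = { role: "tool"; tool_call_id: string; content: string };

// Hypothetical local implementation; a real one would call a weather service.
const toolHandlers = new Map<string, ToolHandler>([
  ["get_current_weather", (args) => ({ city: args["city"], temperature: 14, unit: args["unit"] })],
]);

function runToolCall(call: ToolCall): ToolMessage {
  let content: string;
  try {
    const handler = toolHandlers.get(call.function.name);
    if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
    // Model-generated arguments may be malformed JSON, so parsing can throw.
    content = JSON.stringify(handler(JSON.parse(call.function.arguments)));
  } catch (err) {
    // Report the failure back to the model instead of crashing the loop.
    content = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
  }
  return { role: "tool", tool_call_id: call.id, content };
}

const weatherResult = runToolCall({
  id: "call_1",
  type: "function",
  function: { name: "get_current_weather", arguments: '{"city":"Berlin","unit":"celsius"}' },
});
console.log(weatherResult);
```

Returning errors as tool content gives the model a chance to correct its arguments on the next turn, which is a design choice rather than a requirement of the API.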

Structured outputs

Use response_format when downstream code needs JSON output. LLMBase supports the OpenAI-compatible json_object and json_schema request shapes on models that advertise that feature.

{
  "model": "deepseek/deepseek-v3.2",
  "messages": [
    { "role": "user", "content": "Return three product risks as JSON." }
  ],
  "response_format": { "type": "json_object" }
}

For stricter validation, send a JSON Schema:

{
  "model": "deepseek/deepseek-v3.2",
  "messages": [
    { "role": "user", "content": "Extract the company name and priority." }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "ticket",
      "schema": {
        "type": "object",
        "properties": {
          "company": { "type": "string" },
          "priority": { "type": "string", "enum": ["low", "medium", "high"] }
        },
        "required": ["company", "priority"],
        "additionalProperties": false
      }
    }
  }
}

Current production models that advertise both json_mode and structured_outputs include deepseek/deepseek-v4-flash, deepseek/deepseek-v4-pro, deepseek/deepseek-v3.2, google/gemma-4-26b-a4b-it, z-ai/glm-5.1, z-ai/glm-5, moonshotai/kimi-k2.6, and moonshotai/kimi-k2.5. Check /v1/models?metadata=true for the live list before routing production traffic.

If the selected model cannot satisfy the requested feature set, LLMBase returns 400 with invalid_request_error instead of running an incompatible request. Inspect supported_features in /v1/model-metadata when you need to choose models programmatically.

Picking JSON mode vs JSON Schema

Use json_object when your application can validate or repair the final shape itself. Use json_schema when missing fields or wrong enum values should be treated as model errors.

Structured output is still model generation, not a database constraint. Keep schemas compact, include all required fields, and validate the returned JSON in your application before writing to production systems.
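Validating the returned JSON might look like the following hand-written check for the ticket schema shown earlier. It is a sketch, not a full JSON Schema validator; the parseTicket name is hypothetical, and a schema validation library is a reasonable alternative for larger schemas.

```typescript
const PRIORITIES = ["low", "medium", "high"] as const;
type Priority = (typeof PRIORITIES)[number];

interface Ticket {
  company: string;
  priority: Priority;
}

function isPriority(value: unknown): value is Priority {
  return typeof value === "string" && (PRIORITIES as readonly string[]).includes(value);
}

// Parses the assistant message content and enforces the ticket shape:
// both fields required, enum respected, no additional properties.
function parseTicket(raw: string): Ticket {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("expected a JSON object");
  }
  const record = data as Record<string, unknown>;
  const company = record["company"];
  const priority = record["priority"];
  if (typeof company !== "string") throw new Error("company must be a string");
  if (!isPriority(priority)) throw new Error("priority must be low, medium, or high");
  const extra = Object.keys(record).filter((key) => key !== "company" && key !== "priority");
  if (extra.length > 0) throw new Error(`unexpected fields: ${extra.join(", ")}`);
  return { company, priority };
}

const ticket = parseTicket('{"company":"Acme","priority":"high"}');
console.log(ticket);
```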

Reasoning and thinking control

Models that advertise reasoning can expose the model’s thinking trace as reasoning_content on the assistant message. LLMBase also includes a compatibility alias named reasoning when upstream providers return that field.

Use reasoning_effort for portable effort control when the selected model lists that parameter in supported_parameters:

{
  "model": "deepseek/deepseek-v4-pro",
  "messages": [
    { "role": "user", "content": "Compare these two migration plans." }
  ],
  "reasoning_effort": "high"
}

Some model families also expose thinking through chat template flags. LLMBase forwards the safe, portable flags below from extra_body.chat_template_kwargs to supported model routes:

{
  "model": "google/gemma-4-26b-a4b-it",
  "messages": [
    { "role": "user", "content": "Solve 17 * 23 and show the final answer." }
  ],
  "extra_body": {
    "chat_template_kwargs": {
      "enable_thinking": true,
      "thinking": true,
      "preserve_thinking": true
    }
  }
}

Typical non-streaming responses include the final answer in content and the reasoning trace separately:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "17 * 23 = 391.",
        "reasoning_content": "Compute 17 * 20 = 340 and 17 * 3 = 51, then add them.",
        "reasoning": "Compute 17 * 20 = 340 and 17 * 3 = 51, then add them."
      }
    }
  ]
}

Reasoning support is model-dependent. If your request depends on reasoning, choose a model whose metadata includes supported_features: ["reasoning"] and supported_parameters containing reasoning_effort.
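Because the trace can arrive under either field name, client code may want one accessor that handles both. A minimal sketch, assuming the message shape shown in the response above; splitReasoning is a hypothetical helper name.

```typescript
interface AssistantMessage {
  role: "assistant";
  content: string | null;
  reasoning_content?: string | null;
  reasoning?: string | null; // compatibility alias
}

// Prefers reasoning_content and falls back to the reasoning alias.
// Returns null for the trace when the model exposed neither field.
function splitReasoning(message: AssistantMessage): { answer: string; reasoning: string | null } {
  return {
    answer: message.content ?? "",
    reasoning: message.reasoning_content ?? message.reasoning ?? null,
  };
}

const withTrace = splitReasoning({
  role: "assistant",
  content: "17 * 23 = 391.",
  reasoning_content: "Compute 17 * 20 = 340 and 17 * 3 = 51, then add them.",
});
const withoutTrace = splitReasoning({ role: "assistant", content: "Hello!" });
console.log(withTrace.reasoning, withoutTrace.reasoning);
```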

Streaming

Set "stream": true to receive a stream of Server-Sent Events. Each event is a JSON-encoded chat.completion.chunk.

curl https://api.llmbase.ai/v1/chat/completions \
  -H "Authorization: Bearer $LLMBASE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v4-flash",
    "messages": [{ "role": "user", "content": "Count to 5." }],
    "stream": true
  }'

The stream is a sequence of data: {...} lines, terminated by data: [DONE]:

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"deepseek/deepseek-v4-flash","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"deepseek/deepseek-v4-flash","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"deepseek/deepseek-v4-flash","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":9,"total_tokens":19}}

data: [DONE]

Token usage is included in the final chunk of every stream.
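Without an SDK, the stream can be consumed by reading data: lines until [DONE]. The sketch below works on text that has already been received in full, which keeps it easy to follow; a real client reading from the network would also need to buffer partial lines between reads. The collectStream name and result shape are assumptions, and the sample events are shortened versions of the ones above.

```typescript
interface StreamUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface StreamResult {
  content: string;
  finishReason: string | null;
  usage: StreamUsage | null;
}

// Accumulates delta content, the finish reason, and usage from SSE text.
function collectStream(sseText: string): StreamResult {
  const result: StreamResult = { content: "", finishReason: null, usage: null };
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank separator lines
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel, not JSON
    const chunk = JSON.parse(payload);
    const choice = chunk.choices?.[0];
    if (typeof choice?.delta?.content === "string") result.content += choice.delta.content;
    if (choice?.finish_reason) result.finishReason = choice.finish_reason;
    if (chunk.usage) result.usage = chunk.usage;
  }
  return result;
}

const sampleStream = [
  'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
  'data: {"choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}',
  'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":9,"total_tokens":19}}',
  "data: [DONE]",
].join("\n\n");

const collected = collectStream(sampleStream);
console.log(collected);
```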

Streaming with the OpenAI SDK

const stream = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Write a short poem." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

Full example

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmbase.ai/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user",   content: "Explain recursion in one sentence." },
  ],
  temperature: 0.7,
  max_tokens: 100,
});

console.log(response.choices[0].message.content);
// → "Recursion is when a function calls itself until a base condition is met."

Error responses

Errors are returned as JSON with an error object:

{
  "error": {
    "message": "Model not found: unknown/model",
    "type": "invalid_request_error"
  }
}

| HTTP status | Meaning |
| --- | --- |
| 400 | Bad request: missing or invalid fields |
| 401 | Authentication failed: check your API key |
| 404 | Model not found |
| 502 | Model backend unavailable and no eligible fallback was served. Retry with backoff |
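Retrying with backoff can be sketched as below. Only 502 is treated as retryable because it is the only status this page marks that way; the delay values, attempt count, and function names are illustrative assumptions. The request and the sleep function are passed in, so the loop itself has no dependency on a particular HTTP client or timer.

```typescript
// 400, 401, and 404 need a changed request, so retrying them is pointless.
function isRetryable(status: number): boolean {
  return status === 502;
}

// Exponential backoff without jitter: 500 ms, 1 s, 2 s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Calls `call` up to maxAttempts times, sleeping between retryable failures.
// Returns the last result, whether it succeeded or not.
async function withRetry<T>(
  call: () => Promise<{ status: number; body: T }>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 4,
): Promise<{ status: number; body: T }> {
  let result = await call();
  for (let attempt = 0; attempt < maxAttempts - 1 && isRetryable(result.status); attempt++) {
    await sleep(backoffDelayMs(attempt));
    result = await call();
  }
  return result;
}

console.log([0, 1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt)));
```

Adding random jitter to each delay is a common refinement when many clients may retry at the same moment.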