# Chat completions

Full reference for POST /v1/chat/completions — the core inference endpoint.
```
POST https://api.llmbase.ai/v1/chat/completions
```
The chat completions endpoint is compatible with the OpenAI Chat API. Any client that works with OpenAI will work here — change only the base URL and key.
## Request body

### Required fields
| Field | Type | Description |
|---|---|---|
| `model` | string | Model ID, e.g. `"zai-org/glm-5"`. See Models. |
| `messages` | array | Conversation history. At least one message is required. |
### Messages

Each message is an object with a `role` and `content`:
```json
[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": "Summarise this article: ..." },
  { "role": "assistant", "content": "Here is a summary: ..." },
  { "role": "user", "content": "Make it shorter." }
]
```
Roles:

| Role | Description |
|---|---|
| `system` | Sets the behaviour and persona of the assistant |
| `user` | A message from the end user |
| `assistant` | A previous response from the model (for multi-turn conversations) |
User messages support multimodal content (text + images) as an array of parts:
```json
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What is in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } }
  ]
}
```
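As a sketch, this content-part array can be assembled with a small helper; the `userMessage` function name is ours for illustration, not part of the API:

```typescript
// A content part is either text or an image URL, matching the schema above.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

// Build a multimodal user message from a prompt and zero or more image URLs.
function userMessage(text: string, imageUrls: string[] = []) {
  const content: ContentPart[] = [{ type: "text", text }];
  for (const url of imageUrls) {
    content.push({ type: "image_url", image_url: { url } });
  }
  return { role: "user" as const, content };
}

const msg = userMessage("What is in this image?", [
  "https://example.com/image.jpg",
]);
```

The resulting object can be placed directly into the `messages` array of a request.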
### Optional parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `stream` | boolean | false | Stream tokens as SSE. See Streaming. |
| `temperature` | number | model default | Sampling temperature, 0–2. Lower = more deterministic. |
| `top_p` | number | model default | Nucleus sampling probability mass, 0–1. |
| `max_tokens` | integer | model max | Maximum tokens to generate. |
| `frequency_penalty` | number | 0 | Penalises repeated tokens by frequency, -2.0 to 2.0. |
| `presence_penalty` | number | 0 | Penalises tokens that have appeared at all, -2.0 to 2.0. |
| `stop` | string or string[] | — | Up to 4 sequences where generation stops. |
| `seed` | integer | — | Fixed seed for deterministic outputs (best-effort). |
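A request body combining several of these parameters might look like the following sketch; the values are illustrative, not recommendations:

```typescript
// Illustrative request body using several optional parameters.
const body = {
  model: "zai-org/glm-5",
  messages: [{ role: "user", content: "List three colours." }],
  temperature: 0.2,   // low temperature for more deterministic output
  max_tokens: 64,     // cap the generated length
  stop: ["\n\n"],     // halt at the first blank line (max 4 sequences)
  seed: 42,           // best-effort reproducibility
};
```

This object is what you would pass as the JSON payload of the POST request (or as the argument to the SDK's `create` call).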
## Non-streaming response

```json
{
  "id": "chatcmpl-a1b2c3d4e5f6a1b2c3d4e5f6",
  "object": "chat.completion",
  "created": 1741000000,
  "model": "zai-org/glm-5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I can help you with a wide range of tasks..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 48,
    "total_tokens": 60
  }
}
```
### `finish_reason` values

| Value | Meaning |
|---|---|
| `stop` | Model finished naturally |
| `length` | `max_tokens` limit reached |
| `content_filter` | Response was filtered |
| `tool_calls` | Model called a tool (not yet supported) |
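Clients that branch on `finish_reason` can map each value to an action; a minimal sketch (function name and messages are ours, not part of the API):

```typescript
// Map a finish_reason value from the response to a human-readable outcome.
function describeFinish(reason: string): string {
  switch (reason) {
    case "stop":
      return "completed";
    case "length":
      return "truncated: raise max_tokens or ask the model to continue";
    case "content_filter":
      return "filtered";
    case "tool_calls":
      return "tool call requested (not yet supported)";
    default:
      return "unknown finish_reason: " + reason;
  }
}
```

The `length` case is the one most worth handling explicitly, since a silently truncated answer is easy to mistake for a complete one.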
## Streaming

Set `"stream": true` to receive a stream of Server-Sent Events. Each event is a JSON-encoded `chat.completion.chunk`.
```bash
curl https://api.llmbase.ai/v1/chat/completions \
  -H "Authorization: Bearer $LLMBASE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/glm-5",
    "messages": [{ "role": "user", "content": "Count to 5." }],
    "stream": true
  }'
```
The stream is a sequence of `data: {...}` lines, terminated by `data: [DONE]`:

```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"zai-org/glm-5","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"zai-org/glm-5","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1741000000,"model":"zai-org/glm-5","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":9,"total_tokens":19}}

data: [DONE]
```
Token usage is included in the final chunk of every stream.
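For clients that don't use an SDK, the `data:` lines can be parsed directly. The sketch below accumulates content deltas and picks up the final usage object; `collectStream` is an illustrative name, and a real client must also buffer partial lines arriving from the network rather than assuming whole lines:

```typescript
// Shape of the chunks shown above (only the fields this sketch reads).
interface Chunk {
  choices: {
    index: number;
    delta: { role?: string; content?: string };
    finish_reason: string | null;
  }[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

// Walk the SSE text line by line, concatenating deltas until [DONE].
function collectStream(sseText: string): { text: string; usage?: Chunk["usage"] } {
  let text = "";
  let usage: Chunk["usage"];
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ")) continue;      // skip blank lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break;               // end-of-stream sentinel
    const chunk: Chunk = JSON.parse(payload);
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) usage = chunk.usage;          // present on the final chunk
  }
  return { text, usage };
}
```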
### Streaming with the OpenAI SDK

```typescript
const stream = await client.chat.completions.create({
  model: "zai-org/glm-5",
  messages: [{ role: "user", content: "Write a short poem." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
## Full example

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmbase.ai/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "zai-org/glm-5",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "Explain recursion in one sentence." },
  ],
  temperature: 0.7,
  max_tokens: 100,
});

console.log(response.choices[0].message.content);
// → "Recursion is when a function calls itself until a base condition is met."
```
## Error responses

Errors are returned as JSON with an `error` object:

```json
{
  "error": {
    "message": "Model not found: unknown/model",
    "type": "invalid_request_error"
  }
}
```
| HTTP status | Meaning |
|---|---|
| 400 | Bad request — missing or invalid fields |
| 401 | Authentication failed — check your API key |
| 404 | Model not found |
| 502 | All upstream providers failed — retry with backoff |
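The retry-with-backoff advice for 502 can be sketched as a simple policy; the delay schedule here (500 ms base, doubling, capped at 8 s) is our illustrative choice, not something the API prescribes:

```typescript
// Only 502 is documented as retryable; 4xx errors indicate a bad request
// or credentials and will not succeed on retry.
function isRetryable(status: number): boolean {
  return status === 502;
}

// Exponential backoff: attempt 0 waits baseMs, each retry doubles, up to capMs.
function backoffMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

In a real client you would sleep for `backoffMs(attempt)` between attempts (ideally with jitter) and give up after a fixed number of retries.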