Dedicated Inference
Deploy open-source and custom AI models on isolated GPUs in European data centers. Managed endpoints with an OpenAI-compatible API.
Dedicated Inference gives you a fully managed inference endpoint on a GPU reserved exclusively for your workload. You get an OpenAI-compatible API URL — LLMBase handles the deployment, model loading, runtime, and hardware. There is no SSH access, no container configuration, and no infrastructure to manage.
Coming soon. Dedicated Inference endpoints will be created and managed from the LLMBase dashboard. Contact enterprise@llmbase.ai to get early access.
What you get
When you create a dedicated endpoint you receive a unique base URL:
https://<endpoint-id>-inference.llmbase.cloud
This endpoint exposes the standard OpenAI-compatible chat completions interface:
POST https://<endpoint-id>-inference.llmbase.cloud/v1/chat/completions
Use it with the OpenAI SDK, any OpenAI-compatible client, or plain HTTP — the same way you would use https://api.llmbase.ai/v1.
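```typescript
import OpenAI from "openai";

// Point the SDK at the dedicated endpoint instead of the shared API.
const client = new OpenAI({
  baseURL: "https://<endpoint-id>-inference.llmbase.cloud/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Hello!" }],
});

// The assistant's reply is on the first choice.
console.log(response.choices[0].message.content);
```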
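Because the dedicated endpoint and the shared API at https://api.llmbase.ai/v1 speak the same protocol, switching between them can come down to choosing a base URL. The following is a minimal sketch of that idea, not an LLMBase utility: `resolveBaseURL` and `SERVERLESS_BASE_URL` are hypothetical names, and the URL pattern is the one shown above.

```typescript
// Shared serverless API, as referenced in this page.
const SERVERLESS_BASE_URL = "https://api.llmbase.ai/v1";

// Hypothetical helper: returns the dedicated base URL when an endpoint ID is
// configured, and falls back to the serverless API otherwise.
function resolveBaseURL(endpointId?: string): string {
  const id = endpointId?.trim();
  return id ? `https://${id}-inference.llmbase.cloud/v1` : SERVERLESS_BASE_URL;
}
```

The result can be passed as `baseURL` when constructing the client, so the rest of the application code stays unchanged.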
Supported models
Dedicated endpoints support:
- Open-source foundation models — any model available in the LLMBase model catalog
- Your own fine-tuned models — bring weights you have fine-tuned on your own data
- Custom models — deploy a private model not in the public catalog
For custom and fine-tuned models, you provide the model weights (GGUF, safetensors, or vLLM-compatible format) during endpoint setup. LLMBase loads and serves them on the reserved GPU — you do not configure vLLM, Triton, or any inference runtime yourself.
How it differs from Serverless Inference
| | Dedicated Inference | Serverless Inference |
|---|---|---|
| GPU | Reserved exclusively for you | Shared across users |
| Pricing | Fixed hourly rate per GPU | Per input/output token |
| Rate limits | None — only your GPU capacity | Shared pool limits apply |
| Latency | Consistent, no cold starts | May vary under load |
| Custom models | Yes — bring your own weights | No |
| Setup | Endpoint created via dashboard | API key only |
| Best for | Steady high-throughput workloads | Spiky or experimental traffic |
Choose dedicated inference when you run continuous high-throughput workloads, need predictable latency, or want to deploy a custom or fine-tuned model. Choose Serverless Inference when you need to start quickly, have unpredictable traffic, or want pay-per-token billing.
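The pricing difference implies a break-even point: a fixed hourly GPU rate pays off once the same traffic would cost more under per-token billing. The sketch below shows that arithmetic under stated assumptions. The function and type names are hypothetical, and any rates you plug in should come from the pricing pages, since none are given here.

```typescript
// Per-token (serverless) rates, expressed per one million tokens.
interface ServerlessRates {
  inputPerMillionTokens: number;
  outputPerMillionTokens: number;
}

// Token volume processed in one hour.
interface HourlyUsage {
  inputTokens: number;
  outputTokens: number;
}

// What one hour of traffic would cost under per-token billing.
function serverlessCostPerHour(usage: HourlyUsage, rates: ServerlessRates): number {
  return (
    (usage.inputTokens / 1e6) * rates.inputPerMillionTokens +
    (usage.outputTokens / 1e6) * rates.outputPerMillionTokens
  );
}

// Dedicated is cheaper when the fixed hourly GPU rate is below the
// per-token cost of the same hourly traffic.
function dedicatedIsCheaper(
  usage: HourlyUsage,
  rates: ServerlessRates,
  gpuHourlyRate: number,
): boolean {
  return gpuHourlyRate < serverlessCostPerHour(usage, rates);
}
```

This compares cost only. It assumes the reserved GPU can actually sustain the given token volume, which depends on the model and GPU type.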
Isolation and compliance
Each dedicated endpoint runs on a GPU that is not shared with any other customer during its lifetime. Compute and networking resources are fully isolated. All endpoints are hosted in European data centers (Falkenstein, Germany and Helsinki, Finland) and are GDPR compliant.
Pricing
Dedicated inference is billed at a fixed hourly rate based on the GPU type selected. There are no per-token charges and no rate limits. The minimum billing unit is one hour.
See the dedicated endpoints pricing page for current GPU rates.
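For a rough cost estimate, the one-hour minimum billing unit can be modeled as rounding runtime up to whole hours. That rounding rule is an assumption, since this page does not spell out how partial hours are handled, and `billedHours` and `estimateCost` are hypothetical names.

```typescript
// Assumption: runtime is billed in whole hours, rounded up.
function billedHours(runtimeMinutes: number): number {
  if (runtimeMinutes <= 0) return 0; // endpoint never ran
  return Math.ceil(runtimeMinutes / 60);
}

// Estimated charge for one endpoint at a fixed hourly GPU rate.
function estimateCost(runtimeMinutes: number, gpuHourlyRate: number): number {
  return billedHours(runtimeMinutes) * gpuHourlyRate;
}
```

Under this assumption, an endpoint that runs for 61 minutes is billed for two hours.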
Creating an endpoint
Endpoints are created from the LLMBase dashboard:
1. Open Dashboard → Inference → Dedicated
2. Select a GPU type and data center location
3. Choose a model from the catalog, or upload custom weights
4. Click Deploy. The endpoint is ready within a few minutes
5. Copy the endpoint URL and use it in your application
The endpoint URL follows the pattern https://<endpoint-id>-inference.llmbase.cloud/v1. Pass this value, including the /v1 suffix, as the base URL in OpenAI-compatible clients.
Authentication
Use your LLMBase API key in the Authorization header, the same as the Serverless Inference API:
Authorization: Bearer <LLMBASE_API_KEY>
See Authentication for how to create and manage API keys.
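For clients that use plain HTTP rather than the OpenAI SDK, the request can be assembled by hand. The sketch below builds the URL, the Authorization header, and the JSON body for a chat completion. `buildChatRequest` and `ChatRequest` are hypothetical names, and the endpoint ID and API key are values you supply yourself.

```typescript
// Minimal request shape, compatible with the options accepted by fetch().
interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: assembles a chat completions request for a
// dedicated endpoint, authenticated with an LLMBase API key.
function buildChatRequest(
  endpointId: string,
  apiKey: string,
  model: string,
  prompt: string,
): ChatRequest {
  return {
    url: `https://${endpointId}-inference.llmbase.cloud/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (not executed here):
//   const { url, init } = buildChatRequest(id, key, "deepseek/deepseek-v4-flash", "Hello!");
//   const res = await fetch(url, init);
//   const data = await res.json();
```

Keeping request construction separate from the network call makes the header and body easy to inspect before anything is sent.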