Dedicated Inference

Deploy open-source and custom AI models on isolated GPUs in European data centers. Managed endpoints with an OpenAI-compatible API.

Updated May 20, 2026

Dedicated Inference gives you a fully managed inference endpoint on a GPU reserved exclusively for your workload. You get an OpenAI-compatible API URL — LLMBase handles the deployment, model loading, runtime, and hardware. There is no SSH access, no container configuration, and no infrastructure to manage.

Coming soon. Dedicated Inference endpoints will be created and managed from the LLMBase dashboard. Contact enterprise@llmbase.ai to get early access.

What you get

When you create a dedicated endpoint, you receive a unique base URL:

https://<endpoint-id>-inference.llmbase.cloud

This endpoint exposes the standard OpenAI-compatible chat completions interface:

POST https://<endpoint-id>-inference.llmbase.cloud/v1/chat/completions

Use it with the OpenAI SDK, any OpenAI-compatible client, or plain HTTP — the same way you would use https://api.llmbase.ai/v1.

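import OpenAI from "openai";

// Point the SDK at your dedicated endpoint instead of the shared API.
const client = new OpenAI({
  baseURL: "https://<endpoint-id>-inference.llmbase.cloud/v1",
  apiKey: process.env.LLMBASE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the assistant's reply.
console.log(response.choices[0].message.content);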
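
Because the dedicated endpoint and the shared API expose the same interface, switching between them can come down to changing the base URL. The following is a minimal sketch, assuming a hypothetical helper named resolveBaseURL (not part of any LLMBase SDK) and an endpoint ID supplied by your own configuration:

```typescript
// Hypothetical helper for illustration; not part of any LLMBase SDK.
const SERVERLESS_BASE_URL = "https://api.llmbase.ai/v1";

function resolveBaseURL(endpointId?: string): string {
  // A missing or blank ID falls back to the shared serverless API.
  if (!endpointId || endpointId.trim() === "") {
    return SERVERLESS_BASE_URL;
  }
  // Otherwise build the dedicated URL from the documented pattern.
  return `https://${endpointId.trim()}-inference.llmbase.cloud/v1`;
}
```

You could then pass resolveBaseURL(endpointId) as the baseURL option when constructing the client, leaving the rest of the application code unchanged.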

Supported models

Dedicated endpoints support:

Open-source foundation models — any model available in the LLMBase model catalog
Your own fine-tuned models — bring weights you have fine-tuned on your own data
Custom models — deploy a private model not in the public catalog

For custom and fine-tuned models, you provide the model weights (GGUF, safetensors, or vLLM-compatible format) during endpoint setup. LLMBase loads and serves them on the reserved GPU — you do not configure vLLM, Triton, or any inference runtime yourself.

How it differs from Serverless Inference

	Dedicated Inference	Serverless Inference
GPU	Reserved exclusively for you	Shared across users
Pricing	Fixed hourly rate per GPU	Per input/output token
Rate limits	None — only your GPU capacity	Shared pool limits apply
Latency	Consistent, no cold starts	May vary under load
Custom models	Yes — bring your own weights	No
Setup	Endpoint created via dashboard	API key only
Best for	Steady high-throughput workloads	Spiky or experimental traffic

Choose Dedicated Inference when you run continuous high-throughput workloads, need predictable latency, or want to deploy a custom or fine-tuned model. Choose Serverless Inference when you need to start quickly, have unpredictable traffic, or want pay-per-token billing.
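
One rough way to compare the two billing models is to find the token volume at which a fixed hourly rate equals per-token charges. The sketch below simplifies by using a single blended per-token price (serverless billing distinguishes input and output tokens), and every rate in it is a placeholder rather than an LLMBase price:

```typescript
// Tokens per hour at which dedicated and serverless billing cost the same.
// Both rates are placeholders; take real values from the pricing pages.
function breakEvenTokensPerHour(
  gpuHourlyRate: number, // cost of one GPU-hour
  pricePerMillionTokens: number, // blended input/output price per 1M tokens
): number {
  if (gpuHourlyRate <= 0 || pricePerMillionTokens <= 0) {
    throw new RangeError("both rates must be greater than zero");
  }
  return (gpuHourlyRate / pricePerMillionTokens) * 1_000_000;
}

// Illustrative numbers only: 2.00 per GPU-hour versus 0.50 per 1M tokens.
const threshold = breakEvenTokensPerHour(2.0, 0.5);
console.log(threshold); // 4000000
```

Sustained traffic above the threshold favors a dedicated endpoint on cost alone, provided the selected GPU can actually serve that volume.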

Isolation and compliance

Each dedicated endpoint runs on a GPU that is not shared with any other customer for the lifetime of the endpoint. Compute and networking resources are fully isolated. All endpoints are hosted in European data centers (Falkenstein, Germany and Helsinki, Finland) and are GDPR-compliant.

Pricing

Dedicated Inference is billed at a fixed hourly rate based on the GPU type selected. There are no per-token charges and no rate limits; throughput is bounded only by the capacity of your GPU. The minimum billing unit is one hour.

See the dedicated endpoints pricing page for current GPU rates.
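
As a worked example of hourly billing, the sketch below estimates a charge from an endpoint's runtime. It assumes that partial hours are rounded up to the next whole hour, which is one reading of the one-hour minimum billing unit and is not confirmed here. The rate is a placeholder.

```typescript
// Assumption: usage is billed in whole hours, rounded up, with a one-hour minimum.
function estimateDedicatedCost(runtimeMinutes: number, gpuHourlyRate: number): number {
  if (runtimeMinutes < 0 || gpuHourlyRate < 0) {
    throw new RangeError("runtimeMinutes and gpuHourlyRate must be non-negative");
  }
  const billedHours = Math.max(1, Math.ceil(runtimeMinutes / 60));
  return billedHours * gpuHourlyRate;
}

// 90 minutes at a placeholder rate of 2.00 per hour is billed as 2 hours.
console.log(estimateDedicatedCost(90, 2.0)); // 4
```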

Creating an endpoint

Endpoints are created from the LLMBase dashboard:

1. Open Dashboard → Inference → Dedicated.
2. Select a GPU type and data center location.
3. Choose a model from the catalog, or upload custom weights.
4. Click Deploy. The endpoint is ready within a few minutes.
5. Copy the endpoint URL and use it in your application.

The endpoint URL follows the pattern https://<endpoint-id>-inference.llmbase.cloud/v1.

Authentication

Send your LLMBase API key in the Authorization header, exactly as you do for the Serverless Inference API:

Authorization: Bearer <LLMBASE_API_KEY>

See Authentication for how to create and manage API keys.
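
For clients that use plain HTTP instead of the OpenAI SDK, the request can be assembled by hand. The following is a minimal sketch; buildChatRequest and sendChat are hypothetical helper names, and the request body carries only the model and messages fields used in the SDK example earlier on this page.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: assembles the URL, Authorization header, and JSON body.
function buildChatRequest(
  endpointId: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  return {
    url: `https://${endpointId}-inference.llmbase.cloud/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Sends the request with the global fetch available in Node 18+ and browsers.
async function sendChat(req: ChatRequest): Promise<unknown> {
  const res = await fetch(req.url, req.init);
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return res.json();
}
```

Keeping request construction separate from sending makes the header and URL logic easy to check without a live endpoint.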