LLMBase | Docs

Dedicated Inference

Deploy open-source and custom AI models on isolated GPUs in European data centers. Managed endpoints with an OpenAI-compatible API.

Dedicated Inference gives you a fully managed inference endpoint on a GPU reserved exclusively for your workload. You get an OpenAI-compatible API URL — LLMBase handles the deployment, model loading, runtime, and hardware. There is no SSH access, no container configuration, and no infrastructure to manage.

Coming soon. Dedicated Inference endpoints will be created and managed from the LLMBase dashboard. Contact enterprise@llmbase.ai to get early access.

What you get

When you create a dedicated endpoint you receive a unique base URL:

https://<endpoint-id>-inference.llmbase.cloud

This endpoint exposes the standard OpenAI-compatible chat completions interface:

POST https://<endpoint-id>-inference.llmbase.cloud/v1/chat/completions

Use it with the OpenAI SDK, any OpenAI-compatible client, or plain HTTP — the same way you would use https://api.llmbase.ai/v1.

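import OpenAI from "openai";

// Point the SDK at the dedicated endpoint instead of the default API host.
const client = new OpenAI({
  baseURL: "https://<endpoint-id>-inference.llmbase.cloud/v1",
  apiKey: process.env.LLMBASE_API_KEY, // your LLMBase API key
});

// Send a single-turn chat completion request.
const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the assistant's reply.
console.log(response.choices[0].message.content);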
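
For clients that do not use the OpenAI SDK, the same call can be made over plain HTTP. The sketch below only builds the request so it can be inspected without touching the network. `buildChatRequest`, `ChatMessage`, and `HttpRequest` are illustrative names rather than part of any LLMBase SDK, and the endpoint ID and API key are placeholders.

```typescript
// Sketch: assemble a plain-HTTP chat completions request for a dedicated endpoint.
// All type and function names here are hypothetical helpers for illustration.

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  endpointId: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): HttpRequest {
  return {
    // URL pattern as documented for dedicated endpoints.
    url: `https://${endpointId}-inference.llmbase.cloud/v1/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // Standard OpenAI-style chat completions payload.
    body: JSON.stringify({ model, messages }),
  };
}

// Placeholder endpoint ID and key; substitute your own values.
const req = buildChatRequest("my-endpoint", "sk-example", "deepseek/deepseek-v4-flash", [
  { role: "user", content: "Hello!" },
]);
console.log(req.url);
```

To send the request, pass the pieces to any HTTP client, for example `await fetch(req.url, { method: req.method, headers: req.headers, body: req.body })` in a runtime that provides `fetch`.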

Supported models

Dedicated endpoints support:

  • Open-source foundation models — any model available in the LLMBase model catalog
  • Your own fine-tuned models — bring weights you have fine-tuned on your own data
  • Custom models — deploy a private model not in the public catalog

For custom and fine-tuned models, you provide the model weights (GGUF, safetensors, or vLLM-compatible format) during endpoint setup. LLMBase loads and serves them on the reserved GPU — you do not configure vLLM, Triton, or any inference runtime yourself.

How it differs from Serverless Inference

|               | Dedicated Inference                     | Serverless Inference         |
|---------------|-----------------------------------------|------------------------------|
| GPU           | Reserved exclusively for you            | Shared across users          |
| Pricing       | Fixed hourly rate per GPU               | Per input/output token       |
| Rate limits   | None; limited only by your GPU capacity | Shared pool limits apply     |
| Latency       | Consistent, no cold starts              | May vary under load          |
| Custom models | Yes (bring your own weights)            | No                           |
| Setup         | Endpoint created via dashboard          | API key only                 |
| Best for      | Steady high-throughput workloads        | Spiky or experimental traffic |

Choose Dedicated Inference when you run continuous high-throughput workloads, need predictable latency, or want to deploy a custom or fine-tuned model. Choose Serverless Inference when you need to start quickly, have unpredictable traffic, or want pay-per-token billing.
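
One way to make the cost side of this choice concrete is to compare the fixed hourly GPU rate against what the same traffic would cost per token. The sketch below does that arithmetic. Every price in it is a hypothetical placeholder, not an LLMBase rate, and `serverlessHourlyCost` and `cheaperOption` are illustrative names. It also ignores whether a single GPU can actually sustain the given throughput.

```typescript
// Sketch: compare per-token cost against a fixed hourly GPU rate.
// All prices are HYPOTHETICAL placeholders; check the pricing pages for real rates.

interface TokenPrice {
  inputPerMTok: number; // price per million input tokens
  outputPerMTok: number; // price per million output tokens
}

// Cost of one hour of traffic when billed per token.
function serverlessHourlyCost(
  inputTokPerHour: number,
  outputTokPerHour: number,
  price: TokenPrice,
): number {
  return (
    (inputTokPerHour / 1_000_000) * price.inputPerMTok +
    (outputTokPerHour / 1_000_000) * price.outputPerMTok
  );
}

// Returns whichever option is cheaper for the given steady hourly traffic.
function cheaperOption(
  gpuHourlyRate: number,
  inputTokPerHour: number,
  outputTokPerHour: number,
  price: TokenPrice,
): "dedicated" | "serverless" {
  const perToken = serverlessHourlyCost(inputTokPerHour, outputTokPerHour, price);
  return perToken > gpuHourlyRate ? "dedicated" : "serverless";
}

const examplePrice: TokenPrice = { inputPerMTok: 0.5, outputPerMTok: 1.5 }; // hypothetical
const exampleGpuRate = 2.0; // hypothetical hourly rate

// Light traffic: 1M input + 0.5M output tokens per hour costs 1.25 per token-billed hour.
console.log(cheaperOption(exampleGpuRate, 1_000_000, 500_000, examplePrice)); // "serverless"
// Heavy traffic: 4M input + 2M output tokens per hour costs 5 per token-billed hour.
console.log(cheaperOption(exampleGpuRate, 4_000_000, 2_000_000, examplePrice)); // "dedicated"
```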

Isolation and compliance

Each dedicated endpoint runs on a GPU that is not shared with any other customer for the lifetime of the endpoint. Compute and networking resources are fully isolated. All endpoints are hosted in European data centers (Falkenstein, Germany and Helsinki, Finland) and are GDPR-compliant.

Pricing

Dedicated Inference is billed at a fixed hourly rate based on the GPU type selected. There are no per-token charges and no rate limits beyond the capacity of your GPU. The minimum billing unit is one hour.
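
As a rough illustration of hourly billing, the sketch below estimates a charge from runtime. It assumes that "minimum billing unit" means partial hours are rounded up to whole hours, which is an interpretation and not something this page states. The hourly rate is a hypothetical placeholder, and `billedHours` and `estimateCost` are illustrative names.

```typescript
// Sketch: estimate a dedicated endpoint charge from its runtime.
// ASSUMPTION: partial hours are rounded up to the next whole hour.
// The rate used below is a hypothetical placeholder, not an LLMBase price.

function billedHours(runtimeMinutes: number): number {
  if (runtimeMinutes <= 0) return 0; // endpoint never ran
  return Math.ceil(runtimeMinutes / 60); // round up to whole hours
}

function estimateCost(runtimeMinutes: number, hourlyRate: number): number {
  return billedHours(runtimeMinutes) * hourlyRate;
}

console.log(billedHours(1)); // 1
console.log(billedHours(61)); // 2
console.log(estimateCost(90, 2.5)); // 5 (2 billed hours at a rate of 2.5)
```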

See the dedicated endpoints pricing page for current GPU rates.

Creating an endpoint

Endpoints are created from the LLMBase dashboard:

  1. Open Dashboard → Inference → Dedicated
  2. Select a GPU type and data center location
  3. Choose a model from the catalog, or upload custom weights
  4. Click Deploy — the endpoint is ready within a few minutes
  5. Copy the endpoint URL and use it in your application

When configuring an OpenAI-compatible client, use the endpoint URL with the /v1 path as the base URL: https://<endpoint-id>-inference.llmbase.cloud/v1.
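
Because dedicated and serverless endpoints share the same API shape, an application can switch between them by changing only the base URL. The sketch below shows one way to do that. `resolveBaseUrl` is an illustrative helper, and the check on the endpoint ID format (letters, digits, hyphens) is an assumption, since the ID format is not specified here.

```typescript
// Sketch: pick the base URL for an OpenAI-compatible client.
// `resolveBaseUrl` is a hypothetical helper, not part of any LLMBase SDK.

const SERVERLESS_BASE_URL = "https://api.llmbase.ai/v1";

function resolveBaseUrl(endpointId?: string): string {
  // No dedicated endpoint configured: fall back to the serverless API.
  if (endpointId === undefined || endpointId === "") {
    return SERVERLESS_BASE_URL;
  }
  // ASSUMPTION: endpoint IDs contain only letters, digits, and hyphens.
  if (!/^[A-Za-z0-9-]+$/.test(endpointId)) {
    throw new Error(`Unexpected endpoint ID: ${endpointId}`);
  }
  return `https://${endpointId}-inference.llmbase.cloud/v1`;
}

console.log(resolveBaseUrl()); // https://api.llmbase.ai/v1
console.log(resolveBaseUrl("abc123")); // https://abc123-inference.llmbase.cloud/v1
```

The result can be passed as `baseURL` when constructing the OpenAI client, for example with the endpoint ID read from an environment variable of your choosing.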

Authentication

Use your LLMBase API key in the Authorization header, the same as the Serverless Inference API:

Authorization: Bearer <LLMBASE_API_KEY>

See Authentication for how to create and manage API keys.