Need more power?

Dedicated endpoints for 100+ open-source models

Performance and reliability at production scale. Handle traffic spikes seamlessly. Best for load-heavy applications.

Dedicated Endpoint

Your own dedicated GPU instance to deploy models reliably, with unmatched price-performance at scale.

Guaranteed Performance.
Access your AI endpoints over a private, low-latency connection within a Virtual Private Cloud. Perfect for heavy workloads.
Enterprise Security.
Data sovereignty is ensured: your prompts and responses remain private, are stored only in Europe, and are inaccessible to third parties.
Predictable Pricing.
Dedicated GPU infrastructure ensures consistent and predictable performance, with unlimited tokens at a fixed hourly rate.

Transparent Pricing

Simple, Predictable Pricing

Pay only for what you use. Dedicated GPU instances with unlimited tokens at a fixed hourly rate.

Small Models

Llama 3.1 8B Instruct

llama-3.1-8b-instruct

GPU Hourly Monthly
L4-1-24G €0.93 ~€679
L40S-1-48G €1.72 ~€1256
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Mistral 7B Instruct v0.3

mistral-7b-instruct-v0.3

GPU Hourly Monthly
L4-1-24G €0.93 ~€679
L40S-1-48G €1.72 ~€1256
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Mistral Nemo Instruct

mistral-nemo-instruct-2407

GPU Hourly Monthly
L40S-1-48G €1.72 ~€1256
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Large Models

Llama 3.3 70B Instruct

llama-3.3-70b-instruct

GPU Hourly Monthly
H100-2-80G €6.68 ~€4876

Llama 3.1 70B Instruct

llama-3.1-70b-instruct

GPU Hourly Monthly
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Llama 3.1 Nemotron 70B

llama-3.1-nemotron-70b-instruct

GPU Hourly Monthly
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Molmo 72B

molmo-72b-0924

GPU Hourly Monthly
H100-2-80G €6.68 ~€4876

Medium Models

Mixtral 8x7B Instruct v0.1

mixtral-8x7b-instruct-v0.1

GPU Hourly Monthly
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Multimodal

Pixtral 12B

pixtral-12b-2409

GPU Hourly Monthly
L40S-1-48G €1.72 ~€1256
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Code Models

Qwen 2.5 Coder 32B

qwen2.5-coder-32b-instruct

GPU Hourly Monthly
H100-1-80G €3.40 ~€2482
H100-2-80G €6.68 ~€4876

Embedding Models

BGE Multilingual Gemma2

bge-multilingual-gemma2

GPU Hourly Monthly
L4-1-24G €0.93 ~€679
L40S-1-48G €1.72 ~€1256

Sentence T5 XXL

sentence-t5-xxl

GPU Hourly Monthly
L4-1-24G €0.93 ~€679

GPU Specifications

L4-1-24G

1x NVIDIA L4 GPU

24GB VRAM

L40S-1-48G

1x NVIDIA L40S GPU

48GB VRAM

H100-1-80G

1x NVIDIA H100 GPU

80GB VRAM

H100-2-80G

2x NVIDIA H100 GPUs

160GB Total VRAM

* Monthly estimates based on 730 hours (30.4 days) of continuous usage

* All prices in EUR, excluding VAT

* Unlimited tokens included with all dedicated endpoints

* Cancel anytime, no long-term commitments required
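The monthly estimates in the tables above follow directly from the 730-hour footnote; a quick sketch to reproduce them (hourly rates taken from the pricing tables):

```python
# Monthly estimate = hourly rate x 730 hours (30.4 days) of continuous usage.
HOURS_PER_MONTH = 730

def monthly_estimate(hourly_rate_eur: float) -> int:
    """Round the continuous-usage monthly cost to the nearest euro."""
    return round(hourly_rate_eur * HOURS_PER_MONTH)

# Hourly rates from the pricing tables above.
rates = {"L4-1-24G": 0.93, "L40S-1-48G": 1.72, "H100-1-80G": 3.40, "H100-2-80G": 6.68}
for gpu, rate in rates.items():
    print(f"{gpu}: ~€{monthly_estimate(rate)}")
```

Running this reproduces the ~€679, ~€1256, ~€2482, and ~€4876 figures shown above (before VAT).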

Scale Your Deployment with Dedicated Endpoints

Deploy your AI models on dedicated infrastructure with guaranteed performance and unlimited tokens.

FAQ

Frequently Asked Questions

Everything you need to know about our dedicated endpoints

How can I start using this service?

You'll find a comprehensive getting-started guide here, including details on deployment, security, and billing.

If you need support, don't hesitate to reach out through our dedicated Slack community, #inference-beta.

What are the security protocols for AI services?

Our AI services implement robust security measures to ensure customer data privacy and integrity. Our measures and policies are published in our documentation.

Can I use the OpenAI libraries and APIs?

Yes. Applications already built on OpenAI can transition seamlessly: any of the official OpenAI libraries, such as the OpenAI Python client, can be used to interact with your dedicated deployments.

Find the supported APIs and parameters here.
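As a sketch, pointing the official OpenAI Python client at a dedicated deployment looks like the following; the endpoint URL, API key, and model slug are placeholders to replace with your own deployment's values:

```python
from openai import OpenAI

# Placeholder values: substitute your deployment's endpoint URL and API key.
client = OpenAI(
    base_url="https://<your-deployment-id>.example.com/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # the model slug of your deployment
    messages=[{"role": "user", "content": "Summarize dedicated endpoints in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing code usually needs only the `base_url` and `api_key` changed.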

What are the advantages over shared (multi-tenant) LLM API services?

  • Complete isolation of computing and networking resources to ensure maximum control for sensitive applications
  • Consistent and predictable performance, unaffected by the activity of other users
  • No strict rate limits—usage is only constrained by the maximum load your deployment can handle
  • Access to a wider range of models
  • More cost-effective with high utilization
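To illustrate the last point, here is a rough break-even sketch; the pay-per-token price used below is a made-up illustrative figure, not a quoted rate:

```python
# Break-even throughput between a pay-per-token API and a dedicated endpoint.
PER_MILLION_TOKENS_EUR = 0.20   # hypothetical shared-API price, for illustration only
DEDICATED_HOURLY_EUR = 0.93     # L4-1-24G hourly rate from the tables above

# Tokens per hour at which both options cost the same:
break_even_tokens_per_hour = DEDICATED_HOURLY_EUR / PER_MILLION_TOKENS_EUR * 1_000_000
print(f"Break-even: ~{break_even_tokens_per_hour / 1e6:.2f}M tokens/hour")
# Above this sustained throughput, the fixed hourly rate is the cheaper option.
```

Under these assumed prices the crossover sits around 4.65M tokens per hour; the higher and steadier your utilization, the more the fixed hourly rate works in your favor.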

Do you have pay-per-token hosted models?

Managed Inference deploys AI models and creates dedicated endpoints on a secure production infrastructure.

Alternatively, we have a selection of hosted models in our datacenters, priced per million tokens consumed, available via API. Find all details on the Inference API page.

I've got a request, where can I share it?

Tell us the good and the bad about your experience here. Thank you for your time!

What are the different types of AI inference?

Two broad categories of inference can be distinguished in the field of artificial intelligence:

Deductive Inference

Applies general rules to reach specific conclusions, such as a medical expert system that diagnoses a pathology based on symptoms.

Inductive Inference

Works the other way around, inferring general principles from specific observations. A neural network that learns to recognize faces after analyzing thousands of photos is a prime example.

These two approaches are available in different deployment modes: batch inference for processing large volumes of data, and real-time inference for applications requiring instantaneous responses, such as autonomous vehicles.