Need more power?
Dedicated endpoints for 100+ open-source models
Performance and reliability at production scale. Handle traffic spikes seamlessly. Best for load-heavy applications.
Dedicated Endpoint
Your own dedicated GPU instance to reliably deploy models with unmatched price-performance at scale.
- Guaranteed Performance. Access your AI endpoints over a private, low-latency connection within a Virtual Private Cloud. Perfect for heavy workloads.
- Enterprise Security. Data sovereignty is ensured: your prompts and responses remain private, stored only in Europe, and inaccessible to third parties.
- Predictable Pricing. Dedicated GPU infrastructure ensures consistent and predictable performance, with unlimited tokens at a fixed hourly rate.
Transparent Pricing
Simple, Predictable Pricing
Pay only for what you use. Dedicated GPU instances with unlimited tokens at a fixed hourly rate.
Small Models
Llama 3.1 8B Instruct
llama-3.1-8b-instruct
| GPU | Hourly | Monthly |
|---|---|---|
| L4-1-24G | €0.93 | ~€679 |
| L40S-1-48G | €1.72 | ~€1256 |
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Mistral 7B Instruct v0.3
mistral-7b-instruct-v0.3
| GPU | Hourly | Monthly |
|---|---|---|
| L4-1-24G | €0.93 | ~€679 |
| L40S-1-48G | €1.72 | ~€1256 |
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Mistral Nemo Instruct
mistral-nemo-instruct-2407
| GPU | Hourly | Monthly |
|---|---|---|
| L40S-1-48G | €1.72 | ~€1256 |
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Large Models
Llama 3.3 70B Instruct
llama-3.3-70b-instruct
| GPU | Hourly | Monthly |
|---|---|---|
| H100-2-80G | €6.68 | ~€4876 |
Llama 3.1 70B Instruct
llama-3.1-70b-instruct
| GPU | Hourly | Monthly |
|---|---|---|
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Llama 3.1 Nemotron 70B
llama-3.1-nemotron-70b-instruct
| GPU | Hourly | Monthly |
|---|---|---|
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Molmo 72B
molmo-72b-0924
| GPU | Hourly | Monthly |
|---|---|---|
| H100-2-80G | €6.68 | ~€4876 |
Medium Models
Mixtral 8x7B Instruct v0.1
mixtral-8x7b-instruct-v0.1
| GPU | Hourly | Monthly |
|---|---|---|
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Multimodal
Pixtral 12B
pixtral-12b-2409
| GPU | Hourly | Monthly |
|---|---|---|
| L40S-1-48G | €1.72 | ~€1256 |
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Code Models
Qwen 2.5 Coder 32B
qwen2.5-coder-32b-instruct
| GPU | Hourly | Monthly |
|---|---|---|
| H100-1-80G | €3.40 | ~€2482 |
| H100-2-80G | €6.68 | ~€4876 |
Embedding Models
BGE Multilingual Gemma2
bge-multilingual-gemma2
| GPU | Hourly | Monthly |
|---|---|---|
| L4-1-24G | €0.93 | ~€679 |
| L40S-1-48G | €1.72 | ~€1256 |
Sentence T5 XXL
sentence-t5-xxl
| GPU | Hourly | Monthly |
|---|---|---|
| L4-1-24G | €0.93 | ~€679 |
GPU Specifications
L4-1-24G
1x NVIDIA L4 GPU
24GB VRAM
L40S-1-48G
1x NVIDIA L40S GPU
48GB VRAM
H100-1-80G
1x NVIDIA H100 GPU
80GB VRAM
H100-2-80G
2x NVIDIA H100 GPUs
160GB Total VRAM
* Monthly estimates based on 730 hours (30.4 days) of continuous usage
* All prices in EUR, excluding VAT
* Unlimited tokens included with all dedicated endpoints
* Cancel anytime, no long-term commitments required
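The monthly figures in the tables above follow directly from the hourly rates. A minimal sketch of that calculation (the rates are copied from the tables; 730 hours approximates one month of continuous usage):

```python
# Reproduce the monthly estimates from the hourly rates above.
# 730 hours = 8760 hours per year / 12 months (about 30.4 days).
HOURS_PER_MONTH = 730

# EUR per hour, excluding VAT, as listed in the pricing tables.
hourly_rates = {
    "L4-1-24G": 0.93,
    "L40S-1-48G": 1.72,
    "H100-1-80G": 3.40,
    "H100-2-80G": 6.68,
}

for gpu, rate in hourly_rates.items():
    # Round to the nearest euro, matching the "~€" estimates.
    print(f"{gpu}: ~€{round(rate * HOURS_PER_MONTH)}/month")
# → L4-1-24G: ~€679/month … H100-2-80G: ~€4876/month
```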
Scale Your Deployment with Dedicated Endpoints
Deploy your AI models on dedicated infrastructure with guaranteed performance and unlimited tokens.
FAQ
Frequently Asked Questions
Everything you need to know about our dedicated endpoints
How can I start using this service?
You'll find a comprehensive guide on getting started here, including details on deployment, security, and billing.
If you need support, don't hesitate to reach out through our dedicated Slack community channel, #inference-beta.
What are the security protocols for AI services?
Our AI services implement robust security measures to ensure customer data privacy and integrity. Our measures and policies are published in our documentation.
Can I use the OpenAI libraries and APIs?
Yes. You can seamlessly migrate applications that already use OpenAI. Any of the official OpenAI libraries, for example the OpenAI Python client library, can interact with your dedicated deployments.
Find the supported APIs and parameters here.
What are the advantages over mutualized LLM API services?
- Complete isolation of computing and networking resources to ensure maximum control for sensitive applications
- Consistent and predictable performance, unaffected by the activity of other users
- No strict rate limits—usage is only constrained by the maximum load your deployment can handle
- Access to a wider range of models
- More cost-effective with high utilization
Do you have pay-per-token hosted models?
Managed Inference deploys AI models and creates dedicated endpoints on a secure production infrastructure.
Alternatively, we have a selection of hosted models in our datacenters, priced per million tokens consumed, available via API. Find all details on the Inference API page.
I've got a request, where can I share it?
Tell us the good and the bad about your experience here. Thank you for your time!
What are the different types of AI inference?
Two broad categories of inference can be distinguished in the field of artificial intelligence:
Deductive Inference
Applies general rules to reach specific conclusions, such as a medical expert system that diagnoses a pathology based on symptoms.
Inductive Inference
Works the opposite way, inferring general principles from specific observations. A neural network that learns to recognize faces after analyzing thousands of photos is a prime example.
These two approaches are available in different deployment modes: batch inference for processing large volumes of data, and real-time inference for applications requiring instantaneous responses, such as autonomous vehicles.