Tilkal Team · Updated · 16 min read

How to Deploy an LLM On-Premises: A Step-by-Step Guide

A practical guide to deploying large language models on your own infrastructure. Covers hardware selection, model choice, inference engines, orchestration, and production operations.

LLM Deployment · On-Premises AI · Infrastructure · Tutorial · Sovereign AI

Key Takeaways:

  • A single NVIDIA H100 GPU ($25,000–$40,000) can serve Llama 3 70B (FP8) at roughly 12,500 tokens per second of aggregate throughput
  • vLLM and SGLang are the leading open-source inference engines, with SGLang achieving up to 29% higher throughput in some benchmarks
  • Production deployments require seven layers: hardware, platform, inference engine, model management, API gateway, application services, and user interface
  • Quantization techniques (FP8, INT4, AWQ) can reduce VRAM requirements by 50–75% with minimal quality loss
  • The complete stack can be deployed in 4–8 weeks with the right expertise

Why Deploy Your Own LLM?

Cloud AI APIs are the fastest way to prototype. But at production scale, self-hosted LLMs deliver dramatically lower costs, complete data sovereignty, and the freedom to customize models for your specific domain.

According to Lenovo's 2026 TCO analysis, self-hosted inference can be up to 18x cheaper than cloud APIs over three years. Beyond economics, sovereign AI deployment eliminates third-party data exposure — a requirement for regulated industries and a strategic advantage for everyone else.

This guide walks through every step of deploying an LLM on your own infrastructure, from hardware selection through production operations.

Step 1: Select Your Hardware

GPU selection is the foundation of your deployment. The right choice depends on model size, throughput requirements, and budget.

GPU Options

| GPU | VRAM | Approx. Cost | Best For |
|---|---|---|---|
| NVIDIA H100 SXM | 80 GB | $25,000–$40,000 | Enterprise production workloads |
| NVIDIA H200 | 141 GB | $30,000–$45,000 | Large models (70B+) without multi-GPU |
| NVIDIA A100 | 80 GB | $10,000–$15,000 | Cost-effective production deployment |
| NVIDIA RTX 4090 | 24 GB | $1,600–$2,000 | Small models, dev/test, POC |
| NVIDIA RTX 5090 | 32 GB | $2,000–$2,500 | Small-to-medium models, dev/test |
| AMD MI300X | 192 GB | $10,000–$15,000 | Large models, AMD alternative |

VRAM Requirements by Model Size

VRAM determines which models you can serve. A rough guide for FP16 (full precision):

| Model Size | VRAM Required (FP16) | VRAM with INT4 Quantization | Minimum GPU Config |
|---|---|---|---|
| 7B–8B | ~16 GB | ~4–6 GB | 1x RTX 4090 |
| 13B | ~26 GB | ~8–10 GB | 1x RTX 4090 (INT4) or 1x A100 |
| 34B | ~68 GB | ~20 GB | 1x A100 80GB |
| 70B | ~140 GB | ~40 GB | 2x A100 or 1x H200 |
| 70B (FP8) | ~70 GB | — | 1x H100 80GB |
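The table values follow from simple arithmetic: weights take one byte per parameter per 8 bits of precision, plus headroom for the KV cache and activations. A minimal sketch of that estimate (the 20% overhead factor is a rough assumption, not a sizing tool):

```python
# Back-of-the-envelope VRAM estimate matching the table above:
# weights = params * bytes_per_param, plus ~20% overhead for KV cache
# and activations (an assumed figure; real overhead varies with workload).
def estimate_vram_gb(params_billion: float, bits_per_param: int,
                     overhead: float = 0.20) -> float:
    weight_gb = params_billion * bits_per_param / 8  # 1B params @ 8 bits = 1 GB
    return round(weight_gb * (1 + overhead), 1)

# 70B at FP16: weights alone ~140 GB; ~168 GB with overhead.
# 70B at FP8 halves that; INT4 quarters it.
```

This is why a 70B model needs 2x A100 at FP16 but fits a single H100 at FP8.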

Sizing Recommendations

For most enterprise use cases, start with a single-node, two-GPU configuration (e.g., 2x A100 80GB or 1x H100). This handles 70B-class models with room for growth. Scale horizontally when throughput demands exceed single-node capacity.

Factor in networking (100 Gbps+ for multi-node), storage (NVMe for model weights — loading a 70B model from spinning disk takes minutes vs. seconds), and power (a single H100 draws ~700W under load).

Step 2: Choose Your Model

Open-source models have reached parity with proprietary alternatives for most enterprise tasks. The key families to evaluate:

Recommended Models (February 2026)

| Model Family | Sizes Available | License | Strengths |
|---|---|---|---|
| Llama 3 (Meta) | 8B, 70B, 405B | Llama 3 Community | Best general-purpose; strong reasoning and coding |
| Mistral / Mixtral | 7B, 8x7B, 8x22B | Apache 2.0 | Excellent efficiency; MoE architecture reduces compute |
| Qwen 2.5 (Alibaba) | 0.5B–72B | Apache 2.0 | Strong multilingual; competitive benchmarks |
| DeepSeek | 7B, 67B, MoE | MIT | Competitive performance; cost-efficient MoE |
| Phi-3/4 (Microsoft) | 3.8B, 14B | MIT | Small-model performance; edge deployment |

Model Selection Criteria

  1. Task fit. General-purpose (Llama 3 70B), coding (Code Llama, DeepSeek Coder), or domain-specific?
  2. VRAM budget. Match model size to available GPU memory, accounting for KV cache overhead.
  3. License. Llama 3 Community License has usage restrictions above 700M monthly active users. Apache 2.0 and MIT are fully permissive.
  4. Quantization tolerance. Models quantized to FP8 or INT4 typically lose 1–3% on benchmarks while cutting VRAM requirements by 50–75%. For most enterprise tasks, the tradeoff is worth it.
  5. Language requirements. If you need strong non-English performance, Qwen 2.5 and Llama 3 lead in multilingual capabilities.

Step 3: Select Your Inference Engine

The inference engine serves your model — it manages GPU memory, batches requests, and exposes an API. This is the most critical software decision in your stack.

Engine Comparison

| Engine | Key Feature | Throughput | Best For |
|---|---|---|---|
| vLLM | PagedAttention memory management | ~12,500 tok/s (H100, Llama 70B FP8) | Production deployments; widest model support |
| SGLang | RadixAttention + compiler optimizations | Up to 29% faster than vLLM in some configs | High-throughput production; structured output |
| Ollama | Single-binary simplicity | Moderate | Development, testing, single-user |
| llama.cpp | CPU inference, GGUF format | Varies | Edge, CPU-only, resource-constrained |
| NVIDIA TensorRT-LLM | NVIDIA-optimized kernels | Highest (NVIDIA hardware only) | Maximum throughput on NVIDIA GPUs |

Our Recommendation

For production sovereign AI deployments, vLLM is the most mature and widely deployed option — it powers inference at major cloud providers and enterprises globally. It provides an OpenAI-compatible API out of the box, making integration straightforward.

SGLang is the performance leader in recent benchmarks (up to 29% higher throughput) and is increasingly production-ready. Consider it for workloads where throughput is the primary constraint.

Both engines support:

  • Continuous batching for maximum GPU utilization
  • Multi-GPU tensor parallelism
  • Streaming responses
  • LoRA adapter hot-switching
  • Prometheus metrics for monitoring
  • Quantized model serving (FP8, INT4, AWQ, GPTQ)
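The OpenAI-compatible API means client code needs no special SDK. A minimal sketch using only the standard library — the base URL and model id are assumptions to substitute with your own:

```python
# Sketch: calling a vLLM/SGLang server through its OpenAI-compatible
# /v1/chat/completions endpoint. BASE_URL and MODEL are assumptions
# (vLLM's default port is 8000; use whatever model id your server loaded).
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def build_chat_request(prompt: str, temperature: float = 0.2) -> dict:
    """Assemble an OpenAI-format chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": 512,
    }

def chat(prompt: str) -> str:
    """POST the payload to the inference server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Existing integrations built for the OpenAI API typically only need the base URL changed.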

Step 4: Set Up Your Platform

Production deployments need orchestration, not bare-metal scripts.

Container-Based Deployment

Package your inference engine and model into containers:

  1. Base image. Use NVIDIA's CUDA container images as a foundation. They include GPU drivers and CUDA toolkit.
  2. Model weights. Store model weights on a persistent volume. For air-gapped environments, pre-download and verify checksums against official repositories.
  3. Configuration. Parameterize model name, GPU count, quantization method, and serving parameters as environment variables.

Kubernetes Orchestration

For multi-model or multi-replica deployments, Kubernetes with the NVIDIA GPU Operator provides:

  • GPU scheduling. Kubernetes allocates GPUs to pods automatically via the nvidia.com/gpu resource type.
  • Horizontal scaling. Add replicas as throughput demands increase.
  • Health checks. Restart unhealthy inference pods automatically.
  • Rolling updates. Swap models or engine versions without downtime.
  • Resource isolation. Prevent one model's workload from starving another.

A typical Kubernetes deployment for vLLM includes:

  1. GPU Operator installation. Install the NVIDIA GPU Operator via Helm, which handles driver management, container toolkit, and device plugin lifecycle automatically.
  2. Persistent volume for model weights. Create a PersistentVolumeClaim backed by NVMe storage. A 70B model in FP16 requires approximately 140 GB; in FP8, approximately 70 GB.
  3. Deployment manifest. Define the inference server pod with GPU resource requests (nvidia.com/gpu: 2 for tensor parallelism across 2 GPUs), memory limits, health probes (vLLM exposes /health by default), and environment variables for model configuration.
  4. Service and Ingress. Expose the inference server internally via a ClusterIP Service, then route through an Ingress controller with TLS termination.
  5. Horizontal Pod Autoscaler (optional). For variable workloads, configure HPA based on custom Prometheus metrics like queue depth or p95 latency. GPU-bound workloads scale by adding complete replicas rather than fractional scaling.
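The steps above can be sketched as a single Deployment manifest. This is an illustrative fragment, not a canonical configuration — names, the image tag, the model id, and mount paths are placeholders for your own:

```yaml
# Illustrative vLLM Deployment fragment (placeholders throughout).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm}
  template:
    metadata:
      labels: {app: vllm}
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest        # official vLLM serving image
        args: ["--model", "meta-llama/Meta-Llama-3-70B-Instruct",
               "--tensor-parallel-size", "2"]
        resources:
          limits:
            nvidia.com/gpu: 2                 # tensor parallelism across 2 GPUs
        readinessProbe:
          httpGet: {path: /health, port: 8000}
        volumeMounts:
        - {name: model-cache, mountPath: /root/.cache/huggingface}
      volumes:
      - name: model-cache
        persistentVolumeClaim: {claimName: model-weights-pvc}
```

A Service and Ingress (step 4) then route external traffic to this pod.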

Key Kubernetes considerations for GPU workloads:

  • GPU pods cannot share GPUs across containers by default — each container gets exclusive access to requested GPUs
  • Node affinity rules ensure inference pods land on GPU nodes, not CPU-only nodes
  • Pod disruption budgets limit voluntary evictions (e.g., during node drains or cluster maintenance), maintaining availability
  • Init containers can pre-download model weights from an internal registry to the persistent volume before the inference container starts

For simpler deployments (single model, limited scale), Docker Compose is sufficient and operationally simpler.

Networking

Set up an API gateway (NGINX, Traefik, or Kong) in front of your inference engine to provide:

  • TLS termination
  • Authentication (API keys, OAuth, mTLS)
  • Rate limiting
  • Request routing (if serving multiple models)
  • Load balancing across replicas

Step 5: Build Your Application Layer

The inference engine provides raw model access. The application layer turns it into something useful.

RAG Pipeline

For most enterprise use cases, you will need Retrieval-Augmented Generation (RAG) to ground model responses in your actual data:

  1. Vector database. Deploy Milvus, Qdrant, or pgvector to store document embeddings. For sovereign deployments, all three are fully self-hosted.
  2. Embedding model. Run a local embedding model (BGE, E5, or Nomic) to convert documents to vectors.
  3. Chunking pipeline. Break documents into searchable segments (256–1,024 tokens). Include metadata for filtering.
  4. Retrieval + generation. Wire the vector database search results into your LLM prompt for grounded responses.
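The retrieve-then-generate wiring in step 4 can be sketched in a few lines. This toy version substitutes a bag-of-words vector and cosine similarity for a real embedding model and vector database — every name here is illustrative, and a production system would use BGE/E5 embeddings with Milvus, Qdrant, or pgvector:

```python
# Toy sketch of retrieval-augmented prompting: score chunks against the
# query, then splice the top matches into the LLM prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a word-count vector (real systems use BGE/E5)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Ground the model's answer in retrieved context."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The resulting prompt is what you send to the inference engine's chat endpoint.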

Guardrails and Safety

Production systems need output controls:

  • Input validation and prompt injection defense
  • Output filtering for sensitive data patterns (PII, credentials)
  • Topic boundaries — keep the AI focused on its intended domain
  • Toxicity and bias detection

API Design

Expose your AI capabilities through a clean, versioned REST API. Match the OpenAI API format where practical — your teams likely already have integrations built for it, and both vLLM and SGLang provide this compatibility out of the box.

Step 6: Implement Security

AI security requires controls beyond traditional application security. See our enterprise AI security checklist for a comprehensive framework.

Priority controls for day one:

  1. Network isolation. Place AI infrastructure in a dedicated VLAN with strict firewall rules. No outbound internet access unless explicitly required.
  2. Encryption. TLS 1.3 for all API traffic. AES-256 for model weights and vector databases at rest.
  3. Access controls. Integrate with your existing IAM. Role-based access to different models and data sources.
  4. Audit logging. Log every query, response, and retrieval operation with user identity and timestamp.
  5. GPU memory management. Clear GPU memory between sessions to prevent data leakage through memory residuals.

Step 7: Monitor and Operate

Production AI is not "deploy and forget." Continuous monitoring ensures quality and catches issues early.

Key Metrics

| Category | Metrics | Tools |
|---|---|---|
| Inference | Tokens/sec, latency (TTFT, ITL), queue depth, batch size | Prometheus + Grafana (built into vLLM/SGLang) |
| GPU | Utilization, memory usage, temperature, power draw | NVIDIA DCGM + Prometheus |
| Quality | Response relevance, faithfulness, hallucination rate | Custom evaluation framework |
| System | API error rate, uptime, request volume | Standard APM tooling |

Building a Monitoring Dashboard

A production Grafana dashboard for LLM inference should include these panels:

Inference Performance:

  • Time to First Token (TTFT) — p50, p95, p99 — target: <500ms for interactive use cases
  • Inter-Token Latency (ITL) — target: <30ms for streaming responses
  • Total generation throughput (tokens/second across all replicas)
  • Request queue depth — sustained depth >0 indicates capacity pressure

GPU Health:

  • Per-GPU utilization percentage — sustained <70% suggests over-provisioning; sustained >95% suggests under-provisioning
  • GPU memory usage vs. allocated — watch for memory leaks in long-running sessions
  • GPU temperature — throttling typically begins above 83°C
  • Power consumption — useful for cost tracking and capacity planning

System Health:

  • Request success rate (target: >99.9%)
  • Error breakdown by type (OOM, timeout, model error, client error)
  • Active concurrent requests vs. maximum batch size
  • Model load time — important for tracking cold starts after restarts

Both vLLM and SGLang expose Prometheus-compatible /metrics endpoints out of the box. NVIDIA DCGM (Data Center GPU Manager) provides GPU-level telemetry. Combine both data sources in Grafana for a unified operational view.

Alerting Rules

Set up alerts for conditions that require immediate attention:

  • TTFT p95 exceeds 2 seconds for more than 5 minutes (capacity or model issue)
  • GPU temperature exceeds 85°C (cooling or hardware issue)
  • Request error rate exceeds 1% over a 5-minute window
  • GPU memory usage exceeds 95% (risk of OOM crashes)
  • Queue depth sustained above 10 for more than 10 minutes (need to scale)
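To make the TTFT rule concrete, here is the p95 arithmetic evaluated in code. In production this logic lives in Prometheus alerting rules; the nearest-rank percentile below is one common convention, and the 2-second threshold mirrors the rule above:

```python
# Sketch: evaluate the "TTFT p95 > 2s" alert over a window of samples.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def ttft_alert(ttft_seconds: list[float], threshold_s: float = 2.0) -> bool:
    """True when the 95th-percentile time-to-first-token breaches the SLA."""
    return percentile(ttft_seconds, 95) > threshold_s
```

Note that p95 alerts only fire when more than 5% of requests are slow — a handful of slow outliers in a large window will not trigger it.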

Operational Procedures

  • Model updates. Establish a process for evaluating and deploying new model versions. Test against your evaluation suite before promoting to production. Use blue-green or canary deployment strategies — route 10% of traffic to the new model, compare quality metrics, then promote or roll back.
  • Scaling. Monitor queue depth and latency. When sustained latency exceeds your SLA, add GPU replicas or upgrade hardware.
  • Drift detection. Track response quality over time. Degradation may indicate data drift in your RAG pipeline or model degradation.
  • Backup. Maintain copies of model weights, vector database snapshots, and configuration. Test recovery procedures quarterly.

Architecture Summary

A complete on-premises LLM deployment follows a seven-layer architecture:

| Layer | Components | Key Decision |
|---|---|---|
| 7. User Interface | Chat UI, API clients, workflow integrations | Build vs. use Open WebUI |
| 6. Application Services | RAG pipeline, agents, guardrails | Vector DB selection (Milvus, Qdrant, pgvector) |
| 5. API Gateway | Routing, auth, rate limiting | NGINX, Traefik, or Kong |
| 4. Inference Engine | Model serving, batching, memory management | vLLM or SGLang |
| 3. Model Management | Registry, quantization, version control | MLflow or custom |
| 2. Platform | OS, containers, Kubernetes, GPU drivers | K8s for multi-model, Docker Compose for single |
| 1. Hardware | GPU servers, networking, storage | H100 for production, A100 for cost-conscious |

Common Mistakes to Avoid

  1. Over-provisioning hardware. Start with the minimum configuration for your throughput needs. Scale when metrics demand it, not before.
  2. Skipping quantization. FP8 and INT4 quantization cuts VRAM by 50–75% with 1–3% quality loss. Always evaluate quantized models first.
  3. Ignoring the RAG pipeline. A raw LLM without access to your data has limited enterprise value. Budget time for RAG development.
  4. No evaluation framework. Without systematic testing, you cannot measure whether changes improve or degrade quality. Build evaluation into your process from day one.
  5. Treating it as a one-time project. AI systems require ongoing monitoring, model updates, and optimization. Budget for operations, not just deployment.

Troubleshooting Common Issues

Out of Memory (OOM) Crashes

Symptom: Inference server crashes or restarts under load; logs show CUDA OOM errors.

Causes and fixes:

  • KV cache overflow. The KV cache grows with concurrent requests and sequence length. Reduce --max-num-seqs (vLLM) or --max-total-tokens (SGLang) to cap concurrent request memory usage.
  • Model too large for available VRAM. Apply quantization (FP8 or INT4) to reduce memory footprint. A 70B FP16 model requires ~140 GB; in FP8, it fits in ~70 GB.
  • Memory fragmentation. vLLM's PagedAttention reduces fragmentation, but very long sequences (32K+ tokens) can still cause issues. Set a maximum context length that matches your actual needs, not the model's maximum.

Slow Time to First Token (TTFT)

Symptom: Users experience multi-second delays before the first token appears.

Causes and fixes:

  • Model loading from disk. Ensure model weights are on NVMe storage, not HDD. Pre-load models at pod startup rather than on first request.
  • Insufficient tensor parallelism. For 70B+ models, split across 2+ GPUs to parallelize the prefill computation.
  • Long input prompts. Prefill time scales with input length. If TTFT is slow only for long prompts, this is expected behavior — optimize by reducing context length or using shorter system prompts.

Inconsistent Response Quality

Symptom: Model outputs vary in quality or relevance compared to cloud API baseline.

Causes and fixes:

  • Quantization artifacts. Compare outputs at FP16 vs. your quantized format. If quality degrades noticeably, try FP8 instead of INT4, or use a higher-quality quantization method (AWQ over GPTQ).
  • Temperature and sampling mismatch. Ensure your inference engine's default sampling parameters match your baseline. Different defaults for temperature, top_p, and repetition_penalty can produce dramatically different outputs.
  • Missing system prompt. If you were using a cloud API's built-in system prompt behavior, ensure your self-hosted setup replicates it.

Cost Estimation Worksheet

Use this framework to estimate your total cost of ownership for a self-hosted LLM deployment:

Hardware Costs (One-Time)

| Component | Budget Option | Production Option | Enterprise Option |
|---|---|---|---|
| GPU Server | 1x A100 80GB ($12K) | 2x A100 80GB ($22K) | 2x H100 SXM ($60K) |
| Networking | 10 GbE ($500) | 25 GbE ($2K) | 100 GbE InfiniBand ($8K) |
| Storage | 2TB NVMe ($200) | 4TB NVMe ($400) | 8TB NVMe RAID ($1.5K) |
| Total Hardware | ~$13K | ~$25K | ~$70K |

Ongoing Costs (Monthly)

| Category | Estimate | Notes |
|---|---|---|
| Power | $150–$500 | 1–4 GPUs at $0.10–$0.15/kWh, ~700W per H100 |
| Cooling | $50–$200 | Proportional to power; less in cooler climates |
| Operations | $0–$5,000 | Depends on team structure; can be absorbed by existing infra team |
| Software | $0 | vLLM, SGLang, and open-source models are free |
| Total Monthly | $200–$5,700 | |

Break-Even vs. Cloud APIs

| Usage Level | Monthly Cloud Cost | Monthly Self-Hosted | Break-Even |
|---|---|---|---|
| Light (1K queries/day) | $3K–$8K | $200–$500 | 2–5 months |
| Medium (5K queries/day) | $15K–$40K | $500–$2K | 1–2 months |
| Heavy (25K queries/day) | $75K–$200K | $2K–$6K | 2–4 weeks |

The calculus is clear: at any meaningful enterprise usage level, self-hosted inference pays for itself within months.
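The break-even arithmetic is simply hardware cost divided by monthly savings. A worked version, using illustrative midpoints from the tables above rather than quoted prices:

```python
# Worked break-even calculation: months until cumulative cloud spend
# exceeds the one-time hardware cost plus self-hosted opex.
def breakeven_months(hardware_cost: float, monthly_self: float,
                     monthly_cloud: float) -> float:
    savings = monthly_cloud - monthly_self
    if savings <= 0:
        return float("inf")  # self-hosting never pays back at this usage
    return round(hardware_cost / savings, 1)

# Example (assumed midpoints): $25K production hardware, ~$1K/month opex,
# ~$25K/month equivalent cloud spend at medium usage.
```

At the medium-usage midpoints this yields roughly one month; even the budget configuration against light cloud usage pays back within a few months.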

Getting Started

The fastest path from zero to production:

  1. Week 1–2: Procure hardware (or provision existing GPU resources). Select model and inference engine.
  2. Week 3–4: Deploy inference engine with selected model. Set up API gateway, authentication, and basic monitoring.
  3. Week 5–6: Build RAG pipeline if needed. Connect to initial data sources. Implement security controls.
  4. Week 7–8: Integration testing, load testing, security hardening. Deploy to pilot users.
  5. Week 9+: Gather feedback, iterate on quality, expand to additional use cases.

Self-hosted LLM deployment is a solved problem in 2026. The open-source ecosystem — models, inference engines, vector databases, and orchestration tools — provides every component needed for production-grade sovereign AI. The primary barrier is not technology but expertise in assembling and operating the stack.


Want help deploying an LLM on your infrastructure? Let's talk.