Tilkal Team · Updated · 16 min read

How to Deploy an LLM On-Premises: A Step-by-Step Guide

A practical guide to deploying large language models on your own infrastructure. Covers hardware selection, model choice, inference engines, orchestration, and production operations.

LLM Deployment · On-Premises AI · Infrastructure · Tutorial · Sovereign AI

Key Takeaways:

  • A single NVIDIA H100 GPU ($25,000–$40,000) can serve Llama 3 70B (FP8) at roughly 12,500 tokens per second of aggregate throughput
  • vLLM and SGLang are the leading open-source inference engines, with SGLang achieving up to 29% higher throughput in some benchmarks
  • Production deployments require seven layers: hardware, platform, inference engine, model management, API gateway, application services, and user interface
  • Quantization techniques (FP8, INT4, AWQ) can reduce VRAM requirements by 50–75% with minimal quality loss
  • The complete stack can be deployed in 4–8 weeks with the right expertise

Why Deploy Your Own LLM?

Cloud AI APIs are the fastest way to prototype. But at production scale, self-hosted LLMs deliver dramatically lower costs, complete data sovereignty, and the freedom to customize models for your specific domain.

According to Lenovo's 2026 TCO analysis, self-hosted inference can be up to 18x cheaper than cloud APIs over three years. Beyond economics, sovereign AI deployment eliminates third-party data exposure — a requirement for regulated industries and a strategic advantage for everyone else.

This guide walks through every step of deploying an LLM on your own infrastructure, from hardware selection through production operations.

Step 1: Select Your Hardware

GPU selection is the foundation of your deployment. The right choice depends on model size, throughput requirements, and budget.

GPU Options

| GPU | VRAM | Approx. Cost | Best For |
|---|---|---|---|
| NVIDIA H100 SXM | 80 GB | $25,000–$40,000 | Enterprise production workloads |
| NVIDIA H200 | 141 GB | $30,000–$45,000 | Large models (70B+) without multi-GPU |
| NVIDIA A100 | 80 GB | $10,000–$15,000 | Cost-effective production deployment |
| NVIDIA RTX 4090 | 24 GB | $1,600–$2,000 | Small models, dev/test, POC |
| NVIDIA RTX 5090 | 32 GB | $2,000–$2,500 | Small-to-medium models, dev/test |
| AMD MI300X | 192 GB | $10,000–$15,000 | Large models, AMD alternative |

VRAM Requirements by Model Size

VRAM determines which models you can serve. A rough guide for FP16 (full precision):

| Model Size | VRAM Required (FP16) | VRAM with INT4 Quantization | Minimum GPU Config |
|---|---|---|---|
| 7B–8B | ~16 GB | ~4–6 GB | 1x RTX 4090 |
| 13B | ~26 GB | ~8–10 GB | 1x RTX 4090 (INT4) or 1x A100 |
| 34B | ~68 GB | ~20 GB | 1x A100 80GB |
| 70B | ~140 GB | ~40 GB | 2x A100 or 1x H200 |
| 70B (FP8) | ~70 GB | — | 1x H100 80GB |
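The table values follow from simple arithmetic: weights take one byte per parameter per 8 bits of precision, plus headroom for the KV cache and activations. A minimal sketch of that estimate (the 20% overhead factor is a rough assumption, not a sizing tool):

```python
# Back-of-the-envelope VRAM estimate matching the table above:
# weights = params * bytes_per_param, plus ~20% overhead for KV cache
# and activations (an assumed figure; real overhead varies with workload).
def estimate_vram_gb(params_billion: float, bits_per_param: int,
                     overhead: float = 0.20) -> float:
    weight_gb = params_billion * bits_per_param / 8  # 1B params @ 8 bits = 1 GB
    return round(weight_gb * (1 + overhead), 1)

# 70B at FP16: weights alone ~140 GB; ~168 GB with overhead.
# 70B at FP8 halves that; INT4 quarters it.
```

This is why a 70B model needs 2x A100 at FP16 but fits a single H100 at FP8.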

Sizing Recommendations

For most enterprise use cases, start with a single-node, two-GPU configuration (e.g., 2x A100 80GB or 1x H100). This handles 70B-class models with room for growth. Scale horizontally when throughput demands exceed single-node capacity.

Factor in networking (100 Gbps+ for multi-node), storage (NVMe for model weights — loading a 70B model from spinning disk takes minutes vs. seconds), and power (a single H100 draws ~700W under load).

Step 2: Choose Your Model

Open-source models have reached parity with proprietary alternatives for most enterprise tasks. The key families to evaluate:

Recommended Models (February 2026)

| Model Family | Sizes Available | License | Strengths |
|---|---|---|---|
| Llama 3 (Meta) | 8B, 70B, 405B | Llama 3 Community | Best general-purpose; strong reasoning and coding |
| Mistral / Mixtral | 7B, 8x7B, 8x22B | Apache 2.0 | Excellent efficiency; MoE architecture reduces compute |
| Qwen 2.5 (Alibaba) | 0.5B–72B | Apache 2.0 | Strong multilingual; competitive benchmarks |
| DeepSeek | 7B, 67B, MoE | MIT | Competitive performance; cost-efficient MoE |
| Phi-3/4 (Microsoft) | 3.8B, 14B | MIT | Small-model performance; edge deployment |

Model Selection Criteria

  1. Task fit. General-purpose (Llama 3 70B), coding (Code Llama, DeepSeek Coder), or domain-specific?
  2. VRAM budget. Match model size to available GPU memory, accounting for KV cache overhead.
  3. License. Llama 3 Community License has usage restrictions above 700M monthly active users. Apache 2.0 and MIT are fully permissive.
  4. Quantization tolerance. Models quantized to FP8 or INT4 typically lose 1–3% on benchmarks while cutting VRAM requirements by 50–75%. For most enterprise tasks, the tradeoff is worth it.
  5. Language requirements. If you need strong non-English performance, Qwen 2.5 and Llama 3 lead in multilingual capabilities.

Step 3: Select Your Inference Engine

The inference engine serves your model — it manages GPU memory, batches requests, and exposes an API. This is the most critical software decision in your stack.

Engine Comparison

| Engine | Key Feature | Throughput | Best For |
|---|---|---|---|
| vLLM | PagedAttention memory management | ~12,500 tok/s (H100, Llama 70B FP8) | Production deployments; widest model support |
| SGLang | RadixAttention + compiler optimizations | Up to 29% faster than vLLM in some configs | High-throughput production; structured output |
| Ollama | Single-binary simplicity | Moderate | Development, testing, single-user |
| llama.cpp | CPU inference, GGUF format | Varies | Edge, CPU-only, resource-constrained |
| NVIDIA TensorRT-LLM | NVIDIA-optimized kernels | Highest (NVIDIA hardware only) | Maximum throughput on NVIDIA GPUs |

Our Recommendation

For production sovereign AI deployments, vLLM is the most mature and widely deployed option — it powers inference at major cloud providers and enterprises globally. It provides an OpenAI-compatible API out of the box, making integration straightforward.

SGLang is the performance leader in recent benchmarks (up to 29% higher throughput) and is increasingly production-ready. Consider it for workloads where throughput is the primary constraint.

Both engines support:

  • Continuous batching for maximum GPU utilization
  • Multi-GPU tensor parallelism
  • Streaming responses
  • LoRA adapter hot-switching
  • Prometheus metrics for monitoring
  • Quantized model serving (FP8, INT4, AWQ, GPTQ)
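The OpenAI-compatible API means client code needs no special SDK. A minimal sketch using only the standard library — the base URL and model id are assumptions to substitute with your own:

```python
# Sketch: calling a vLLM/SGLang server through its OpenAI-compatible
# /v1/chat/completions endpoint. BASE_URL and MODEL are assumptions
# (vLLM's default port is 8000; use whatever model id your server loaded).
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def build_chat_request(prompt: str, temperature: float = 0.2) -> dict:
    """Assemble an OpenAI-format chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": 512,
    }

def chat(prompt: str) -> str:
    """POST the payload to the inference server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Existing integrations built for the OpenAI API typically only need the base URL changed.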

Step 4: Set Up Your Platform

Production deployments need orchestration, not bare-metal scripts.

Container-Based Deployment

Package your inference engine and model into containers:

  1. Base image. Use NVIDIA's CUDA container images as a foundation. They include GPU drivers and CUDA toolkit.
  2. Model weights. Store model weights on a persistent volume. For air-gapped environments, pre-download and verify checksums against official repositories.
  3. Configuration. Parameterize model name, GPU count, quantization method, and serving parameters as environment variables.

Kubernetes Orchestration

For multi-model or multi-replica deployments, Kubernetes with the NVIDIA GPU Operator provides:

  • GPU scheduling. Kubernetes allocates GPUs to pods automatically via the nvidia.com/gpu resource type.
  • Horizontal scaling. Add replicas as throughput demands increase.
  • Health checks. Restart unhealthy inference pods automatically.
  • Rolling updates. Swap models or engine versions without downtime.
  • Resource isolation. Prevent one model's workload from starving another.

A typical Kubernetes deployment for vLLM includes:

  1. GPU Operator installation. Install the NVIDIA GPU Operator via Helm, which handles driver management, container toolkit, and device plugin lifecycle automatically.
  2. Persistent volume for model weights. Create a PersistentVolumeClaim backed by NVMe storage. A 70B model in FP16 requires approximately 140 GB; in FP8, approximately 70 GB.
  3. Deployment manifest. Define the inference server pod with GPU resource requests (nvidia.com/gpu: 2 for tensor parallelism across 2 GPUs), memory limits, health probes (vLLM exposes /health by default), and environment variables for model configuration.
  4. Service and Ingress. Expose the inference server internally via a ClusterIP Service, then route through an Ingress controller with TLS termination.
  5. Horizontal Pod Autoscaler (optional). For variable workloads, configure HPA based on custom Prometheus metrics like queue depth or p95 latency. GPU-bound workloads scale by adding complete replicas rather than fractional scaling.
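The steps above can be sketched as a single Deployment manifest. This is an illustrative fragment, not a canonical configuration — names, the image tag, the model id, and mount paths are placeholders for your own:

```yaml
# Illustrative vLLM Deployment fragment (placeholders throughout).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm}
  template:
    metadata:
      labels: {app: vllm}
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest        # official vLLM serving image
        args: ["--model", "meta-llama/Meta-Llama-3-70B-Instruct",
               "--tensor-parallel-size", "2"]
        resources:
          limits:
            nvidia.com/gpu: 2                 # tensor parallelism across 2 GPUs
        readinessProbe:
          httpGet: {path: /health, port: 8000}
        volumeMounts:
        - {name: model-cache, mountPath: /root/.cache/huggingface}
      volumes:
      - name: model-cache
        persistentVolumeClaim: {claimName: model-weights-pvc}
```

A Service and Ingress (step 4) then route external traffic to this pod.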

Key Kubernetes considerations for GPU workloads:

  • GPU pods cannot share GPUs across containers by default — each container gets exclusive access to requested GPUs
  • Node affinity rules ensure inference pods land on GPU nodes, not CPU-only nodes
  • Pod disruption budgets limit voluntary evictions (e.g., during node drains or cluster maintenance), maintaining availability
  • Init containers can pre-download model weights from an internal registry to the persistent volume before the inference container starts

For simpler deployments (single model, limited scale), Docker Compose is sufficient and operationally simpler.

Networking

Set up an API gateway (NGINX, Traefik, or Kong) in front of your inference engine to provide:

  • TLS termination
  • Authentication (API keys, OAuth, mTLS)
  • Rate limiting
  • Request routing (if serving multiple models)
  • Load balancing across replicas

Step 5: Build Your Application Layer

The inference engine provides raw model access. The application layer turns it into something useful.

RAG Pipeline

For most enterprise use cases, you will need Retrieval-Augmented Generation (RAG) to ground model responses in your actual data:

  1. Vector database. Deploy Milvus, Qdrant, or pgvector to store document embeddings. For sovereign deployments, all three are fully self-hosted.
  2. Embedding model. Run a local embedding model (BGE, E5, or Nomic) to convert documents to vectors.
  3. Chunking pipeline. Break documents into searchable segments (256–1,024 tokens). Include metadata for filtering.
  4. Retrieval + generation. Wire the vector database search results into your LLM prompt for grounded responses.
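The retrieve-then-generate wiring in step 4 can be sketched in a few lines. This toy version substitutes a bag-of-words vector and cosine similarity for a real embedding model and vector database — every name here is illustrative, and a production system would use BGE/E5 embeddings with Milvus, Qdrant, or pgvector:

```python
# Toy sketch of retrieval-augmented prompting: score chunks against the
# query, then splice the top matches into the LLM prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a word-count vector (real systems use BGE/E5)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Ground the model's answer in retrieved context."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The resulting prompt is what you send to the inference engine's chat endpoint.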

Guardrails and Safety

Production systems need output controls:

  • Input validation and prompt injection defense
  • Output filtering for sensitive data patterns (PII, credentials)
  • Topic boundaries — keep the AI focused on its intended domain
  • Toxicity and bias detection

API Design

Expose your AI capabilities through a clean, versioned REST API. Match the OpenAI API format where practical — your teams likely already have integrations built for it, and both vLLM and SGLang provide this compatibility out of the box.

Step 6: Implement Security

AI security requires controls beyond traditional application security. See our enterprise AI security checklist for a comprehensive framework.

Priority controls for day one:

  1. Network isolation. Place AI infrastructure in a dedicated VLAN with strict firewall rules. No outbound internet access unless explicitly required.
  2. Encryption. TLS 1.3 for all API traffic. AES-256 for model weights and vector databases at rest.
  3. Access controls. Integrate with your existing IAM. Role-based access to different models and data sources.
  4. Audit logging. Log every query, response, and retrieval operation with user identity and timestamp.
  5. GPU memory management. Clear GPU memory between sessions to prevent data leakage through memory residuals.

Step 7: Monitor and Operate

Production AI is not "deploy and forget." Continuous monitoring ensures quality and catches issues early.

Key Metrics

| Category | Metrics | Tools |
|---|---|---|
| Inference | Tokens/sec, latency (TTFT, ITL), queue depth, batch size | Prometheus + Grafana (built into vLLM/SGLang) |
| GPU | Utilization, memory usage, temperature, power draw | NVIDIA DCGM + Prometheus |
| Quality | Response relevance, faithfulness, hallucination rate | Custom evaluation framework |
| System | API error rate, uptime, request volume | Standard APM tooling |

Building a Monitoring Dashboard

A production Grafana dashboard for LLM inference should include these panels:

Inference Performance:

  • Time to First Token (TTFT) — p50, p95, p99 — target: <500ms for interactive use cases
  • Inter-Token Latency (ITL) — target: <30ms for streaming responses
  • Total generation throughput (tokens/second across all replicas)
  • Request queue depth — sustained depth >0 indicates capacity pressure

GPU Health:

  • Per-GPU utilization percentage — sustained <70% suggests over-provisioning; sustained >95% suggests under-provisioning
  • GPU memory usage vs. allocated — watch for memory leaks in long-running sessions
  • GPU temperature — throttling typically begins above 83°C
  • Power consumption — useful for cost tracking and capacity planning

System Health:

  • Request success rate (target: >99.9%)
  • Error breakdown by type (OOM, timeout, model error, client error)
  • Active concurrent requests vs. maximum batch size
  • Model load time — important for tracking cold starts after restarts

Both vLLM and SGLang expose Prometheus-compatible /metrics endpoints out of the box. NVIDIA DCGM (Data Center GPU Manager) provides GPU-level telemetry. Combine both data sources in Grafana for a unified operational view.

Alerting Rules

Set up alerts for conditions that require immediate attention:

  • TTFT p95 exceeds 2 seconds for more than 5 minutes (capacity or model issue)
  • GPU temperature exceeds 85°C (cooling or hardware issue)
  • Request error rate exceeds 1% over a 5-minute window
  • GPU memory usage exceeds 95% (risk of OOM crashes)
  • Queue depth sustained above 10 for more than 10 minutes (need to scale)
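To make the TTFT rule concrete, here is the p95 arithmetic evaluated in code. In production this logic lives in Prometheus alerting rules; the nearest-rank percentile below is one common convention, and the 2-second threshold mirrors the rule above:

```python
# Sketch: evaluate the "TTFT p95 > 2s" alert over a window of samples.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def ttft_alert(ttft_seconds: list[float], threshold_s: float = 2.0) -> bool:
    """True when the 95th-percentile time-to-first-token breaches the SLA."""
    return percentile(ttft_seconds, 95) > threshold_s
```

Note that p95 alerts only fire when more than 5% of requests are slow — a handful of slow outliers in a large window will not trigger it.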

Operational Procedures

  • Model updates. Establish a process for evaluating and deploying new model versions. Test against your evaluation suite before promoting to production. Use blue-green or canary deployment strategies — route 10% of traffic to the new model, compare quality metrics, then promote or roll back.
  • Scaling. Monitor queue depth and latency. When sustained latency exceeds your SLA, add GPU replicas or upgrade hardware.
  • Drift detection. Track response quality over time. Degradation may indicate data drift in your RAG pipeline or model degradation.
  • Backup. Maintain copies of model weights, vector database snapshots, and configuration. Test recovery procedures quarterly.

Architecture Summary

A complete on-premises LLM deployment follows a seven-layer architecture:

| Layer | Components | Key Decision |
|---|---|---|
| 7. User Interface | Chat UI, API clients, workflow integrations | Build vs. use Open WebUI |
| 6. Application Services | RAG pipeline, agents, guardrails | Vector DB selection (Milvus, Qdrant, pgvector) |
| 5. API Gateway | Routing, auth, rate limiting | NGINX, Traefik, or Kong |
| 4. Inference Engine | Model serving, batching, memory management | vLLM or SGLang |
| 3. Model Management | Registry, quantization, version control | MLflow or custom |
| 2. Platform | OS, containers, Kubernetes, GPU drivers | K8s for multi-model, Docker Compose for single |
| 1. Hardware | GPU servers, networking, storage | H100 for production, A100 for cost-conscious |

Common Mistakes to Avoid

  1. Over-provisioning hardware. Start with the minimum configuration for your throughput needs. Scale when metrics demand it, not before.
  2. Skipping quantization. FP8 and INT4 quantization cuts VRAM by 50–75% with 1–3% quality loss. Always evaluate quantized models first.
  3. Ignoring the RAG pipeline. A raw LLM without access to your data has limited enterprise value. Budget time for RAG development.
  4. No evaluation framework. Without systematic testing, you cannot measure whether changes improve or degrade quality. Build evaluation into your process from day one.
  5. Treating it as a one-time project. AI systems require ongoing monitoring, model updates, and optimization. Budget for operations, not just deployment.

Troubleshooting Common Issues

Out of Memory (OOM) Crashes

Symptom: Inference server crashes or restarts under load; logs show CUDA OOM errors.

Causes and fixes:

  • KV cache overflow. The KV cache grows with concurrent requests and sequence length. Reduce --max-num-seqs (vLLM) or --max-total-tokens (SGLang) to cap concurrent request memory usage.
  • Model too large for available VRAM. Apply quantization (FP8 or INT4) to reduce memory footprint. A 70B FP16 model requires ~140 GB; in FP8, it fits in ~70 GB.
  • Memory fragmentation. vLLM's PagedAttention reduces fragmentation, but very long sequences (32K+ tokens) can still cause issues. Set a maximum context length that matches your actual needs, not the model's maximum.

Slow Time to First Token (TTFT)

Symptom: Users experience multi-second delays before the first token appears.

Causes and fixes:

  • Model loading from disk. Ensure model weights are on NVMe storage, not HDD. Pre-load models at pod startup rather than on first request.
  • Insufficient tensor parallelism. For 70B+ models, split across 2+ GPUs to parallelize the prefill computation.
  • Long input prompts. Prefill time scales with input length. If TTFT is slow only for long prompts, this is expected behavior — optimize by reducing context length or using shorter system prompts.

Inconsistent Response Quality

Symptom: Model outputs vary in quality or relevance compared to cloud API baseline.

Causes and fixes:

  • Quantization artifacts. Compare outputs at FP16 vs. your quantized format. If quality degrades noticeably, try FP8 instead of INT4, or use a higher-quality quantization method (AWQ over GPTQ).
  • Temperature and sampling mismatch. Ensure your inference engine's default sampling parameters match your baseline. Different defaults for temperature, top_p, and repetition_penalty can produce dramatically different outputs.
  • Missing system prompt. If you were using a cloud API's built-in system prompt behavior, ensure your self-hosted setup replicates it.

Cost Estimation Worksheet

Use this framework to estimate your total cost of ownership for a self-hosted LLM deployment:

Hardware Costs (One-Time)

| Component | Budget Option | Production Option | Enterprise Option |
|---|---|---|---|
| GPU Server | 1x A100 80GB ($12K) | 2x A100 80GB ($22K) | 2x H100 SXM ($60K) |
| Networking | 10 GbE ($500) | 25 GbE ($2K) | 100 GbE InfiniBand ($8K) |
| Storage | 2TB NVMe ($200) | 4TB NVMe ($400) | 8TB NVMe RAID ($1.5K) |
| Total Hardware | ~$13K | ~$25K | ~$70K |

Ongoing Costs (Monthly)

| Category | Estimate | Notes |
|---|---|---|
| Power | $150–$500 | 1–4 GPUs at $0.10–$0.15/kWh, ~700W per H100 |
| Cooling | $50–$200 | Proportional to power; less in cooler climates |
| Operations | $0–$5,000 | Depends on team structure; can be absorbed by existing infra team |
| Software | $0 | vLLM, SGLang, and open-source models are free |
| Total Monthly | $200–$5,700 | |

Break-Even vs. Cloud APIs

| Usage Level | Monthly Cloud Cost | Monthly Self-Hosted | Break-Even |
|---|---|---|---|
| Light (1K queries/day) | $3K–$8K | $200–$500 | 2–5 months |
| Medium (5K queries/day) | $15K–$40K | $500–$2K | 1–2 months |
| Heavy (25K queries/day) | $75K–$200K | $2K–$6K | 2–4 weeks |

The calculus is clear: at any meaningful enterprise usage level, self-hosted inference pays for itself within months.
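The break-even arithmetic is simply hardware cost divided by monthly savings. A worked version, using illustrative midpoints from the tables above rather than quoted prices:

```python
# Worked break-even calculation: months until cumulative cloud spend
# exceeds the one-time hardware cost plus self-hosted opex.
def breakeven_months(hardware_cost: float, monthly_self: float,
                     monthly_cloud: float) -> float:
    savings = monthly_cloud - monthly_self
    if savings <= 0:
        return float("inf")  # self-hosting never pays back at this usage
    return round(hardware_cost / savings, 1)

# Example (assumed midpoints): $25K production hardware, ~$1K/month opex,
# ~$25K/month equivalent cloud spend at medium usage.
```

At the medium-usage midpoints this yields roughly one month; even the budget configuration against light cloud usage pays back within a few months.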

Getting Started

The fastest path from zero to production:

  1. Week 1–2: Procure hardware (or provision existing GPU resources). Select model and inference engine.
  2. Week 3–4: Deploy inference engine with selected model. Set up API gateway, authentication, and basic monitoring.
  3. Week 5–6: Build RAG pipeline if needed. Connect to initial data sources. Implement security controls.
  4. Week 7–8: Integration testing, load testing, security hardening. Deploy to pilot users.
  5. Week 9+: Gather feedback, iterate on quality, expand to additional use cases.

Self-hosted LLM deployment is a solved problem in 2026. The open-source ecosystem — models, inference engines, vector databases, and orchestration tools — provides every component needed for production-grade sovereign AI. The primary barrier is not technology but expertise in assembling and operating the stack.


Want help deploying an LLM on your infrastructure? Let's talk.