LLM Serving
How to turn a model into a reliable API endpoint. The right choice depends on whether you need control, cost optimization, data privacy, or just the fastest path to production.
The Decision
| Situation | Recommendation |
|---|---|
| Fastest path to production | Managed API (OpenAI, Anthropic, Google) |
| Cost optimization at high volume | Self-hosted open models |
| Data residency / privacy requirements | Self-hosted or VPC-deployed managed API |
| Maximum inference speed | Groq (LPU) or Cerebras |
| Broadest open model selection | Together AI or Fireworks AI |
| Local development / no network | Ollama + llama.cpp |
Managed API Providers
No infrastructure to operate. You pay per token.
Foundation Model APIs
Current as of May 2026. Provider strengths and price/performance positioning shift quickly; verify the exact model lineup, pricing mode, and enterprise controls before choosing a serving provider.
| Provider | Models | Strength |
|---|---|---|
| Anthropic | Claude Haiku 4.5, Sonnet 4.6, Opus 4.7 | Best instruction following, long context, tool use |
| OpenAI | GPT-5.5, o3, o4-mini | Widest ecosystem, structured output, vision |
| Gemini 3.1 Pro, Gemini 2.5 Flash | Largest context window (1M), multimodal | |
| DeepSeek | DeepSeek V4-Pro, V4-Flash | Aggressive pricing, strong coding, OpenAI/Anthropic-compatible API surface |
| Mistral | Mistral Large 3, Mistral Small | European provider, competitive pricing |
Inference-Speed Providers
These run open-source models but optimize for throughput and latency using custom hardware or kernels.
Groq — Language Processing Unit (LPU) hardware. 200–800 tokens/second on LLaMA/Gemma/Llama 4 models. The fastest option for latency-sensitive workloads. Limited model selection; no fine-tuned model support.
Cerebras — Wafer Scale Engine silicon. ~2,000 tokens/second on supported models. The absolute fastest option. Very limited model catalog; no fine-tuning; capacity-constrained.
Together AI — 200+ open-source models, fine-tuning API, dedicated GPU instances. Good for teams needing flexibility across many models or custom fine-tuned variants.
Fireworks AI — Custom CUDA kernels (FireAttention). Fastest structured output / JSON mode in the market. Good for production apps relying heavily on function calling and constrained generation.
Self-Hosted Serving
You run the model on your own GPU infrastructure. Higher operational overhead; lower per-token cost at scale; full control over data.
Production Server Frameworks
vLLM — the production standard for high-throughput serving.
- PagedAttention: near-zero KV cache waste, 2–4× better GPU utilization vs. naïve serving
- OpenAI-compatible REST API (drop-in replacement)
- Tensor parallelism for multi-GPU; speculative decoding; continuous batching
- Best for: GPU clusters, multi-tenant serving, high-concurrency deployments
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768SGLang — highest throughput for structured generation and shared-prefix workloads.
- RadixAttention: caches and reuses KV for shared prompt prefixes across requests
- Native constrained decoding (JSON schema, regex)
- Best for: RAG systems with fixed retrieval context, few-shot workloads, structured output pipelines
TGI (Text Generation Inference) — Hugging Face's production server.
- Deep HuggingFace Hub integration; pull and serve any Hub model
- Continuous batching, tensor parallelism, GPTQ/AWQ support
- Best for: Teams already on HuggingFace ecosystem; fast path from Hub model to API
TensorRT-LLM — NVIDIA's maximum-performance serving stack.
- Ahead-of-time TensorRT compilation: maximum GPU utilization
- FP8/INT4 kernels; in-flight batching
- Best for: Enterprise deployments requiring maximum throughput on NVIDIA H100/A100 fleets
- Tradeoff: complex build pipeline; slow iteration; NVIDIA-only
Local / Edge Serving
llama.cpp — runs GGUF-quantized models on CPU, GPU, or Apple Silicon.
- Minimal dependencies, universal hardware support
- Best for: Local development, edge/embedded, research prototyping on consumer hardware
- Tradeoff: lower throughput than CUDA-optimized servers at scale
Ollama — wraps llama.cpp with model management and OpenAI-compatible API.
ollama pull llama3/ollama run llama3— Docker-like model management- Best for: Developer local dev, desktop apps, Mac/Linux personal use
- Tradeoff: inherits llama.cpp throughput ceilings; limited multi-user concurrency
mlx-lm — Apple's MLX framework for Apple Silicon inference.
- Uses unified memory (CPU+GPU shared); optimized for M-series chips
- Best for: Mac developers running models locally without CUDA
Key Metrics
Throughput — total tokens generated per second across all concurrent requests. The metric that matters for batch workloads and shared-serve efficiency.
TTFT (Time to First Token) — delay from request to first output token. The metric that matters for interactive, streaming applications.
TPOT (Time Per Output Token) — milliseconds per token after the first. Determines how fast a response streams.
GPU utilization — idle GPU is expensive GPU. vLLM's PagedAttention and continuous batching exist specifically to keep utilization high across variable-length requests.
See Latency for the full breakdown.
Model Context Protocol (MCP)
Not a serving framework — a protocol for connecting models to tools and data sources. An MCP server exposes tools (functions), resources (data), and prompts; an MCP client (your agent or IDE) connects to it.
Introduced by Anthropic in November 2024. As of May 2026, MCP is the main interoperability direction for tool integration, but support depth still varies by client:
- supported in first-party or visible form across Claude and the OpenAI Agents SDK
- visible in major editor and agent clients such as VS Code Agent mode and Cursor
- broad community server ecosystem for GitHub, Slack, Postgres, filesystem, browser control, and more
For self-hosted serving, you can expose your model as an MCP-compatible server or connect it to MCP tool servers — decoupling model serving from tool integration.
Build vs. Buy Decision
| Factor | Managed API | Self-hosted |
|---|---|---|
| Setup time | Hours | Days–weeks |
| Ops burden | None | Significant |
| Cost at 1B tokens/month | Higher | Lower |
| Latency | Provider-dependent | Tunable |
| Data privacy | Depends on provider | Full control |
| Model customization | Fine-tuning API only | Full control |
| Scaling | Automatic | Manual |
The break-even point between managed API and self-hosted depends heavily on GPU costs in your region and your traffic profile. At fewer than ~100M tokens/month, managed APIs are almost always cheaper when you factor in engineering and operations time.
Production Reality
Managed APIs have gotten significantly cheaper — the cost argument for self-hosting has narrowed. But do not anchor on a temporary promo price when making a build-vs-buy decision. Use full-rate pricing and expected request shape, not launch discounts.
GPU availability is a constraint — H100s and A100s are still capacity-constrained. Self-hosting planning that assumes on-demand GPU availability will be disappointed.
Cold start latency on serverless — if you deploy a self-hosted model on serverless GPU infrastructure (Modal, Replicate, RunPod serverless), the first request after idle triggers a model load. This can be 30–120 seconds. Keep the model warm with periodic requests or use reserved capacity for latency-sensitive workloads.
Benchmark your serving stack — published throughput numbers for vLLM, SGLang, and TGI are measured under specific conditions (batch size, sequence length, model size, GPU type) that may not match your workload. Benchmark under your actual traffic shape before choosing a framework.
Related Topics
- KV Cache & Quantization — for the memory constraints that drive serving design
- Latency — for the user-facing effect of serving stack choices
- Rate Limits & Concurrency — for the throughput ceilings you still hit even with a fast serving layer