Agent Infrastructure

The systems that make inference economical at scale. A model that works in a notebook will not work in production without this layer: serving, caching, routing, and observability each carry their own tradeoffs.

The Production Decision

Infrastructure choices are primarily cost and reliability decisions:

  • Managed API or self-hosted inference — who owns the SLA?
  • Do you have repeated context that justifies caching?
  • Does your workload have enough model diversity to benefit from routing?
  • Can you actually observe what your agents are doing and what they cost?

Model Serving

Turning a model file (or API key) into a production inference endpoint.

Managed Inference

Pay per token, zero ops burden. The right default for most products:

| Provider | Strengths | Weaknesses |
|---|---|---|
| OpenAI | Largest ecosystem, fast model/tool iteration, strong enterprise controls | Premium frontier pricing; regional processing depends on project eligibility and region support |
| Anthropic | Strong instruction following, explicit prompt caching, long-context options | Smaller model selection; some long-context options are tier-gated or beta |
| Google Vertex | GCP integration, strong multimodal and long-context options | Product surface split across Vertex and Gemini API; model-specific quotas vary |
| Together AI | Open-weight model catalog | Less predictable availability |
| Fireworks AI | Fast open-weight inference | Smaller model selection |
| Groq | Extremely fast TTFT (LPU hardware) | Limited models, no fine-tuning |

Self-Hosted Inference

Own the hardware, own the SLA. Worth considering when:

  • You process >10M tokens/day and managed costs are prohibitive
  • Data sovereignty or compliance requires on-premise
  • You need fine-tuned models managed APIs do not support

| Stack | Best for |
|---|---|
| vLLM | General-purpose serving, strong throughput, PagedAttention |
| SGLang | High concurrency with structured generation |
| TGI (Text Generation Inference) | Hugging Face ecosystem, quick setup |
| TensorRT-LLM | Maximum GPU throughput on NVIDIA hardware |
| llama.cpp | CPU/Mac inference, GGUF models |

LLM Serving →

Prompt Caching

Reusing the KV cache from previous requests. The highest-leverage infrastructure optimization for agents with repeated context.

When it applies

  • System prompts — Same instructions sent with every request. Cache them.
  • Few-shot examples — Static examples in every prompt. Cache them.
  • RAG context — Documents retrieved for a query. Cache if the retrieved docs are stable across requests.
  • Long conversation history — As history grows, older turns can be cached.

Savings by provider

Current as of May 2026. Verify cache pricing and availability against provider docs before committing to a platform.

| Provider | Cache pricing | Min cacheable tokens |
|---|---|---|
| Anthropic | Write: 1.25× input; Read: 0.1× input | 1,024 |
| OpenAI | Automatic; cached input pricing is model-specific (for example, 0.1× on GPT-5.5 and 0.25× on GPT-4.1) | 1,024 |
| Google | Implicit caching on supported models; explicit caching adds storage cost | Model-specific; commonly 1,024-4,096 |
| DeepSeek | Read: 0.1× input | 64 |

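The break-even on Anthropic caching: with a 1.25× write and 0.1× reads, caching costs less once a prefix is sent about 1.3 times in total, so a single reuse within the cache lifetime already saves money. For production agents with stable system prompts, the first request pays a 25% premium on the prefix and every later cache hit costs 90% less.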
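The arithmetic behind that break-even can be checked with a short sketch. It uses the Anthropic multipliers from the table above (write 1.25×, read 0.1×), measures cost in units of "one uncached send of the prefix", and assumes every request arrives while the cache entry is still alive. The function names are illustrative, not part of any SDK.

```python
def cached_cost(n_requests: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Relative prefix cost with caching: one cache write, then cache reads."""
    if n_requests <= 0:
        return 0.0
    return write_mult + read_mult * (n_requests - 1)


def uncached_cost(n_requests: int) -> float:
    """Relative prefix cost without caching: full input price every time."""
    return float(max(n_requests, 0))


def break_even_requests(write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Solve write + read * (n - 1) = n for n (total number of requests)."""
    return (write_mult - read_mult) / (1.0 - read_mult)


if __name__ == "__main__":
    print(f"break-even at about {break_even_requests():.2f} total requests")
    for n in (1, 2, 10):
        print(n, uncached_cost(n), round(cached_cost(n), 2))
```

With these multipliers, two requests cost 1.35 units cached versus 2.0 uncached, and ten requests cost about 2.15 versus 10. If requests are spaced further apart than the cache lifetime, each one pays the write price again and the saving disappears.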

Prompt Caching →

Routing

Directing each request to the right model based on task characteristics. Done well, routing can cut costs by 40–70% without hurting quality on the simple requests that make up most traffic.

Routing strategies

Complexity-based — Classify requests as simple or complex before calling an LLM. Simple → fast cheap model (DeepSeek V4-Flash, Gemini Flash). Complex → frontier model.

Cost-capped — Set a per-request token budget. If the request fits, use a cheap model. If not, escalate.

Cascade — Call a cheap model first. If it returns a low-confidence response, re-run with a stronger model.

Capability-based — Route by what the task requires: vision → multimodal model, code → code-optimized model, reasoning → reasoning model.
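A minimal sketch of how these strategies might combine, assuming hypothetical model callables that return a reply with a confidence score in [0, 1]. The keyword list, the character threshold (a rough stand-in for a token budget), and the tier names are all illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    confidence: float  # hypothetical score in [0, 1] reported by the model wrapper


ModelFn = Callable[[str], Reply]

# Assumed keywords that hint at a complex request; tune these on your own traffic.
COMPLEX_HINTS = ("prove", "refactor", "step by step", "analyze")


def pick_tier(prompt: str, has_image: bool = False, max_cheap_chars: int = 2000) -> str:
    """Capability check first, then a cheap size and keyword heuristic."""
    if has_image:
        return "multimodal"  # capability-based
    lowered = prompt.lower()
    if len(prompt) > max_cheap_chars:
        return "frontier"  # cost-capped: too large for the cheap budget
    if any(hint in lowered for hint in COMPLEX_HINTS):
        return "frontier"  # complexity-based
    return "cheap"


def cascade(prompt: str, cheap: ModelFn, strong: ModelFn,
            min_confidence: float = 0.7) -> tuple[str, Reply]:
    """Call the cheap model first; escalate only on a low-confidence reply."""
    reply = cheap(prompt)
    if reply.confidence >= min_confidence:
        return "cheap", reply
    return "strong", strong(prompt)


# Stub models for illustration only.
def cheap_stub(prompt: str) -> Reply:
    return Reply("short answer", 0.9 if len(prompt) < 40 else 0.3)


def strong_stub(prompt: str) -> Reply:
    return Reply("careful answer", 0.95)


if __name__ == "__main__":
    print(pick_tier("What is the capital of France?"))
    print(pick_tier("Prove that sqrt(2) is irrational"))
    print(cascade("hi", cheap_stub, strong_stub)[0])
```

The heuristic runs in microseconds, so it adds no meaningful latency before inference. A cascade is different: an escalated request pays for both calls, which is worth tracking separately.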

What to measure

Route quality degrades silently. You need evals on the routed distribution, not just the full-model baseline. If routing sends 70% of traffic to a cheap model and that model fails on 10% of those cases, 7% of all requests are failing, concentrated in one route. An aggregate metric blends that with the other routes and hides where the failures come from.
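One way to make this visible is to report failure rates per route next to the aggregate. The sketch below assumes eval results arrive as `(route, passed)` pairs; the 1% frontier failure rate in the sample data is made up for illustration.

```python
from collections import defaultdict
from typing import Iterable


def failure_rates(results: Iterable[tuple[str, bool]]) -> tuple[dict[str, float], float]:
    """Return (failure rate per route, overall failure rate)."""
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for route, passed in results:
        totals[route] += 1
        if not passed:
            fails[route] += 1
    per_route = {route: fails[route] / totals[route] for route in totals}
    overall = sum(fails.values()) / sum(totals.values())
    return per_route, overall


if __name__ == "__main__":
    # 1,000 eval cases: 700 routed to the cheap model (70 fail),
    # 300 routed to the frontier model (3 fail, an assumed figure).
    sample = [("cheap", i >= 70) for i in range(700)]
    sample += [("frontier", i >= 3) for i in range(300)]
    per_route, overall = failure_rates(sample)
    print(per_route, overall)
```

On this sample the overall failure rate is 7.3%, while the cheap route alone sits at 10% and the frontier route at 1%. Only the per-route view shows which model to fix.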

Evaluations → | Latency →

Observability

You cannot optimize what you cannot see. At scale, inference is a cost center with hidden structure — the same agent call can vary 10× in cost depending on cache state, model selection, and context length.

What to instrument

| Signal | Why it matters |
|---|---|
| Tokens in / tokens out per call | Primary cost driver |
| Cache hit rate | Measures caching effectiveness |
| Time to first token (TTFT) | Perceived latency for streaming |
| Total generation time | Batch job throughput |
| Model used per request | Routing effectiveness |
| Error rate by model | Reliability comparison |
| Cost per user session | Unit economics signal |
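Several of these signals combine into a cost figure. The sketch below estimates cost per call and per session from logged token counts. The model names and prices are hypothetical placeholders (USD per million tokens), and it assumes `input_tokens` includes the cached portion.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES = {
    "cheap-model": {"input": 0.30, "cached_input": 0.03, "output": 1.20},
    "frontier-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}


def call_cost(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call; cached tokens are billed at the cached rate."""
    price = PRICES[model]
    uncached = input_tokens - cached_tokens
    total = (
        uncached * price["input"]
        + cached_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


def cost_per_session(calls: list[dict]) -> dict[str, float]:
    """Sum estimated cost by session_id over a list of logged calls."""
    totals: dict[str, float] = defaultdict(float)
    for c in calls:
        totals[c["session_id"]] += call_cost(
            c["model"], c["input_tokens"], c["cached_tokens"], c["output_tokens"]
        )
    return dict(totals)


if __name__ == "__main__":
    cold = call_cost("frontier-model", 10_000, 0, 0)
    warm = call_cost("frontier-model", 10_000, 10_000, 0)
    print(cold, warm)
```

With these placeholder prices, the same 10,000-token input costs ten times less when it is fully served from cache, which is the kind of spread that only shows up if cache state is logged per call.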

Tooling

  • LangSmith — Tracing for LangChain-based agents; native span tracking
  • Langfuse — Open-source, self-hostable, model-agnostic
  • Helicone — Proxy-based; works with any OpenAI-compatible API
  • Braintrust — Combines tracing and evals in one platform
  • Custom span logging — For non-framework agents, structured JSON logs with trace IDs are often enough

The minimum viable setup: log every LLM call with model, token counts, latency, and a session ID. That alone tells you 80% of what you need to know.
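A minimal sketch of that setup using only the standard library. `call_model` is a hypothetical client function assumed to return a dict with `text`, `input_tokens`, and `output_tokens`; adapt the field names to whatever your provider SDK returns.

```python
import json
import logging
import time
import uuid
from typing import Callable

logger = logging.getLogger("llm_calls")


def logged_call(
    call_model: Callable[[str], dict],
    model: str,
    prompt: str,
    session_id: str,
    emit: Callable[[str], None] = logger.info,
) -> dict:
    """Run one model call and emit a single JSON log line describing it."""
    start = time.perf_counter()
    record = {"trace_id": uuid.uuid4().hex, "session_id": session_id, "model": model}
    try:
        result = call_model(prompt)
        record.update(
            status="ok",
            input_tokens=result["input_tokens"],
            output_tokens=result["output_tokens"],
        )
        return result
    except Exception as exc:
        # Record the failure, then let the caller decide how to handle it.
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        # Runs on both success and failure, so every call produces a log line.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        emit(json.dumps(record))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    fake = lambda p: {"text": "hi", "input_tokens": 12, "output_tokens": 3}
    logged_call(fake, "example-model", "Say hi", session_id="session-1")
```

One JSON object per line keeps the logs queryable with ordinary tools, and the `emit` parameter lets the same wrapper write to a file, a queue, or a tracing backend later.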

Rate Limits & Concurrency →

Production Reality

Managed APIs have outages. OpenAI, Anthropic, and Google have all had multi-hour incidents. If your product depends on a single provider, design a fallback model or a graceful degradation path. The fallback does not need to be equivalent — it just needs to keep the product functional.
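A fallback path can be as simple as trying providers in order and returning a canned reply when all of them fail. This is a sketch under the assumption that each provider is wrapped in a hypothetical callable that takes a prompt and returns text; real clients would also need timeouts and more selective error handling.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


def call_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    degraded_reply: str = "The assistant is temporarily unavailable. Please try again shortly.",
) -> tuple[str, str]:
    """Try each provider in order; return (provider name, reply text)."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # Broad catch for the sketch; in practice, catch the client's
            # timeout, rate-limit, and server-error exceptions specifically.
            continue
    return "degraded", degraded_reply


# Stub providers for illustration only.
def primary_down(prompt: str) -> str:
    raise ConnectionError("primary provider unavailable")


def secondary_ok(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


if __name__ == "__main__":
    print(call_with_fallback("hello", [("primary", primary_down), ("secondary", secondary_ok)]))
    print(call_with_fallback("hello", [("primary", primary_down)]))
```

Returning the provider name alongside the reply makes fallback usage visible in logs, so a quiet primary outage does not go unnoticed.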

Caching requires stable prefixes. If your system prompt includes timestamps, user IDs, or any dynamic content before the cacheable block, you will never get a cache hit. Structure prompts so all static content comes first, dynamic content last.
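The ordering rule can be sketched as a prompt builder that keeps all static content in one prefix and appends per-request values afterwards. The function and field names here are illustrative; the fingerprint is only a local check that the prefix stays byte-identical across requests, not how any provider computes cache keys.

```python
import hashlib


def build_prompt(static_instructions: str, few_shot: str,
                 user_id: str, timestamp: str, question: str) -> tuple[str, str]:
    """Return (cacheable_prefix, dynamic_suffix) with static content first."""
    prefix = static_instructions + "\n\n" + few_shot
    suffix = f"\n\nUser: {user_id}\nTime: {timestamp}\nQuestion: {question}"
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash used to confirm the prefix does not change between requests."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    instructions = "You are a support agent. Follow the refund policy."
    examples = "Q: Where is my order?\nA: Let me check the tracking status."
    p1, s1 = build_prompt(instructions, examples, "user-1", "2026-05-01T10:00:00Z", "Can I get a refund?")
    p2, s2 = build_prompt(instructions, examples, "user-2", "2026-05-01T10:05:00Z", "Where is my invoice?")
    # Same fingerprint for both requests: the prefix is eligible for cache reuse.
    print(prefix_fingerprint(p1) == prefix_fingerprint(p2))
    print(p1 + s1 == p2 + s2)
```

Logging the fingerprint with each call is a cheap way to catch an accidental change, such as a timestamp slipped into the system prompt, before it shows up as a drop in cache hit rate.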

Routing adds latency at the classification step. A classifier that takes 200ms to route a request has already spent 200ms before inference starts. Use lightweight classifiers — small local models, keyword heuristics, or routing by request metadata — rather than full LLM calls.

Observability is non-negotiable for billing disputes. At scale, provider invoices will occasionally be wrong or surprising. Your own token logs are the only way to reconcile them.

Released under the MIT License.