Agent Applications

The user-facing layer of the stack. This is where the engineering choices made in the three layers below become visible — or invisible, when they work. The decisions here are primarily about what you own versus what you buy, and how you maintain quality once users are touching it.

The Production Decision

Application-layer decisions are build vs buy tradeoffs with compounding cost implications:

How much of the runtime, infrastructure, and model selection do you control?
What does quality mean for your use case, and how do you measure it continuously?
What is your latency budget, and where is the UX cliff?
What happens when the agent is wrong — and how often will it be?

Application Categories

Coding Agents

Autonomous code generation, editing, debugging, and test writing. The highest-productivity application type; also the highest-risk for incorrect output.

Key characteristics:

Require strong tool use: file read/write, terminal execution, test runner
Context grows fast — open files, diffs, error traces accumulate quickly
Output can be verified mechanically (does the code run? do tests pass?)
Errors are often self-correctable if the agent can see the failure output

Examples: Claude Code, GitHub Copilot Workspace, Cursor, Devin

Personal AI Assistants

General-purpose agents with memory across sessions, often multimodal. The broadest category; the hardest to evaluate.

Key characteristics:

Long-term memory is expected — users want the agent to remember prior interactions
Multi-turn conversation requires careful context management
Scope is unbounded — users will ask anything
Quality is subjective and hard to measure at scale

Examples: Claude.ai, ChatGPT, Perplexity, Kimi

Support and Domain Bots

Agents scoped to a specific knowledge domain — customer support, internal helpdesks, compliance tools. Narrower scope makes evaluation tractable.

Key characteristics:

Knowledge retrieval is the core capability (RAG)
Hallucination on factual claims is the primary failure mode
Escalation paths are required — the agent must know when to hand off to a human
Latency expectations are tight (users want fast answers)

Research and Synthesis Agents

Agents that gather information autonomously — browsing, reading documents, compiling reports.

Key characteristics:

Require browsing and document ingestion tools
Output quality is hard to verify without domain expertise
Source attribution matters — users need to trust what they read
Hallucinated citations are a catastrophic failure mode

Build vs Buy

The stack layers below can be owned at different depths:

Layer	Buy	Build when
Foundation model	Always (start with API)	Never at application scale
Inference serving	Use managed API	>10M tokens/day, data residency
Prompt caching	Provider handles automatically	Need cross-request cache control
Routing	Use a proxy (LiteLLM, PortKey)	Complex business rules, need full auditability
Agent framework	LangGraph, CrewAI, or Claude SDK	Simple agent, or framework adds no value
Eval harness	Braintrust or Langfuse	Deep integration with internal data pipeline
Memory store	pgvector, Qdrant	Existing DB infra, specific schema requirements

The most common mistake: building infrastructure before you have users. API costs are fixed per token; ops burden is fixed per engineer. Start managed, switch when the numbers force you to.

Quality at the Application Layer

Evals are the only way to ship confidently. At the application layer, this means:

Define what "correct" means before building. For a support bot, is correct "user gets the right answer" or "user does not escalate"? These are different metrics with different instrumentation.

Sample production traffic for eval. Synthetic benchmarks measure what you imagined users would ask. Real user queries expose failure modes you did not anticipate.

Regression test every prompt change. System prompt edits are code changes. They should go through the same review and eval pipeline as code changes.

Score at the output level, not the turn level. An agent that takes five suboptimal steps but produces the right final answer is still a good agent. Optimize for outcomes, not paths.

Evaluations → | Prompt Injection & Security →

Latency at the Application Layer

User experience is directly coupled to latency:

Latency to first token	User perception
< 500ms	Feels instant
500ms–2s	Acceptable for complex tasks
2s–5s	Noticeable; acceptable only with a loading indicator
> 5s	Users abandon or lose trust

Streaming is required for most agent applications. Even if total generation takes 10 seconds, showing the first tokens within 1 second is the difference between usable and frustrating.

For multi-step agents, show progress — "Searching documentation...", "Writing tests..." — so users understand why it takes longer than a single model call.

Latency →

Production Reality

The demo-to-product gap is the eval gap. A demo works because you test it. A product breaks because users do things you did not test. The delta is almost always insufficient eval coverage on real user inputs. Ship evals before you ship features.

Safety and misuse surface at scale. Ten users in beta will not find your prompt injection vulnerability. Ten thousand users will. Red-team your application before launch, not after an incident. Prompt Injection & Security →

Latency expectations are set by the first experience. If your agent takes 8 seconds on the first run, users will accept 8 seconds next time. If it takes 1 second once, they will lose patience at 3 seconds. First-run performance shapes the perception of all subsequent runs.

Agentic actions compound errors. A single-turn LLM that is wrong 5% of the time is usually acceptable. An agent that makes 10 sequential decisions with a 5% error rate per step fails the full task ~40% of the time. Plan for graceful failure, human escalation, and undo paths from the start.

Users do not read the limitations. No matter what you write in the documentation, users will ask your coding agent about their health, ask your support bot for legal advice, and ask your research agent to write their thesis. Design for the gap between intended use and actual use.

Evaluations — for measuring application quality on real user tasks
Latency — for the UX cliff users feel before they read any output
Autonomous Agent Systems — for packaged products you run instead of building from scratch
Prompt Injection & Security — for application-layer safety boundaries

Agent Applications ​

The Production Decision ​

Application Categories ​

Coding Agents ​

Personal AI Assistants ​

Support and Domain Bots ​

Research and Synthesis Agents ​

Build vs Buy ​

Quality at the Application Layer ​

Latency at the Application Layer ​

Production Reality ​

Related Topics ​

Agent Applications

The Production Decision

Application Categories

Coding Agents

Personal AI Assistants

Support and Domain Bots

Research and Synthesis Agents

Build vs Buy

Quality at the Application Layer

Latency at the Application Layer

Production Reality

Related Topics