KV Cache & Quantization

Two independent memory optimization techniques that are often discussed together because they both determine how much VRAM you need — and therefore what you can afford to run and at what context length.

The Decision

If you're using managed APIs, quantization is not your problem — providers handle it. This page matters when:

You're evaluating self-hosted deployment options
You're trying to understand why long context is expensive
You're choosing between model variants (FP16 vs. INT4, etc.)

KV Cache

During autoregressive generation, the model computes attention over every previously generated token. The key-value (KV) tensors from those computations are cached and reused — the "KV cache."

Without it, generating token N would require re-running attention over all N-1 prior tokens from scratch. With the KV cache, only the new token's attention needs to be computed; everything else is looked up from memory.

The problem: KV cache size grows linearly with sequence length and model size.

KV Cache Memory Math

KV cache size = 2 × layers × heads × head_dim × seq_len × bytes_per_element

For a Llama 3 70B model in FP16:

80 layers, 64 heads, 128 head dimensions
FP16 = 2 bytes per element

Sequence length	KV cache VRAM
4K tokens	~3.5 GB
32K tokens	~27 GB
128K tokens	~109 GB

This is additive to the model weights (~140 GB for Llama 3 70B in FP16). Long context doesn't just test the context window limit — it strains VRAM.

PagedAttention

vLLM's key innovation. Traditional inference servers pre-allocate a contiguous VRAM block for the maximum possible sequence length, even if most requests are short. This wastes VRAM on every short request.

PagedAttention stores KV cache in non-contiguous pages (like OS virtual memory). Each request gets pages as needed; pages are freed when the request completes. Result: 2–4× higher GPU utilization and throughput on typical workloads.

KV Cache Compression Techniques

For very long contexts or limited VRAM:

KV cache quantization — store K and V tensors in INT8 or INT4 instead of FP16. Halves or quarters the VRAM cost with minor quality degradation at moderate sequence lengths.

StreamingLLM — evict old tokens' KV cache, keeping only recent tokens and the initial "attention sink" tokens (the first few tokens that receive disproportionate attention in most models). Allows infinite-length generation with fixed VRAM, at the cost of losing access to middle-context information.

H2O (Heavy Hitter Oracle) — selectively evict tokens with low attention scores, keeping only "heavy hitters." Better quality than StreamingLLM's fixed-window approach for tasks where important tokens are scattered throughout the context.

Quantization

Quantization reduces the numerical precision of model weights (and optionally KV cache) to fit more into VRAM and speed up computation.

Precision Formats

Format	Bits	VRAM vs. FP32	Speed vs. FP32	Quality loss
FP32	32	1.0× (baseline)	1.0×	None (reference)
BF16	16	0.5×	~1.5×	Minimal
FP16	16	0.5×	~1.5×	Minimal
INT8 (W8A16)	8	0.25×	~1.8×	Low
INT4 (GPTQ/AWQ)	4	0.125×	~2.5×	Moderate
INT4 (EXL2 mixed)	~4 avg	~0.125×	~2.5×	Low–moderate
GGUF Q4_K_M	~4 avg	~0.125×	varies	Low–moderate

BF16 is the modern standard for training and high-quality inference. FP16 and BF16 are nearly equivalent in practice; BF16 has better dynamic range for large values.

Quantization Methods

GPTQ — post-training quantization using a second-order optimization. Good quality-per-bit; slower to apply than simpler methods. Standard for 4-bit GPU inference.

AWQ (Activation-Aware Weight Quantization) — protects salient weights (those with large activations) from aggressive quantization. Better quality than GPTQ at the same bit-width, especially for instruction-following tasks.

EXL2 — finer-grained per-layer bitwidth allocation. Assigns more bits to critical layers, fewer to less sensitive ones. Best quality-per-bit among 4-bit methods; NVIDIA-only.

GGUF (llama.cpp format) — quantization format designed for CPU + GPU hybrid inference. Supports a range of quantization levels (Q2 through Q8). The Q4_K_M variant is a good quality/size balance for local use.

What Quantization Costs

Quantization losses are not uniform:

Factual recall — usually preserved well at INT8, degrades at INT4
Complex reasoning — most sensitive to quantization; INT4 models can miss multi-step logic that FP16 handles
Instruction following — generally preserved well down to Q4/AWQ
Code generation — degraded at aggressive quantization (Q3 and below)

The practical floor for quality-sensitive production use is AWQ INT4 or GPTQ INT4. Below that (Q3, Q2), you're trading significant capability for VRAM savings.

Practical Implications

Running Llama 4 Scout (17B×16E MoE) locally:

Quantization	Active param VRAM	Fits on
BF16	~34 GB	A100 80GB (tight)
INT8	~17 GB	RTX 4090 24GB (partial layers to CPU)
Q4_K_M	~10 GB	RTX 3080 / Mac M2 16GB

Running DeepSeek-R1 671B (MoE, 37B active):

Quantization	VRAM needed	Fits on
FP16 weights	~1.3 TB total	Not feasible single-node
INT8 weights	~680 GB	~8× A100 80GB
Q4 weights	~340 GB	~4–5× A100 80GB
Q2_K (lossy)	~170 GB	~2–3× A100 80GB

For large MoE models, the relevant memory is for the active parameters — but you still need to store all weights.

Production Reality

INT8 is usually the right trade-off — near-FP16 quality with half the VRAM. Most production deployments use W8A16 (8-bit weights, 16-bit activations). INT4 is for VRAM-constrained environments where quality compromise is acceptable.

Benchmark on your task, not benchmarks — published accuracy numbers are averages across diverse tasks. Your specific use case may be more or less sensitive to quantization. Always evaluate against your own eval set.

KV cache is often the binding constraint — for long-context use cases, quantizing the KV cache (INT8 or INT4 KV) can unlock more headroom than quantizing weights. Tools like vLLM support --kv-cache-dtype fp8 for this.

Cloud providers handle this for you — managed API providers run their own optimized quantized models. The Claude Opus 4.7 and GPT-5.5 you call via API are not running in FP32. This is a self-hosting concern.

Context Windows — for the product-level tradeoffs that make KV cache growth hurt
LLM Serving — for deployment systems that implement paging, batching, and memory optimization
Latency — for the throughput and response-time effects of memory pressure

KV Cache & Quantization ​

The Decision ​

KV Cache ​

KV Cache Memory Math ​

PagedAttention ​

KV Cache Compression Techniques ​

Quantization ​

Precision Formats ​

Quantization Methods ​

What Quantization Costs ​

Practical Implications ​

Production Reality ​

Related Topics ​