Grouped Query Attention and Multi-Query Attention: KV Cache Compression, Inference Efficiency, and the Memory Bandwidth Bottleneck in Large Language Models

Abstract

Memory bandwidth is the dominant constraint on autoregressive inference in large language models. As sequence lengths grow and batch sizes increase, the key-value (KV) cache inflates to gigabytes per request, saturating GPU memory and throttling throughput. Multi-Query Attention (MQA) and Grouped Query Attention (GQA) address this bottleneck by reducing the number of independent KV heads, sharing cached representations across query heads. This paper presents a rigorous technical analysis of both approaches: we derive the exact memory and bandwidth savings they confer, examine the quality-efficiency tradeoff under different grouping configurations, and contextualize these architectures within the broader landscape of efficient transformer inference. We further analyze the theoretical relationship between head sharing and model expressivity, discuss practical implementation considerations including uptraining existing models to GQA configurations, and survey empirical results across published benchmarks. Our analysis suggests that GQA achieves near-MHA quality at near-MQA memory cost, making it the preferred default for production LLM deployment.

1. Introduction

The inference cost of autoregressive language models is fundamentally asymmetric with training. During a single forward pass at inference time, the model must repeatedly access a growing cache of key and value tensors for every previously generated token. This KV cache occupies memory proportional to the product of sequence length, batch size, number of layers, number of KV heads, and head dimension. For a 70-billion parameter model with 64 layers, 64 heads, and head dimension 128, a single sequence of 8,192 tokens requires roughly 64 × 2 × 64 × 128 × 8,192 × 2 bytes ≈ 17 GB of KV cache at 16-bit precision. Scaling to a batch of 32 concurrent users demands over 500 GB just for the cache, exceeding the memory of most multi-GPU deployments.

The original Transformer architecture (Vaswani et al., 2017) performs Multi-Head Attention (MHA), in which each of the $h$ attention heads maintains independent key and value projections. This full parameterization is natural for training, where all heads can be updated simultaneously and memory is bounded by activations rather than caches. But at inference, the KV cache for MHA is $O(h \cdot d_k \cdot L)$ per layer, and reducing this without sacrificing expressivity is the central challenge.

Two complementary solutions have emerged. Shazeer (2019) proposed Multi-Query Attention (MQA), in which all query heads share a single key and value head—reducing KV cache size by a factor of $h$. Ainslie et al. (2023) generalized this to Grouped Query Attention (GQA), in which $h$ query heads are partitioned into $G$ groups, each group sharing one KV head, interpolating between MHA ($G = h$) and MQA ($G = 1$). Both approaches leave query projections fully independent, preserving the diversity of attention patterns while compressing the representational burden on the cache.

This paper provides a unified treatment of MQA and GQA. Section 2 surveys prior work on KV cache reduction and memory-efficient attention. Section 3 develops the formal analysis of bandwidth savings, expressivity, and the mathematics of group partitioning. Section 4 discusses empirical results and practical deployment considerations. Section 5 reflects on open questions and architectural implications. Section 6 concludes.

2. Related Work

Vaswani et al. (2017) introduced the Transformer with Multi-Head Attention, establishing the baseline against which all subsequent attention variants are measured. MHA projects inputs into $h$ independent query, key, and value spaces, concatenates attended outputs, and projects back. This formulation has proven remarkably robust across domains, but its KV cache grows linearly with both sequence length and head count, and the attention computation itself is quadratic in sequence length.

Shazeer (2019) proposed Multi-Query Attention in the context of fast autoregressive decoding. The key insight is that key and value projections can be shared across all query heads with minimal quality degradation, while the number of distinct query projections (and thus the diversity of attended information) is preserved. Empirical results on language modeling and machine translation showed negligible perplexity increases with 10–20× KV cache compression.

Ainslie et al. (2023) formalized Grouped Query Attention and introduced a method for converting pretrained MHA models to GQA via mean-pooling of grouped KV heads followed by continued pretraining (“uptraining”). Their experiments on T5 and language modeling benchmarks demonstrated that GQA with $G = 8$ groups closely matches MHA quality while approaching MQA’s memory efficiency.

Pope et al. (2023) analyzed KV cache partitioning strategies for serving large models across multiple TPU chips, establishing that KV cache bandwidth, not arithmetic operations, dominates the cost of autoregressive decoding at standard batch sizes. This work provides the hardware motivation for MQA/GQA: the bottleneck is memory reads, not FLOPs.

Kwon et al. (2023) introduced PagedAttention as part of the vLLM serving system, demonstrating that KV cache fragmentation—not just size—is a major source of memory waste. MQA and GQA directly reduce the raw size of each page, amplifying the benefits of paged management.

Dao et al. (2022) developed FlashAttention, showing that IO-aware tiling of attention computation dramatically reduces HBM reads during the forward pass. FlashAttention and MQA/GQA are complementary: FlashAttention reduces memory traffic during the prefill phase, while MQA/GQA reduces KV cache bandwidth during the decode phase.

Brandon et al. (2024) proposed Grouped Query Attention with sliding window variants and analyzed the interaction between head grouping and context length extrapolation, showing that KV compression can be combined with positional interpolation techniques without additional degradation.

3. Technical Analysis

3.1 Formal Setup

Let the input to an attention layer be $X \in \mathbb{R}^{L \times d}$ where $L$ is sequence length and $d$ is model dimension. In standard MHA with $h$ heads and head dimension $d_k = d / h$:

$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V \quad \text{for } i = 1, \ldots, h$$

where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$. The attention output for head $i$ is:

$$\text{Attn}_i(X) = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

The full MHA output is the concatenation of all head outputs projected through $W^O \in \mathbb{R}^{d \times d}$.
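To fix notation in code, the following is a minimal single-token decode step for MHA in PyTorch; the names and shapes are illustrative rather than taken from any particular implementation. Note that the cache carries all $h$ KV heads, which is exactly the term that MQA and GQA compress.

```python
import math
import torch

def mha_decode_step(x, w_q, w_k, w_v, w_o, k_cache, v_cache, h):
    """One autoregressive decode step of Multi-Head Attention.

    x:        (B, d)         hidden state of the newest token
    w_q/k/v:  (d, d)         packed projections for all h heads
    w_o:      (d, d)         output projection
    k_cache:  (B, h, t, dk)  cached keys for the t previous tokens
    v_cache:  (B, h, t, dk)  cached values
    """
    B, d = x.shape
    dk = d // h

    # Project the new token and split into heads: (B, h, 1, dk).
    q = (x @ w_q).view(B, h, 1, dk)
    k_new = (x @ w_k).view(B, h, 1, dk)
    v_new = (x @ w_v).view(B, h, 1, dk)

    # Append to the cache; MHA stores h full KV heads per layer.
    k_cache = torch.cat([k_cache, k_new], dim=2)    # (B, h, t+1, dk)
    v_cache = torch.cat([v_cache, v_new], dim=2)

    # Scaled dot-product attention against the whole cache.
    scores = q @ k_cache.transpose(-2, -1) / math.sqrt(dk)  # (B, h, 1, t+1)
    out = torch.softmax(scores, dim=-1) @ v_cache            # (B, h, 1, dk)

    # Merge heads and apply the output projection.
    return out.reshape(B, d) @ w_o, k_cache, v_cache
```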

3.2 Multi-Query Attention

MQA replaces the $h$ independent KV projections with a single shared pair of projections $W^K, W^V \in \mathbb{R}^{d \times d_k}$:

$$Q_i = X W_i^Q, \quad K = X W^K, \quad V = X W^V \quad \text{for } i = 1, \ldots, h$$

$$\text{Attn}_i^{\text{MQA}}(X) = \text{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{d_k}}\right) V$$

The KV cache now stores only $2 \times L \times d_k$ values per layer regardless of $h$, a compression factor of $h$ compared to MHA. For $h = 64$, this is a 64× reduction in cached memory. The query projections remain fully independent, so the model retains $h$ distinct attention distributions but aggregates over the same value representation.

3.3 Grouped Query Attention

GQA partitions the $h$ query heads into $G$ groups of size $h/G$. Each group $g \in \{1, \ldots, G\}$ is assigned shared KV projections $(W_g^K, W_g^V)$. For query head $i$ in group $g(i)$:

$$\text{Attn}_i^{\text{GQA}}(X) = \text{softmax}\!\left(\frac{Q_i K_{g(i)}^\top}{\sqrt{d_k}}\right) V_{g(i)}$$

The KV cache stores $2G \times L \times d_k$ values per layer, giving a compression factor of $h/G$. The endpoints recover MHA ($G = h$) and MQA ($G = 1$). For typical choices such as $G = 8$ with $h = 64$, the compression is 8×.
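The corresponding GQA decode step, sketched below under the same illustrative conventions as the MHA sketch above, differs only in that the cache holds $G$ KV heads, which are broadcast across the $h/G$ query heads of each group; $G = 1$ recovers MQA and $G = h$ recovers MHA.

```python
import math
import torch

def gqa_decode_step(x, w_q, w_k, w_v, w_o, k_cache, v_cache, h, G):
    """One autoregressive decode step of Grouped Query Attention.

    x:        (B, d)         hidden state of the newest token
    w_q:      (d, d)         packed query projections for h heads
    w_k, w_v: (d, G * dk)    packed projections for G shared KV heads
    w_o:      (d, d)         output projection
    k_cache:  (B, G, t, dk)  cached keys (G heads, not h)
    v_cache:  (B, G, t, dk)  cached values
    """
    B, d = x.shape
    dk = d // h

    q = (x @ w_q).view(B, h, 1, dk)                  # h query heads
    k_new = (x @ w_k).view(B, G, 1, dk)              # only G KV heads
    v_new = (x @ w_v).view(B, G, 1, dk)

    k_cache = torch.cat([k_cache, k_new], dim=2)     # (B, G, t+1, dk)
    v_cache = torch.cat([v_cache, v_new], dim=2)

    # Each contiguous group of h // G query heads reads the same KV head.
    k = k_cache.repeat_interleave(h // G, dim=1)     # (B, h, t+1, dk)
    v = v_cache.repeat_interleave(h // G, dim=1)

    scores = q @ k.transpose(-2, -1) / math.sqrt(dk)  # (B, h, 1, t+1)
    out = torch.softmax(scores, dim=-1) @ v           # (B, h, 1, dk)
    return out.reshape(B, d) @ w_o, k_cache, v_cache
```

The `repeat_interleave` copy is purely for readability; practical kernels typically index the shared KV head directly, so no expanded tensor is materialized.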

3.4 Memory and Bandwidth Analysis

Let $N_L$ be the number of layers. The total KV cache size for a sequence of length $L$ and batch size $B$ is:

$$\text{Cache}_{\text{MHA}} = 2 \cdot N_L \cdot h \cdot d_k \cdot L \cdot B \cdot \text{sizeof}(\text{dtype})$$

$$\text{Cache}_{\text{GQA}} = 2 \cdot N_L \cdot G \cdot d_k \cdot L \cdot B \cdot \text{sizeof}(\text{dtype})$$

For a 70B model with $N_L = 80$, $h = 64$, $d_k = 128$, $G = 8$, $L = 4096$, $B = 16$, at bfloat16 (2 bytes):

$$\text{Cache}_{\text{MHA}} = 2 \times 80 \times 64 \times 128 \times 4096 \times 16 \times 2 \text{ bytes} \approx 172 \text{ GB}$$

$$\text{Cache}_{\text{GQA}} = 2 \times 80 \times 8 \times 128 \times 4096 \times 16 \times 2 \text{ bytes} \approx 21.5 \text{ GB}$$

This 8× reduction in cache footprint directly translates to higher sustainable batch sizes and lower per-token latency, since the fraction of decoding time spent reading the KV cache decreases.
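These totals are easy to reproduce. The short script below is a minimal sketch using the numbers from this example; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    """Total KV cache size in bytes: 2 (K and V) * layers * KV heads *
    head dim * sequence length * batch size * bytes per element."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes

# 70B-class example: 80 layers, 64 query heads, head dim 128, bf16.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, seq_len=4096, batch=16)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8,  d_head=128, seq_len=4096, batch=16)

print(f"MHA cache: {mha / 1e9:.1f} GB")   # ~171.8 GB
print(f"GQA cache: {gqa / 1e9:.1f} GB")   # ~21.5 GB
print(f"compression: {mha / gqa:.0f}x")   # 8x
```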

More precisely, for autoregressive decoding, the time to generate one token is dominated by two costs: the arithmetic computation of $QK^\top$ and $\text{softmax}(\cdot)V$ (FLOP-bound), and the memory reads of the KV cache (bandwidth-bound). At typical inference batch sizes and sequence lengths, Pope et al. (2023) showed that modern hardware operates in the memory-bandwidth regime. The effective throughput is:

$$\text{tokens/sec} \approx B \cdot \frac{\text{HBM bandwidth}}{\text{bytes read per decode step}}$$

where each decode step produces one token per sequence in the batch and must stream the model weights plus the entire KV cache: $2 \cdot N_L \cdot h \cdot d_k \cdot L \cdot B \cdot \text{sizeof}$ bytes of cache for MHA versus $2 \cdot N_L \cdot G \cdot d_k \cdot L \cdot B \cdot \text{sizeof}$ for GQA. When the cache term dominates the weight reads (long contexts, large batches), GQA approaches a throughput multiplier of $h/G$ in the bandwidth-limited regime.
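A back-of-envelope roofline estimate makes the decode regime concrete. The sketch below assumes illustrative hardware figures (aggregate HBM bandwidth of roughly 3.35 TB/s and 140 GB of bf16 weights for a 70B model) and ignores compute, kernel overheads, and multi-GPU sharding; it is an estimate, not a measurement.

```python
def decode_tokens_per_sec(hbm_bw_bytes, weight_bytes, kv_bytes_per_seq, batch):
    """Rough bandwidth-limited decode throughput: every step must stream the
    model weights once plus the full KV cache of every sequence in the batch."""
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq
    steps_per_sec = hbm_bw_bytes / bytes_per_step
    return steps_per_sec * batch  # one new token per sequence per step

HBM = 3.35e12            # assumed aggregate HBM bandwidth, ~3.35 TB/s
WEIGHTS = 140e9          # 70B parameters at bf16
PER_SEQ_MHA = 2 * 80 * 64 * 128 * 4096 * 2   # ~10.7 GB of cache per sequence
PER_SEQ_GQA = PER_SEQ_MHA // 8

print(decode_tokens_per_sec(HBM, WEIGHTS, PER_SEQ_MHA, batch=16))  # ~170 tok/s
print(decode_tokens_per_sec(HBM, WEIGHTS, PER_SEQ_GQA, batch=16))  # ~330 tok/s
```

Note that the estimated speedup here is roughly 2× rather than 8×, because weight reads still dominate at this batch size and context length; the full $h/G$ multiplier is approached only when the cache term dominates.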

3.5 Expressivity Considerations

A natural question is whether KV head sharing degrades the expressive power of the attention layer. Consider the set of functions computable by a single GQA group: all $h/G$ query heads within the group attend to the same key and value projections, producing outputs that are mixtures of the rows of a single value matrix $V_g$. This constrains the diversity of information that different heads in the same group can extract from the value dimension.

Formally, in MHA the output of head $i$ at each position is a convex combination of the rows of $V_i = X W_i^V$, and distinct heads can draw on differently projected copies of the input. In GQA, all heads within group $g$ mix the rows of the same $V_g = X W_g^V$, so intra-group diversity can only arise from different query-weighted mixtures over the same set of projected values.

In practice, this theoretical constraint translates to only marginal quality degradation. The key intuition is that most information relevant to generation is redundantly represented across KV heads in well-trained models, so a compressed set of $G$ KV heads suffices to capture the diversity needed for high-quality predictions. This is consistent with the observation (Michel et al., 2019) that many MHA heads can be pruned with limited quality loss.

3.6 Uptraining: Converting MHA Models to GQA

Ainslie et al. (2023) introduced a practical procedure for converting a pretrained MHA model to GQA without full retraining. The conversion proceeds as follows:

  1. Partition the $h$ KV heads into $G$ groups of $h/G$ heads each.
  2. Within each group, mean-pool the $h/G$ key weight matrices and $h/G$ value weight matrices to produce a single representative $(W_g^K, W_g^V)$ per group.
  3. Continue pretraining the resulting model on a small fraction (typically 5%) of the original training data budget.

This uptraining procedure is inexpensive relative to full pretraining and produces models that closely match MHA quality on downstream evaluations. The mean-pooling initialization is important: random initialization of GQA heads followed by uptraining converges to lower quality, suggesting that the pooled initialization provides a useful warm start in the loss landscape.
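A sketch of the mean-pooling initialization in step 2 is shown below. The packed weight layout (per-head projection matrices concatenated along the output dimension, with grouped heads contiguous) is an assumption for illustration, and the continued-pretraining step is not shown.

```python
import torch

def mean_pool_kv_heads(w_kv, h, G, d_head):
    """Convert an MHA key (or value) projection to a GQA one by mean-pooling.

    w_kv: (d_model, h * d_head)  packed per-head projection, heads contiguous
    returns: (d_model, G * d_head)
    """
    d_model = w_kv.shape[0]
    # Split into (d_model, G, h // G, d_head): heads grouped contiguously.
    per_group = w_kv.view(d_model, G, h // G, d_head)
    # Average the h // G per-head matrices within each group (step 2 above).
    pooled = per_group.mean(dim=2)
    return pooled.reshape(d_model, G * d_head)

# Example: 64 MHA KV heads pooled into 8 GQA groups.
w_k = torch.randn(8192, 64 * 128)
w_k_gqa = mean_pool_kv_heads(w_k, h=64, G=8, d_head=128)
print(w_k_gqa.shape)  # torch.Size([8192, 1024])
```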

3.7 Interaction with Positional Encoding

In Transformer models using Rotary Position Embeddings (RoPE; Su et al., 2021), the rotation applied to queries and keys depends on position. Since GQA keeps independent query projections per head but shares each key head across all query heads in a group, the positional structure of the shared keys must be compatible with every query head in the group. In practice, this raises no issues: the single group key undergoes the standard RoPE rotation at each sequence position, and each query head attends to this rotated key normally. The positional information is correctly propagated because position is encoded in the query-key inner product, not in the value representation.
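This can be checked mechanically: the shared key head is rotated once per position and then reused by every query head in its group. Below is a minimal RoPE sketch under the common even/odd pairing convention; the function and shapes are illustrative, not drawn from a specific library.

```python
import torch

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (..., seq, d_head).

    Pairs of dimensions (2i, 2i+1) are rotated by an angle that depends on
    the token position and the pair index i."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# One shared key head per group, several query heads per group.
seq, dk = 5, 128
pos = torch.arange(seq)
k_shared = rope(torch.randn(1, seq, dk), pos)   # rotated once, then reused
q_heads  = rope(torch.randn(8, seq, dk), pos)   # each query head rotated
scores = q_heads @ k_shared.transpose(-2, -1)   # (8, seq, seq): every head
                                                # sees position-aware scores
```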

4. Discussion

4.1 Adoption in Production Models

GQA has been adopted as the default attention architecture in several widely deployed models. Llama 2 (Touvron et al., 2023) uses GQA for its 34B and 70B variants (with $G = 8$) while retaining MHA for the 7B and 13B models, reflecting the observation that KV cache compression matters more at large scale. Mistral 7B (Jiang et al., 2023) uses GQA even at 7B, motivated by its sliding window attention design, which further amplifies memory pressure. Gemma (Team et al., 2024) and many subsequent model families likewise adopt shared KV heads in some or all of their variants.

The consistent adoption of $G = 8$ across diverse architectures is notable and suggests an empirically stable sweet spot: 8 KV groups provide an 8× memory reduction with minimal perplexity regression across a wide range of tasks. This may reflect an implicit regularization: fewer KV groups force the model to represent shared contextual information more efficiently, reducing overfitting to idiosyncratic head-specific patterns.

4.2 Interaction with Batch Size and Sequence Length

The absolute benefit of GQA depends on the deployment regime. For short sequences and small batches, the KV cache is modest and the quality gap between MHA and GQA may not justify the architectural change. For long-context applications (document understanding, multi-turn dialogue, code generation over large codebases), the cache grows linearly in sequence length and GQA’s benefit scales proportionally. The memory savings enable serving longer contexts within a fixed VRAM budget, or serving larger batches at the same context length—both of which directly improve throughput.

4.3 Quantization of the KV Cache

GQA reduces the number of KV heads but does not reduce the precision of cached values. Recent work on KV cache quantization (Hooper et al., 2024) shows that keys and values can be stored at 4-bit or even 2-bit precision with limited quality loss, further compressing cache size. GQA and KV quantization are orthogonal and composable: applying both yields multiplicative savings. A GQA model with $G = 8$ and 4-bit KV quantization achieves a 32× reduction in cache memory relative to MHA with bfloat16 values.
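The composed savings follow from simple multiplication; below is a two-line check under the assumptions of this example (64 query heads, bf16 baseline, $G = 8$, 4-bit cache):

```python
head_compression = 64 / 8           # MHA (64 KV heads) -> GQA with G = 8
precision_compression = 16 / 4      # bf16 (16-bit) -> 4-bit cached values
print(head_compression * precision_compression)  # 32.0, i.e. a 32x smaller cache
```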

4.4 Limitations

Despite its practical success, GQA has theoretical limitations worth acknowledging. First, the optimal grouping strategy—which heads to group together—is typically determined by simple contiguous partitioning rather than by learned or structured clustering. There is evidence that attention heads specialize in distinct linguistic phenomena (Clark et al., 2019), and grouping heads with dissimilar functions may impose tighter constraints than grouping similar ones. Second, for tasks requiring fine-grained access to distinct contextual features from different positions (e.g., complex multi-hop reasoning), the loss of independent KV representations per group may matter more than in standard language modeling evaluations. Third, uptraining is not always available: models without continued pretraining show larger quality gaps, particularly on knowledge-intensive tasks where precise key-value associations are critical.

5. Conclusion

Multi-Query Attention and Grouped Query Attention represent principled architectural responses to the memory bandwidth bottleneck that dominates large language model inference. By sharing key-value heads across query groups, they compress the KV cache by factors of $h$ and $h/G$ respectively, enabling dramatically higher throughput and longer contexts within fixed hardware budgets. Our analysis shows that GQA at $G = 8$ achieves an 8× cache reduction with near-MHA quality across standard benchmarks, making it the preferred architecture for production deployment.

The success of GQA illustrates a broader principle: not all parameters in a neural network contribute equally to output quality, and identifying the subset that can be compressed or shared without functional loss is a reliable path to efficiency. As models continue to scale and deployment pressures intensify, attention mechanisms that separate the representational diversity of queries from the memory cost of key-value caches will remain central to practical LLM engineering. Future work exploring learned grouping strategies, dynamic group assignment conditioned on input, and joint optimization of KV quantization with head grouping offers promising directions for pushing this tradeoff further.
