Abstract
Model quantization — the process of representing neural network weights and activations with reduced numerical precision — has become an indispensable technique for deploying large language models (LLMs) at scale. As parameter counts reach hundreds of billions, the memory and compute requirements of full-precision inference become prohibitive for most practitioners. This paper provides a rigorous analysis of quantization methodology as applied to transformer-based LLMs, covering post-training quantization (PTQ), quantization-aware training (QAT), and mixed-precision schemes. We examine the theoretical foundations governing quantization error, analyze the unique challenges posed by LLM weight distributions (including outlier activations and layer-wise sensitivity), and survey empirical results from recent work including GPTQ, SmoothQuant, LLM.int8(), and AWQ. Our analysis reveals that the precision-performance tradeoff is highly non-linear and architecture-dependent, with outlier-driven degradation remaining the central unsolved challenge below 4-bit precision. We conclude by identifying open problems and directions for future research.
1. Introduction
The deployment of large language models presents a fundamental resource tension: the models that achieve state-of-the-art performance are also the most expensive to run. A 70-billion parameter model stored in 16-bit floating point requires approximately 140 GB of GPU memory — exceeding the capacity of all but the most expensive accelerators. Quantization offers a principled path through this constraint by reducing the number of bits used to represent each parameter.
The central challenge of quantization is information loss. When a 32-bit or 16-bit floating-point value is mapped to a discrete set of representable values at lower precision, some information is necessarily discarded. The question is not whether this loss occurs, but how much performance degradation it induces, and whether the degradation can be minimized through careful calibration, training, or architectural design.
For classical convolutional neural networks, 8-bit quantization had been largely solved by 2018 through techniques like symmetric and asymmetric fixed-point quantization with per-channel scaling. The situation for transformer-based LLMs is considerably more complex. Transformer activations exhibit systematic heavy-tailed distributions — a small fraction of activation channels take on values orders of magnitude larger than typical values — making standard quantization schemes highly suboptimal.
This paper analyzes the quantization problem for LLMs from first principles. Section 2 surveys related work and prior methods. Section 3 provides a technical analysis of quantization error, outlier structure, and the mathematical machinery of modern quantization algorithms. Section 4 discusses practical implications and open problems. Section 5 concludes.
2. Related Work
The literature on neural network quantization spans more than a decade and has accelerated significantly with the rise of LLMs. We survey the most relevant threads.
Courbariaux et al. (2016) introduced BinaryNet and trained networks with binary weights ($\{-1, +1\}$), demonstrating that very aggressive quantization could preserve surprising amounts of accuracy for image classification tasks. While direct application to LLMs is infeasible, the theoretical framework for training with quantized representations originates largely from this line of work.
Jacob et al. (2018) proposed the quantization scheme adopted in TensorFlow Lite and widely deployed in production: symmetric and asymmetric 8-bit integer quantization with per-tensor or per-channel scale factors, calibrated on a representative dataset. This approach works well for CNN inference and established the engineering vocabulary still used today.
Dettmers et al. (2022) identified the core challenge specific to LLMs in their LLM.int8() paper: a small fraction of activation dimensions — roughly 0.1% — take on values dramatically larger than the rest, causing quantization error to be concentrated and highly damaging. Their solution, mixed-precision decomposition, identifies outlier dimensions and computes their contributions in 16-bit while quantizing the remainder to 8-bit, achieving near-lossless performance at effectively INT8 compute cost for linear layers.
Frantar et al. (2022) introduced GPTQ, a post-training quantization algorithm based on the Optimal Brain Quantization (OBQ) framework originally derived from second-order optimization methods. GPTQ quantizes weights layer-by-layer using approximate Hessian information from a small calibration set, achieving 4-bit weight quantization of 175-billion parameter models with minimal perplexity degradation. It remains one of the dominant PTQ methods for LLMs.
Xiao et al. (2023) proposed SmoothQuant, which addresses the activation outlier problem not by mixed precision but by mathematically migrating the quantization difficulty from activations to weights. Through a channel-wise scaling transformation, activation magnitudes are smoothed at the cost of increased weight magnitude variance, which proves easier to quantize due to weights’ static nature.
Lin et al. (2023) introduced AWQ (Activation-aware Weight Quantization), observing that not all weights are equally important and that protecting the small fraction corresponding to salient activation channels is sufficient to preserve model quality. AWQ achieves hardware-efficient 4-bit quantization without the reordering or grouping heuristics used by prior methods.
3. Technical Analysis
3.1 Quantization Formalism
Let $\mathbf{W} \in \mathbb{R}^{m \times n}$ denote a weight matrix and let $b$ denote the target bit-width. A uniform affine quantizer maps each element $w \in \mathbb{R}$ to a quantized value $\hat{w}$ via:
$$\hat{w} = s \cdot \text{clamp}\left(\left\lfloor \frac{w}{s} \right\rceil + z, \, 0, \, 2^b - 1\right) - s \cdot z$$
where $s \in \mathbb{R}^+$ is the scale factor, $z \in \mathbb{Z}$ is the zero-point (for asymmetric quantization), $\lfloor \cdot \rceil$ denotes rounding, and clamp clips values to the representable range $[0, 2^b - 1]$. The quantization error for a single weight is $\epsilon_w = w - \hat{w}$.
For symmetric quantization ($z = 0$), the scale is determined by the weight range:
$$s = \frac{\max(|\mathbf{W}|)}{2^{b-1} - 1}$$
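The affine and symmetric quantizers defined above can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas (round-dequantize on a toy array), not a production kernel:

```python
import numpy as np

def quantize_affine(w, bits, symmetric=True):
    """Uniform quantizer from the formulas above; returns dequantized values."""
    if symmetric:
        # Symmetric (z = 0): signed grid, scale from the absolute maximum
        qmax = 2 ** (bits - 1) - 1
        s = np.abs(w).max() / qmax
        q = np.clip(np.round(w / s), -qmax, qmax)
        return s * q
    # Asymmetric: map [min(w), max(w)] onto the unsigned grid [0, 2^b - 1]
    qmax = 2 ** bits - 1
    s = (w.max() - w.min()) / qmax
    z = np.round(-w.min() / s)
    q = np.clip(np.round(w / s) + z, 0, qmax)
    return s * (q - z)

w = np.array([-1.00, -0.33, 0.00, 0.50, 1.27])
w_sym = quantize_affine(w, bits=8)
w_asym = quantize_affine(w, bits=8, symmetric=False)
```

In both modes the per-weight error $\epsilon_w$ is bounded by half the scale, $|w - \hat{w}| \le s/2$, whenever no clipping occurs.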
The total quantization error for a layer’s output, given input activations $\mathbf{x}$, is:
$$\Delta \mathbf{y} = (\mathbf{W} – \hat{\mathbf{W}}) \mathbf{x} = \mathbf{E} \mathbf{x}$$
where $\mathbf{E} = \mathbf{W} – \hat{\mathbf{W}}$ is the weight quantization error matrix. The expected squared output error is:
$$\mathbb{E}[\|\Delta \mathbf{y}\|^2] = \text{tr}(\mathbf{E} \mathbf{H} \mathbf{E}^\top)$$
where $\mathbf{H} = \mathbb{E}[\mathbf{x} \mathbf{x}^\top]$ is the input covariance (Hessian of the squared loss with respect to the weights). This is precisely the objective that GPTQ minimizes.
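The trace identity above is exact when $\mathbf{H}$ is the empirical covariance of the calibration set, which a few lines of NumPy can verify numerically (random data and a crude 0.25-step rounding stand in for a real layer and quantizer):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 8, 16, 4096           # output dim, input dim, calibration samples
W = rng.standard_normal((m, n))
W_hat = np.round(W * 4) / 4     # stand-in quantizer with step 0.25
E = W - W_hat                   # weight quantization error matrix

X = rng.standard_normal((N, n))  # calibration activations (one per row)
H = X.T @ X / N                  # empirical E[x x^T]

# Average squared output error, computed directly over the samples ...
direct = np.mean(np.sum((X @ E.T) ** 2, axis=1))
# ... and via the trace identity tr(E H E^T)
via_trace = np.trace(E @ H @ E.T)
```

The two quantities agree because $\frac{1}{N}\sum_i \|\mathbf{E}\mathbf{x}_i\|^2 = \text{tr}(\mathbf{E}^\top\mathbf{E} \cdot \frac{1}{N}\sum_i \mathbf{x}_i\mathbf{x}_i^\top)$.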
3.2 Optimal Brain Quantization
GPTQ proceeds by quantizing weights one column at a time, updating the remaining unquantized weights to compensate for the error introduced. For a single layer with weight matrix $\mathbf{W}$ and Hessian $\mathbf{H}$, quantizing weight $w_q$ introduces error $\delta_q = \hat{w}_q - w_q$. The optimal update to the remaining weights is:
$$\delta_{-q} = \frac{\delta_q}{[\mathbf{H}^{-1}]_{qq}} \cdot (\mathbf{H}^{-1})_{:,q}$$
This is the classic OBS (Optimal Brain Surgeon) update. GPTQ approximates the full OBS procedure by processing weights column-by-column and using a Cholesky decomposition of the inverse Hessian for numerical stability, reducing the $O(n^3)$ per-weight complexity to an amortized $O(n^2)$ per column pass.
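A single compensation step is easy to check numerically. The toy sketch below (random SPD proxy Hessian, 0.5-step grid; the real algorithm iterates column-by-column via Cholesky) applies the update $\delta = \frac{\delta_q}{[\mathbf{H}^{-1}]_{qq}}(\mathbf{H}^{-1})_{:,q}$ with $\delta_q = \hat{w}_q - w_q$, and compares the proxy loss $\delta^\top \mathbf{H} \delta$ with and without compensation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
A = rng.standard_normal((n, n))
H = A @ A.T + 0.1 * np.eye(n)        # SPD proxy Hessian E[x x^T]
Hinv = np.linalg.inv(H)

w = rng.standard_normal(n)           # one row of the weight matrix
q = 3                                # index of the weight being quantized
w_q_hat = np.round(w[q] * 2) / 2     # snap to a 0.5-step grid
delta_q = w_q_hat - w[q]             # quantization error on weight q

# OBS: optimal compensating update (its q-th entry equals delta_q itself)
delta = (delta_q / Hinv[q, q]) * Hinv[:, q]
loss_obs = delta @ H @ delta          # closed form: delta_q**2 / Hinv[q, q]

# No compensation: only weight q moves
naive = np.zeros(n)
naive[q] = delta_q
loss_naive = naive @ H @ naive        # closed form: delta_q**2 * H[q, q]
```

The compensated loss equals $\delta_q^2 / [\mathbf{H}^{-1}]_{qq}$, which never exceeds the uncompensated loss $\delta_q^2 H_{qq}$ since $[\mathbf{H}^{-1}]_{qq} \ge 1/H_{qq}$ for any SPD $\mathbf{H}$.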
The key empirical finding is that despite these approximations, GPTQ achieves near-minimal quantization error on large transformers, suggesting the error landscape is sufficiently well-conditioned that column-wise greedy optimization is near-optimal.
3.3 Activation Outliers and Mixed-Precision Decomposition
The outlier phenomenon in LLM activations fundamentally complicates weight-only quantization strategies. Define an outlier dimension $d^*$ as one for which:
$$|x_{d^*}| > \mu_x + k \sigma_x$$
for some threshold $k$. Dettmers et al. observe that these dimensions are (a) consistent across inputs and (b) grow systematically with model scale. For a 6.7B parameter OPT model, roughly 25% of sequence positions contain at least one outlier; for 66B parameters, this rises to nearly 100%.
The damage is asymmetric: if activation $x_{d^*}$ is large and weight column $w_{:,d^*}$ is quantized aggressively, the product $w_{:,d^*} x_{d^*}$ accumulates large absolute error in the output. LLM.int8() decomposes the matrix multiplication as:
$$\mathbf{W} \mathbf{x} = \mathbf{W}_{\mathcal{O}} \mathbf{x}_{\mathcal{O}} + \hat{\mathbf{W}}_{\mathcal{N}} \hat{\mathbf{x}}_{\mathcal{N}}$$
where $\mathcal{O}$ indexes outlier dimensions (computed in FP16) and $\mathcal{N}$ indexes normal dimensions (computed in INT8). The overhead is modest: since $|\mathcal{O}| / d \approx 0.001$, the FP16 path handles a negligible fraction of compute.
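The effect of the decomposition can be demonstrated on synthetic data. In the sketch below (toy dimensions, one hand-planted outlier, full-precision arithmetic simulating the INT8 path), naive per-tensor quantization lets the outlier inflate the activation scale, while the mixed-precision split keeps the error small:

```python
import numpy as np

def int8_absmax(t):
    """Per-tensor symmetric INT8 quantization (returns dequantized values)."""
    s = np.abs(t).max() / 127
    return s * np.clip(np.round(t / s), -127, 127)

rng = np.random.default_rng(0)
d, m = 64, 32
W = rng.standard_normal((m, d))
x = rng.standard_normal(d)
x[5] = 80.0                              # one synthetic outlier dimension

y_ref = W @ x

# Naive INT8: the outlier inflates the activation scale, crushing the
# resolution available to every other dimension.
y_naive = int8_absmax(W) @ int8_absmax(x)

# LLM.int8()-style decomposition: outlier dims in full precision, rest in INT8.
outlier = np.abs(x) > 6.0                # fixed magnitude threshold
y_mixed = (W[:, outlier] @ x[outlier]
           + int8_absmax(W[:, ~outlier]) @ int8_absmax(x[~outlier]))
```

The threshold of 6.0 here mirrors LLM.int8()'s default outlier criterion; in practice the outlier set is detected from activation statistics rather than hard-coded.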
3.4 SmoothQuant: Migration of Quantization Difficulty
SmoothQuant takes a different approach: rather than mixed precision, it applies a channel-wise transformation before quantization. For a linear layer $\mathbf{y} = \mathbf{W} \mathbf{x}$, define a diagonal scaling matrix $\mathbf{D} = \text{diag}(s_1, \ldots, s_d)$ and rewrite:
$$\mathbf{y} = (\mathbf{W} \mathbf{D}) (\mathbf{D}^{-1} \mathbf{x})$$
The scale factors $s_j$ are chosen to migrate activation magnitude into the weights, with per-channel ranges measured on a calibration set:
$$s_j = \frac{\max(|x_j|)^\alpha}{\max(|\mathbf{W}_{:,j}|)^{1-\alpha}}$$
With $\alpha = 0.5$, the geometric mean of activation and weight ranges determines the scale, effectively balancing the quantization difficulty between the two. Since $\mathbf{D}^{-1}$ can be absorbed into the preceding layer normalization (which is a static operation at inference time), the transformation adds zero runtime overhead while dramatically improving quantization quality.
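Two properties of the transformation are directly checkable: the reparameterized product is mathematically identical to the original, and with $\alpha = 0.5$ the scaled activation and weight ranges coincide exactly at $\sqrt{\max|x_j| \cdot \max|\mathbf{W}_{:,j}|}$ per channel. A sketch on synthetic data with one planted outlier channel:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, N = 16, 32, 256
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((N, d_in))       # calibration activations (rows)
X[:, 7] *= 50.0                          # a systematic outlier channel

alpha = 0.5
a_max = np.abs(X).max(axis=0)            # per-channel activation range
w_max = np.abs(W).max(axis=0)            # per-input-channel weight range
s = a_max ** alpha / w_max ** (1 - alpha)

W_s = W * s                              # columns of W scaled up:    W D
X_s = X / s                              # channels of x scaled down: D^{-1} x
```

After smoothing, the outlier channel's activation range shrinks from ~$\max|x_7|$ to its geometric mean with the corresponding weight range, at the cost of a proportionally larger weight column.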
3.5 AWQ: Protecting Salient Weights
AWQ (Activation-aware Weight Quantization) is motivated by the observation that 1% of weights, when protected from quantization, can recover most of the model’s performance. The key insight is that weight salience is determined by the corresponding activation magnitudes: weight $w_{ij}$ contributes $w_{ij} x_j$ to the output, so its effective importance scales with $|x_j|$.
Rather than protecting weights through mixed precision (which complicates hardware implementation), AWQ applies per-channel scaling to weights:
$$\hat{w}_{ij} = Q\left(w_{ij} \cdot s_j\right) / s_j$$
where $Q(\cdot)$ denotes the quantization operator and $s_j > 1$ for salient channels (channels with large $|x_j|$). Scaling up a weight before quantization effectively allocates more representational capacity to it, reducing its relative quantization error. The scale factors are found by grid search to minimize output error on a calibration set.
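The grid search can be reduced, as in AWQ, to a single exponent $\alpha$ with $s_j = \max(|x_j|)^\alpha$. The sketch below is a simplified illustration on synthetic data (toy dimensions, per-row symmetric INT4, one planted salient channel), not the paper's full search space:

```python
import numpy as np

def quantize_rows(Wm, bits=4):
    """Symmetric per-output-channel (per-row) quantization, dequantized."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(Wm).max(axis=1, keepdims=True) / qmax
    return s * np.clip(np.round(Wm / s), -qmax, qmax)

rng = np.random.default_rng(0)
d_out, d_in, N = 16, 32, 64
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((N, d_in))
X[:, 3] *= 20.0                          # a salient input channel
y_ref = X @ W.T

def output_err(scales):
    # AWQ-style protection: Q(W * s) / s, activations left untouched
    W_q = quantize_rows(W * scales) / scales
    return np.linalg.norm(X @ W_q.T - y_ref)

# Grid search over a single exponent: s_j = max|x_j| ** alpha
a_max = np.abs(X).max(axis=0)
grid = [0.0, 0.25, 0.5, 0.75, 1.0]       # alpha = 0 means no scaling
errs = {a: output_err(a_max ** a) for a in grid}
best_alpha = min(errs, key=errs.get)
```

Because $\alpha = 0$ (no scaling) is in the search grid, the selected scaling can never do worse than plain round-to-nearest quantization on the calibration objective.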
AWQ achieves INT4 quantization of LLaMA and other architectures with perplexity degradation of 0.1–0.5 points on WikiText-2, compared to 2–5+ points for naive INT4 quantization, and is hardware-efficient due to its structured scaling operations.
3.6 Quantization-Aware Training
All methods discussed above are post-training — they quantize a pre-trained model without gradient-based optimization. Quantization-aware training (QAT) instead simulates quantization during the forward pass and uses straight-through estimators (STEs) to propagate gradients through the (otherwise non-differentiable) rounding operation:
$$\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial \hat{w}} \cdot \mathbf{1}\left[w \in [w_{\min}, w_{\max}]\right]$$
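A minimal sketch of the STE, written as explicit forward/backward functions rather than relying on an autograd framework (hypothetical helper names; the scale and range here correspond to a 4-bit symmetric grid):

```python
import numpy as np

def fake_quant_forward(w, s, qmax):
    """Simulated (fake) quantization for the QAT forward pass."""
    return s * np.clip(np.round(w / s), -qmax, qmax)

def fake_quant_backward(w, s, qmax, grad_out):
    """Straight-through estimator: pass the gradient through the rounding,
    but zero it where the clamp saturates (the indicator in the equation)."""
    in_range = np.abs(w / s) <= qmax
    return grad_out * in_range

w = np.array([-3.0, -0.4, 0.2, 0.9, 3.0])
s, qmax = 0.1, 7                         # 4-bit symmetric: range is ±0.7
g_up = np.ones_like(w)                   # upstream gradient dL/dw_hat
g_w = fake_quant_backward(w, s, qmax, g_up)
```

Weights inside the representable range ($-0.4$ and $0.2$ here) receive the upstream gradient unchanged; saturated weights receive zero, preventing them from drifting further outside the clamp.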
QAT generally achieves better performance than PTQ at the same bit-width, particularly below 4 bits, but requires access to training data and compute budgets that scale with model size. For 70B+ parameter models, QAT is currently impractical except on very large compute clusters. Recent work on parameter-efficient QAT (e.g., quantizing only adapters or final layers) attempts to bring this within reach, but these methods have not yet demonstrated the same quality as full QAT.
3.7 Bit-Width Sensitivity Analysis
Empirical results across methods reveal a consistent pattern. For most architectures:
- INT8 (8-bit): Near-lossless with calibration; perplexity degradation typically <0.1 points.
- INT4 (4-bit): Achievable with modern methods (GPTQ, AWQ); degradation 0.1–0.5 points on standard benchmarks.
- INT3 (3-bit): Significant degradation without QAT; 1–3 point perplexity increase typical.
- INT2 (2-bit): Severe degradation for most models; requires specialized architectures or training procedures.
The transition from 4-bit to 3-bit represents the current hard frontier. The information-theoretic lower bound for weight quantization given a fixed perplexity budget can be derived from rate-distortion theory, but practical methods remain far from this bound at sub-4-bit precision.
4. Discussion
4.1 The Hardware-Algorithm Gap
A persistent challenge in LLM quantization is the gap between algorithmic progress and hardware support. Modern GPUs offer native INT8 tensor core operations (e.g., NVIDIA’s DP4A instruction) and experimental INT4 support. However, the irregular mixed-precision patterns produced by LLM.int8() and similar methods are difficult to map efficiently to these hardware primitives.
AWQ and GPTQ with group quantization (quantizing groups of $g$ weights with shared scale factors, typically $g = 128$) are specifically designed to produce regular patterns amenable to hardware-efficient GEMM implementations. The TensorRT-LLM and ExLlama kernels exploit this structure to achieve near-theoretical speedups on consumer and data-center GPUs.
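Group quantization itself is simple: reshape the weight stream into groups of $g$ and quantize each with its own scale. The sketch below is a plain NumPy illustration (symmetric INT4, $g = 128$), not any particular kernel's layout:

```python
import numpy as np

def quantize_grouped(w, bits=4, g=128):
    """Symmetric group quantization: each run of g consecutive weights
    shares one scale factor."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, g)
    s = np.abs(groups).max(axis=1, keepdims=True) / qmax
    w_hat = s * np.clip(np.round(groups / s), -qmax, qmax)
    return w_hat.reshape(-1), s.ravel()

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
w_hat, scales = quantize_grouped(w, bits=4, g=128)

# Per-tensor baseline for comparison: one scale for all 1024 weights
S = np.abs(w).max() / 7
w_tensor = S * np.clip(np.round(w / S), -7, 7)
```

Each group's maximum error is bounded by half its own scale, which is why finer groups (at the storage cost of one extra scale per group) typically reduce mean squared error relative to a single per-tensor scale.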
4.2 Activation vs. Weight Quantization
Weight-only quantization (W4A16 — 4-bit weights, 16-bit activations) is considerably easier than weight-and-activation quantization (W4A4). The former requires only static calibration; the latter must handle the full dynamic range of runtime activations, including the outlier problem at serving time.
W4A4 is desirable because INT4 multiply-accumulate operations provide 4× the throughput of FP16 on supporting hardware. Current W4A4 methods (e.g., QuaRot, SpinQuant) require rotation-based preprocessing to redistribute outlier energy, achieving results competitive with W8A8 but significantly below FP16 on reasoning-heavy benchmarks. This gap likely reflects genuine information loss in the 4-bit activation representation of semantically complex transformer states.
4.3 Calibration Data Sensitivity
PTQ methods depend on a calibration dataset to estimate Hessians, activation statistics, or weight salience. The sensitivity to calibration data choice is an underexplored empirical question. GPTQ and AWQ typically use 128 samples from C4 or WikiText; preliminary results suggest that domain mismatch between calibration and deployment data can degrade quantization quality by 0.5–2 perplexity points — a non-trivial effect that practitioners should account for.
4.4 Structural Quantization
An alternative to uniform quantization is non-uniform quantization, where representable values are spaced non-uniformly to better match the empirical weight distribution. Normal float (NF4) quantization, introduced with QLoRA (Dettmers et al., 2023), uses quantile-based assignment of representation values to minimize expected quantization error for normally-distributed weights:
$$v_i = \Phi^{-1}\left(\frac{i + 1/2}{2^k}\right), \quad i = 0, \ldots, 2^k - 1$$
where $\Phi^{-1}$ is the inverse normal CDF and the resulting levels are rescaled to $[-1, 1]$; the half-step offset keeps the endpoint quantiles finite, and QLoRA's exact construction differs slightly so as to include an exact zero level. NF4 achieves consistently lower quantization error than INT4 for normally-distributed weights, which transformer weights approximately follow, and has been adopted in several production frameworks.
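The advantage over a uniform grid can be reproduced with the standard library alone. The sketch below builds quantile-based levels via the inverse normal CDF (a simplified construction, as noted above, without QLoRA's exact-zero adjustment) and compares mean squared quantization error against a uniform symmetric 4-bit grid on Gaussian samples:

```python
import random
from statistics import NormalDist

def normal_float_levels(bits=4):
    """Quantile-based levels with equal probability mass per bin,
    rescaled to [-1, 1]."""
    k = 2 ** bits
    inv = NormalDist().inv_cdf
    raw = [inv((i + 0.5) / k) for i in range(k)]
    m = max(abs(v) for v in raw)
    return [v / m for v in raw]

def snap(xs, levels):
    """Absmax-normalize, then map each value to its nearest level."""
    m = max(abs(x) for x in xs)
    return [m * min(levels, key=lambda v: abs(v - x / m)) for x in xs]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(4000)]

nf4 = normal_float_levels(4)
int4 = [i / 7 for i in range(-7, 8)]     # uniform symmetric 4-bit grid
err_nf4 = mse(xs, snap(xs, nf4))
err_int4 = mse(xs, snap(xs, int4))
```

The quantile grid spends its levels where Gaussian weights actually concentrate, which is exactly the property the text attributes to NF4.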
5. Conclusion
Quantization has become a critical component of the LLM deployment stack. Post-training methods — GPTQ, SmoothQuant, AWQ, LLM.int8() — have pushed the practical frontier to 4-bit weights with minimal performance degradation, enabling deployment of large models on commodity hardware. The theoretical foundations of these methods in second-order optimization, rate-distortion theory, and channel-wise scaling are now well-understood.
The central open problems are: (1) extending reliable quantization below 4 bits without QAT, (2) closing the W4A4 gap with W8A8, (3) reducing sensitivity to calibration data distribution, and (4) developing quantization methods native to emerging attention-free architectures (state space models, linear attention variants) whose activation statistics differ qualitatively from softmax transformers.
As model sizes continue to grow and deployment shifts toward edge and on-device inference, quantization will only become more central to the practical AI engineering toolkit. The field would benefit from more rigorous theoretical characterization of the information-theoretic lower bounds at each precision level, and from systematic benchmarking that separates the effects of model size, architecture, training data, and quantization method in a controlled manner.
References
- Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized Neural Networks. Advances in Neural Information Processing Systems (NeurIPS), 29.
- Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., … & Kalenichenko, D. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems (NeurIPS), 35.
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323.
- Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023). SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. International Conference on Machine Learning (ICML).
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978.
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems (NeurIPS), 36.
- Hassibi, B., Stork, D. G., & Wolff, G. J. (1993). Optimal Brain Surgeon and General Network Pruning. IEEE International Conference on Neural Networks.