Abstract
Tokenization—the process of segmenting raw text into discrete units for neural processing—is the least-scrutinized component of modern NLP pipelines yet exerts substantial influence over downstream task performance. Despite widespread adoption of subword algorithms such as Byte Pair Encoding (BPE) and Unigram Language Modeling, systematic analysis of how tokenizer design choices propagate through transformer architectures remains underrepresented in the literature. This work surveys the theoretical and empirical landscape of tokenization effects, examining how vocabulary size, granularity, and language-specific segmentation interact with model capacity, cross-lingual transfer, and task-specific objectives. We analyze three primary failure modes: over-segmentation of morphologically rich languages, vocabulary fertility imbalance across linguistic domains, and character-level information leakage that distorts embedding geometry. Drawing on results from multilingual NLP benchmarks and ablation studies, we argue that tokenization should be treated as a first-class modeling decision rather than an engineering afterthought, and outline directions for learned, task-aware tokenization schemes.
1. Introduction
Modern transformer-based language models owe much of their success to the attention mechanism and scale, but the pipeline begins before any weight matrix is applied: with tokenization. The choice of how to convert raw text into integer token sequences determines the effective vocabulary over which probability distributions are defined, the sequence lengths models must process, and the implicit inductive biases baked into every downstream representation.
Despite this centrality, tokenization is often treated as a solved problem. Practitioners select a pretrained tokenizer—typically the one bundled with a foundation model—and proceed without scrutiny. Yet a growing body of evidence suggests that tokenization mismatch between pretraining and fine-tuning domains can account for non-trivial performance degradation, sometimes exceeding the gap between model sizes. For example, Rust et al. (2021) demonstrated that adapting BERT’s tokenizer to target languages in multilingual settings substantially closes the performance gap with language-specific models—without any architectural change.
The problem is multidimensional. Tokenizers trained predominantly on English internet text encode semantic and morphological assumptions that do not transfer to agglutinative languages (Finnish, Turkish, Hungarian), logographic scripts (Chinese, Japanese), or technical domains (code, mathematical notation, clinical text). When a word that carries a single semantic unit in one language maps to dozens of subword tokens, the model must learn to reconstruct meaning across fragmented sequences—a task that consumes representational capacity and introduces sequence-length bottlenecks.
This paper provides a structured analysis of tokenization effects across four dimensions: (1) vocabulary design choices and their information-theoretic properties; (2) the relationship between subword granularity and model perplexity vs. task accuracy; (3) cross-lingual fertility imbalance and its effect on multilingual model quality; and (4) emerging approaches to learned, adaptive, and task-aware tokenization. Our goal is to surface tokenization as a design variable that warrants the same empirical rigor applied to architecture and training objectives.
2. Related Work
The modern tokenization landscape descends from three algorithmic families, each with distinct design philosophies. Sennrich et al. (2016) introduced Byte Pair Encoding (BPE) for neural machine translation, demonstrating that subword segmentation substantially improved handling of rare and out-of-vocabulary words by merging frequent character pairs iteratively. BPE became the dominant approach for GPT-family models and remains widely used.
Kudo and Richardson (2018) proposed the SentencePiece framework, which introduced Unigram Language Model tokenization alongside BPE as a unified, language-agnostic implementation operating on raw Unicode characters. The Unigram variant uses EM-based vocabulary pruning to maximize corpus likelihood, producing probabilistic segmentations amenable to regularization during training.
The fertility problem—measuring how many tokens a word expands to under a given tokenizer—was systematically characterized by Rust et al. (2021) in the context of multilingual BERT. They showed that languages with high fertility (many tokens per word) systematically underperform in cross-lingual transfer tasks, and proposed vocabulary adaptation through vocabulary extension and tokenizer replacement as a practical remedy.
Tokenization’s interaction with downstream task performance in code understanding was examined by Xu et al. (2022) in the CodeBERT lineage, where identifier splitting heuristics and tokenizer vocabulary coverage over programming language keywords proved crucial for structural code representations. Similarly, Gururangan et al. (2020) demonstrated through domain-adaptive pretraining that even with fixed tokenizers, domain-specific continued pretraining improves task performance—implicitly suggesting that tokenizer-domain mismatch is partially compensable but not fully recoverable through fine-tuning alone.
More recently, Zouhar et al. (2023) provided information-theoretic grounding for tokenizer evaluation, proposing tokenizer goodness metrics based on the Rényi entropy of vocabulary distributions and showing correlation with downstream NLU performance across GLUE benchmarks. Their work provides a principled basis for tokenizer comparison independent of end-task fine-tuning—a framework we build on in Section 3.
3. Technical Analysis
3.1 Vocabulary Size and Information-Theoretic Constraints
A tokenizer defines a function $\tau: \Sigma^* \rightarrow \mathcal{V}^*$ mapping strings over alphabet $\Sigma$ to sequences of vocabulary tokens $v \in \mathcal{V}$, where $|\mathcal{V}|$ is the vocabulary size. The fundamental tradeoff is between compression (longer tokens → shorter sequences → less computation) and expressiveness (finer granularity → better handling of morphological variation).
For a corpus $\mathcal{C}$, the fertility of tokenizer $\tau$ on language $\ell$ is:
$$F(\tau, \ell) = \frac{1}{|\mathcal{C}_\ell|} \sum_{w \in \mathcal{C}_\ell} |\tau(w)|$$
where $|\mathcal{C}_\ell|$ is the word count in the language-$\ell$ subset. High fertility corresponds to over-segmentation: morphological units are split across tokens, forcing the model to reconstruct meaning from fragments. For agglutinative languages, $F$ can be 3–5× that of English under a shared vocabulary.
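To make the fertility definition concrete, a minimal sketch follows; the fixed-width chunking tokenizer is a hypothetical stand-in for a real subword model, chosen only to mimic over-segmentation of long word forms:

```python
# Sketch: fertility F(tau, l) as mean tokens per word, per the definition above.
# The chunk3 "tokenizer" is a toy stand-in, not a real subword algorithm.
from typing import Callable, List

def fertility(tokenize: Callable[[str], List[str]], words: List[str]) -> float:
    """Mean number of tokens per word over a word list."""
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)

def chunk3(word: str) -> List[str]:
    # Split into fixed 3-character chunks to mimic over-segmentation.
    return [word[i:i+3] for i in range(0, len(word), 3)]

english = ["the", "cat", "sat"]             # short words: 1 token each
turkish = ["geliyordum", "evlerimizden"]    # long agglutinative forms
print(fertility(chunk3, english))  # 1.0
print(fertility(chunk3, turkish))  # 4.0
```

Long agglutinative word forms inflate the average token count exactly as the formula predicts, even under this toy segmenter.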
The entropy rate of the tokenized corpus provides a lower bound on compression quality. For vocabulary $\mathcal{V}$ with unigram distribution $p$, the marginal entropy is:
$$H(\mathcal{V}) = -\sum_{v \in \mathcal{V}} p(v) \log p(v)$$
Optimal tokenizers maximize coverage (low unknown-token rate) while maintaining high entropy—avoiding both frequent UNK tokens and degenerate distributions dominated by a few super-tokens. Zouhar et al. (2023) show that the Rényi entropy $H_\alpha(\mathcal{V})$ for $\alpha \in (0,1)$ is a better predictor of downstream performance than $H_1$ alone, as it emphasizes the long tail of the vocabulary distribution that determines rare-word handling.
3.2 Subword Granularity and Model Perplexity
The relationship between tokenization granularity and language model perplexity is nonlinear. For a language model $P_\theta$ trained on token sequences, perplexity is:
$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^N \log P_\theta(v_i \mid v_{<i})\right)$$
Finer tokenization (smaller vocabulary, more tokens per word) lowers per-token perplexity trivially—individual characters are more predictable given context—but raises per-character or per-word perplexity when the model must attend over longer sequences to reconstruct semantic units. This is the granularity-length tradeoff: character-level models achieve perfect coverage but require architectures capable of long-range dependency, while word-level models minimize sequence length at the cost of vocabulary explosion and poor OOV handling.
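Because of this tradeoff, per-token perplexities from tokenizers of different granularity are not directly comparable. Normalizing to a per-character basis (the total log-likelihood is the same; only the normalizer changes) makes this visible, here with hypothetical toy numbers:

```python
# Sketch: converting per-token perplexity to per-character perplexity so
# tokenizers with different granularity are comparable (toy numbers).
def per_char_ppl(per_token_ppl: float, n_tokens: int, n_chars: int) -> float:
    # PPL_char = PPL_token ** (n_tokens / n_chars), since total
    # log-likelihood is fixed and only the averaging denominator changes.
    return per_token_ppl ** (n_tokens / n_chars)

# Coarse tokenizer: few, hard-to-predict tokens over 500 characters.
print(per_char_ppl(40.0, n_tokens=100, n_chars=500))   # ~2.09
# Fine tokenizer: many easy tokens; per-token PPL looks "better"...
print(per_char_ppl(6.0, n_tokens=400, n_chars=500))    # ~4.19, worse per char
```

The fine-grained tokenizer's lower per-token perplexity (6 vs. 40) masks a worse per-character perplexity, illustrating why per-token numbers flatter high-fertility segmentations.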
BPE occupies a middle ground calibrated by merge count $k$. With $k = 0$ merges, BPE reduces to character tokenization; with $k \rightarrow \infty$, it converges toward word-level tokenization. For transformer architectures with quadratic attention complexity $O(n^2 d)$, sequence length $n$ directly determines compute cost, creating a strong incentive for aggressive merging—but aggressive merging degrades rare-word representation and cross-lingual transfer.
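A minimal BPE training loop (simplified from the algorithm of Sennrich et al., 2016; toy corpus) makes the role of the merge count $k$ explicit: $k = 0$ leaves character tokens, and each merge coarsens the segmentation:

```python
# Sketch: minimal BPE training on a toy corpus; k merges of the most
# frequent adjacent symbol pair, starting from character-level symbols.
from collections import Counter

def train_bpe(words, k):
    vocab = Counter(tuple(w) for w in words)   # word -> symbol tuple, with counts
    merges = []
    for _ in range(k):
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for sym, freq in vocab.items():        # apply the merge everywhere
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = train_bpe(["low", "low", "lower", "lowest"], k=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
```

With `k=0` the corpus stays character-segmented; increasing `k` merges the shared stem `low` first, exactly the frequency-driven coarsening described above.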
The interaction with model capacity is critical: larger models can tolerate higher fertility by learning to compose representations across more tokens, while smaller models saturate their capacity on reconstruction tasks. This implies that the optimal vocabulary size is not a fixed constant but a function of model scale—an underexplored dimension of scaling law analyses.
3.3 Embedding Geometry and Token Boundary Effects
Tokenization directly shapes the geometry of the embedding space. When a semantic unit (e.g., the Turkish word geliyordum — “I was coming”) is split into multiple tokens, the composite meaning must be reconstructed by the transformer across attention layers. The first token in the split acquires a disproportionate representational burden as the “anchor” that downstream layers query, while suffix tokens carry residual morphological information that may or may not be recoverable depending on attention pattern formation.
Formally, for a word $w$ tokenized into $[v_1, v_2, \ldots, v_m]$, the attended representation at layer $\ell$ is:
$$\mathbf{h}_w^{(\ell)} = \text{Attn}^{(\ell)}(\mathbf{h}_{v_1}^{(\ell-1)}, \ldots, \mathbf{h}_{v_m}^{(\ell-1)})$$
The attention pattern $A_{ij} = \text{softmax}(\mathbf{q}_i \mathbf{k}_j^\top / \sqrt{d_k})$ must align across token boundaries to aggregate morphological features—a task not required for single-token words. Empirical probing studies show that grammatical features (case, number, tense) are better encoded in single-token word representations than in the averaged representations of multi-token splits, suggesting systematic degradation of morphosyntactic information in high-fertility languages.
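As an illustration under toy assumptions (random embeddings standing in for a trained layer, a single head, and mean pooling as the word-level readout), the aggregation in the equations above can be sketched as:

```python
# Sketch: one attention head aggregating a 3-token word split (e.g. a
# segmented agglutinative form) into a single pooled word vector.
# All weights are random toy values, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(3, d))            # h_{v_1..v_3} from the previous layer

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = H @ Wq, H @ Wk, H @ Wv

scores = Q @ K.T / np.sqrt(d)          # scaled dot-product scores
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1

h_word = (A @ V).mean(axis=0)          # pooled word representation h_w
print(h_word.shape)                    # (8,)
```

A single-token word skips this cross-boundary aggregation entirely, which is one way to read the probing result that multi-token splits encode morphosyntactic features less reliably.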
3.4 Cross-Lingual Fertility Imbalance in Multilingual Models
Multilingual models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) use shared vocabularies trained on multilingual corpora. The fertility imbalance problem arises because vocabulary allocation is proportional to corpus size, heavily favoring high-resource languages. A shared vocabulary of size $|\mathcal{V}|$ allocates effective capacity:
$$C_\ell \approx |\mathcal{V}| \cdot \frac{\text{corpus}_\ell}{\sum_{\ell'} \text{corpus}_{\ell'}}$$
which systematically under-represents low-resource and morphologically rich languages. The consequence is not merely sequence length overhead but representational fragmentation: concepts that are lexically unified in the source language map to disconnected token sequences in the shared vocabulary, weakening cross-lingual alignment in the embedding space.
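Plugging hypothetical corpus sizes into the allocation formula shows the scale of the imbalance (the token counts below are illustrative, not measured):

```python
# Sketch: effective vocabulary capacity per language under proportional
# allocation. Corpus sizes (in tokens) are hypothetical round numbers.
vocab_size = 250_000                     # shared vocabulary |V|
corpus = {"en": 300e9, "tr": 3e9, "fi": 2e9}
total = sum(corpus.values())

capacity = {lang: vocab_size * n / total for lang, n in corpus.items()}
for lang, c in capacity.items():
    print(lang, round(c))   # en dominates; tr and fi get a small slice each
```

Under these numbers English receives roughly two orders of magnitude more effective vocabulary than Turkish or Finnish, forcing the latter into high-fertility segmentations.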
Rust et al. (2021) quantified this by showing a near-linear relationship between fertility and cross-lingual NER/POS performance degradation across 14 languages, with fertility explaining $R^2 \approx 0.73$ of the variance in task performance gap relative to monolingual baselines. Vocabulary adaptation—replacing the shared vocabulary with a language-specific one while retaining transformer weights—recovered 60–80% of this gap without retraining.
4. Discussion
4.1 Tokenization as a First-Class Design Decision
The evidence reviewed above supports a reframing of tokenization from infrastructure to modeling choice. Current practice treats tokenizer selection as a fixed preprocessing step determined by the foundation model, but this conflates two distinct decisions: the pretraining tokenizer (optimized for the pretraining corpus and compute budget) and the task tokenizer (optimized for downstream task performance in a target domain and language).
These two objectives are not aligned by default. A tokenizer trained on English web text will over-segment Turkish morphology, under-represent Python identifiers, and fragment medical terminology—degrading performance in all three domains even when the underlying model has sufficient capacity. The appropriate response is not simply to use a larger model but to reconsider the tokenization boundary.
Practical interventions include vocabulary extension (adding domain-specific tokens to a pretrained vocabulary), vocabulary replacement (full tokenizer swap with weight re-initialization for new tokens), and tokenizer fine-tuning using domain-specific corpora. Each carries different costs: vocabulary extension is lightweight but limited; replacement requires re-embedding but preserves transformer weights; full retraining is expensive but coherent.
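A minimal sketch of the lightest intervention, vocabulary extension, assuming a toy embedding matrix and mean-initialization for the new rows (a common heuristic here, not a prescribed method):

```python
# Sketch: vocabulary extension -- append rows for new domain tokens to a
# pretrained embedding matrix. Vocabulary, sizes, and weights are toy
# values; mean-initialization of new rows is one heuristic among several.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "##iyor": 2}      # pretrained vocabulary
E = rng.normal(size=(len(vocab), 16))          # pretrained embedding matrix

new_tokens = ["metformin", "troponin"]         # hypothetical clinical terms
for t in new_tokens:
    vocab[t] = len(vocab)                      # assign next free id

mean_init = E.mean(axis=0, keepdims=True)      # initialize at embedding mean
E = np.vstack([E, np.repeat(mean_init, len(new_tokens), axis=0)])

print(E.shape)   # (5, 16): old rows untouched, new rows mean-initialized
```

The appeal of this route is that all pretrained rows and transformer weights are preserved; only the new embedding rows must be learned during fine-tuning.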
4.2 Implications for Evaluation
The benchmark saturation problem in NLP is partly a tokenization artifact. Tasks evaluated primarily on English with well-tokenized inputs provide an inflated view of model capability that does not transfer to low-resource settings. Tokenization-controlled ablations—where model architecture and training are held fixed while only the tokenizer varies—are underrepresented in the evaluation literature, making it difficult to isolate tokenization contributions to reported performance.
We argue for tokenizer-stratified benchmarks that explicitly report performance as a function of tokenizer fertility, vocabulary coverage, and segmentation consistency. This would make visible the hidden performance penalty paid by non-English and technical-domain users of English-centric models—a cost currently obscured by aggregate scores.
4.3 Learned and Adaptive Tokenization
A promising direction is end-to-end learned tokenization, where segmentation is jointly optimized with the language model objective. Byte-level models such as MegaByte (Yu et al., 2023) sidestep tokenization entirely by operating at the byte level with hierarchical architectures, trading higher sequence length for perfect coverage and zero out-of-vocabulary rate. The architecture recovers compute efficiency through patch-level processing, showing competitive performance on text generation benchmarks while eliminating the tokenization design problem entirely—at the cost of substantially increased architectural complexity.
Intermediate approaches include dynamic tokenization, where segmentation granularity adapts based on input domain or task, and soft tokenization, where character-level embeddings are pooled with learned aggregation weights rather than hard boundaries. These directions suggest that the field is converging toward treating tokenization as a continuous design space rather than a discrete algorithm choice.
5. Conclusion
Tokenization is a structural bottleneck in NLP pipelines whose effects on downstream performance are systematic, measurable, and frequently underestimated. Through the lens of fertility, vocabulary entropy, and embedding geometry, we have characterized three primary failure modes: over-segmentation of morphologically rich languages, vocabulary imbalance in multilingual settings, and token-boundary fragmentation of semantic units. The evidence from benchmark studies establishes that tokenization mismatch can account for performance gaps comparable to those attributed to model architecture choices—yet receives a fraction of the experimental attention.
The path forward requires treating tokenization as a first-class modeling decision subject to principled evaluation and task-specific optimization. Concretely, this means: (1) reporting tokenizer fertility alongside model results in multilingual evaluations; (2) conducting tokenizer ablations in domain adaptation studies; (3) exploring vocabulary adaptation as a lightweight intervention before full retraining; and (4) investing in learned tokenization schemes that remove the human-designed segmentation bottleneck entirely.
As models scale and are deployed across increasingly diverse linguistic and domain contexts, the assumption that English-centric tokenizers are universally adequate becomes progressively less defensible. Rigorous tokenization science is not merely a multilingual concern—it is a prerequisite for robust NLP at scale.
References
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL 2016. arXiv:1508.07909.
- Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of EMNLP 2018. arXiv:1808.06226.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. arXiv:1810.04805.
- Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., … & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of ACL 2020. arXiv:1911.02116.
- Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. Proceedings of ACL 2020. arXiv:2004.10964.
- Rust, P., Pfeiffer, J., Vulić, I., Ruder, S., & Gurevych, I. (2021). How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of ACL-IJCNLP 2021. arXiv:2012.15613.
- Xu, F. F., Alon, U., Neubig, G., & Hellendoorn, V. J. (2022). A Systematic Evaluation of Large Language Models of Code. Proceedings of MAPS 2022. arXiv:2202.13169.
- Zouhar, V., Meister, C., Gastaldi, J., Du, L., Sachan, M., & Cotterell, R. (2023). Tokenization and the Noiseless Channel. Proceedings of ACL 2023. arXiv:2306.16842.
- Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., & Lewis, M. (2023). MegaByte: Predicting Million-byte Sequences with Multiscale Transformers. Proceedings of NeurIPS 2023. arXiv:2305.07185.