Model Merging and Weight Interpolation: Task Vectors, SLERP, and the Geometry of Combining Fine-Tuned Language Models

Abstract

Model merging has emerged as a surprisingly effective technique for combining the capabilities of independently fine-tuned neural networks without additional training. Rather than treating model weights as opaque parameters, recent work frames fine-tuning as displacement in weight space—a task vector—enabling arithmetic operations that compose skills, eliminate undesired behaviors, or extrapolate beyond the training distribution. This paper surveys the theoretical foundations of weight interpolation, analyzes spherical linear interpolation (SLERP) as a geometrically principled alternative to linear blending, and examines task arithmetic, TIES-merging, and DARE as progressively more sophisticated approaches to resolving interference between task-specific weight directions. We discuss the conditions under which merging succeeds or fails, the role of the loss landscape’s geometry in enabling parameter-space superposition, and open questions regarding scaling, modality generalization, and the theoretical limits of gradient-free model composition.

1. Introduction

The conventional paradigm for adapting a pre-trained language model to multiple tasks requires either multi-task fine-tuning—which demands simultaneous access to all datasets—or sequential fine-tuning—which suffers from catastrophic forgetting. Both approaches are computationally expensive and operationally rigid. A fundamentally different strategy has gained traction: model merging, the direct manipulation of weight-space representations to combine independently trained checkpoints.

At its core, model merging exploits a structural observation about the fine-tuning process. When a pre-trained model is fine-tuned on task $\mathcal{T}$, the weight update $\tau = \theta_{ft} - \theta_{pre}$ encodes task-specific knowledge as a vector in parameter space. If these task vectors are approximately orthogonal across tasks—a condition that holds empirically with surprising regularity—then their superposition yields a model with multi-task competence without any gradient computation.

The appeal is practical as well as theoretical. Given a library of specialized fine-tuned checkpoints, practitioners can create hybrid models serving multiple downstream applications at the cost of a single inference pass. This unlocks a new class of model customization: post-hoc capability composition. Organizations can fine-tune models on proprietary datasets, share only task vectors (which are smaller than full checkpoints), and merge capabilities on demand.

This paper organizes the landscape of model merging along three dimensions: the interpolation geometry (linear versus spherical), the treatment of parameter interference (dense versus sparse merging), and the scope of composable operations (addition, negation, scaling). Section 2 surveys foundational and recent work. Section 3 provides a technical analysis of the principal merging algorithms. Section 4 discusses empirical findings and failure modes. Section 5 concludes with open research directions.

2. Related Work

The study of neural network weight interpolation has a long history, though its modern resurgence is tied to the scale and modularity of large pre-trained models.

Frankle et al. (2020) demonstrated that linear interpolation between two independently trained networks with shared initialization produces low-loss paths in weight space—a phenomenon they termed linear mode connectivity. This result was surprising because generic loss landscapes are non-convex, yet fine-tuned variants of the same base model inhabit a relatively flat basin where interpolation is benign. This observation laid the theoretical groundwork for merging by establishing that weight-space paths between related models do not necessarily cross high-loss barriers.

Wortsman et al. (2022) introduced model soups, showing that averaging multiple fine-tuned checkpoints—produced by different hyperparameter configurations of the same base model—reliably improves generalization over any individual checkpoint. The averaged model outperforms the best individual ingredient on both in-distribution and out-of-distribution benchmarks, suggesting that weight averaging implicitly performs a form of ensembling in function space while retaining the inference cost of a single model.

Ilharco et al. (2023) formalized task arithmetic, defining task vectors as the signed difference between fine-tuned and pre-trained weights: $\tau_i = \theta_i - \theta_0$. They demonstrated that arithmetic operations on task vectors—addition for multi-task composition, negation for capability removal, scaling for intensity control—transfer predictably to model behavior. Task addition merges capabilities; task negation suppresses behaviors; task scaling interpolates between the pre-trained and fine-tuned regimes along the task direction.

Yadav et al. (2023) identified parameter interference as the principal failure mode of naive task arithmetic, proposing TIES-merging (Trim, Elect Sign, Disjoint Merge). Their method trims small-magnitude parameters (which contribute noise rather than task signal), resolves sign conflicts by majority vote, and merges only parameters with consistent signs. TIES substantially improves over task arithmetic on challenging multi-task benchmarks.

Yu et al. (2024) proposed DARE (Drop And REscale), a complementary approach that randomly zeros out task vector parameters with probability $p$ and rescales the remainder by $1/(1-p)$ before merging. This stochastic pruning reduces interference without explicit sign conflict resolution, drawing an analogy to dropout during training. DARE can be combined with TIES for further gains.

Goddard et al. (2024) conducted a systematic empirical evaluation of merging strategies across language models ranging from 7B to 70B parameters, demonstrating that SLERP—applied to the full parameter vectors of two models—consistently outperforms linear interpolation for pairwise merging due to its preservation of weight vector norms.

3. Technical Analysis

3.1 Linear Interpolation and Its Limitations

The simplest merging strategy linearly interpolates between two weight tensors:

$$\theta_{merge} = (1 - \lambda)\, \theta_A + \lambda\, \theta_B, \quad \lambda \in [0,1]$$

While effective when models share a common base and occupy a flat loss basin, linear interpolation has a geometric deficiency: it does not preserve the norm of the interpolated vector. For unit vectors $\mathbf{u}$ and $\mathbf{v}$, the midpoint $(\mathbf{u} + \mathbf{v})/2$ has norm $\cos(\theta/2)$ where $\theta$ is the angle between them, strictly less than 1 when $\theta > 0$. Since weight norms encode scale information that affects layer outputs, this shrinkage introduces a systematic bias that compounds across layers.
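The shrinkage is easy to verify numerically. The following sketch uses random unit vectors as stand-ins for flattened weight tensors; the names (`lerp`, `predicted_norm`) are illustrative, not from any merging library:

```python
import numpy as np

def lerp(theta_a, theta_b, lam):
    """Linear interpolation between two flattened weight vectors."""
    return (1.0 - lam) * theta_a + lam * theta_b

rng = np.random.default_rng(0)

# Two random unit vectors standing in for flattened weight tensors.
u = rng.standard_normal(10_000)
u /= np.linalg.norm(u)
v = rng.standard_normal(10_000)
v /= np.linalg.norm(v)

midpoint = lerp(u, v, 0.5)

# Norm of the midpoint equals cos(theta/2), strictly below 1 for theta > 0.
angle = np.arccos(np.clip(u @ v, -1.0, 1.0))
predicted_norm = np.cos(angle / 2.0)
```

For high-dimensional random vectors the angle is near $\pi/2$, so the midpoint norm is close to $\cos(\pi/4) \approx 0.707$, a substantial contraction.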

3.2 Spherical Linear Interpolation (SLERP)

SLERP interpolates along the geodesic on the unit hypersphere, preserving norms throughout the path:

$$\text{SLERP}(\mathbf{u}, \mathbf{v}; \lambda) = \frac{\sin((1-\lambda)\Omega)}{\sin \Omega}\, \mathbf{u} + \frac{\sin(\lambda \Omega)}{\sin \Omega}\, \mathbf{v}$$

where $\Omega = \arccos(\mathbf{u} \cdot \mathbf{v})$ is the angle between the two vectors. When applied to full weight tensors (flattened into a single vector), SLERP maintains the angular trajectory between the two model checkpoints while preserving total parameter magnitude. In the limit $\Omega \to 0$, SLERP reduces to linear interpolation, ensuring numerical stability for nearly-identical models.

The norm-preservation property is particularly important for attention weight matrices, where the softmax operation is sensitive to the scale of logits. A layer with systematically smaller weight norms will produce softer attention distributions, effectively reducing the model’s ability to focus—a qualitative behavioral shift not present in either source model.
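A minimal implementation of the SLERP formula above, applied to unit-normalized flattened weight vectors, might look as follows (the epsilon fallback to linear interpolation handles the nearly-parallel case noted earlier):

```python
import numpy as np

def slerp(u, v, lam, eps=1e-7):
    """Spherical linear interpolation between two flattened weight vectors.

    Assumes u and v share a common norm (here: unit norm); falls back to
    linear interpolation when the vectors are nearly parallel, where
    sin(omega) underflows.
    """
    dot = np.clip((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0)
    omega = np.arccos(dot)
    if np.sin(omega) < eps:  # nearly identical models: SLERP -> LERP
        return (1.0 - lam) * u + lam * v
    return (np.sin((1.0 - lam) * omega) * u + np.sin(lam * omega) * v) / np.sin(omega)

rng = np.random.default_rng(1)
u = rng.standard_normal(4096)
u /= np.linalg.norm(u)
v = rng.standard_normal(4096)
v /= np.linalg.norm(v)

mid = slerp(u, v, 0.5)  # stays on the unit sphere, unlike the linear midpoint
```

In practice, merging toolkits apply this per-tensor or per-layer rather than to one giant flattened vector; the geometry is the same either way.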

3.3 Task Arithmetic: Vector Operations in Parameter Space

Task arithmetic treats parameter space as a vector space over task vectors $\tau_i = \theta_i - \theta_0$, where $\theta_0$ is the pre-trained base. The merged model is:

$$\theta_{merge} = \theta_0 + \sum_{i=1}^{n} \alpha_i \tau_i$$

The scaling coefficients $\alpha_i$ control the contribution of each task. Setting $\alpha_i = 1$ recovers full task fidelity (modulo interference); $\alpha_i < 1$ interpolates toward the base; $\alpha_i < 0$ performs negation, suppressing the task's learned behaviors. This last operation is particularly striking: one can fine-tune a model on toxic content and then subtract the resulting task vector from a target model, reducing toxicity without any safety fine-tuning of the target itself.

The success of task arithmetic rests on the approximate orthogonality of task vectors. If $\tau_i \perp \tau_j$, their sum preserves each component exactly. In practice, cosine similarities between task vectors from different domains (e.g., coding vs. sentiment analysis) are small but non-zero, and this residual interference accumulates across the $d$-dimensional parameter space (where $d \sim 10^9$ for modern LLMs). Managing this interference is the central challenge.
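The operations in this subsection reduce to a few lines of vector arithmetic. The sketch below constructs two hypothetical fine-tunes that perturb disjoint parameter blocks (i.e., exactly orthogonal task vectors) to show addition and negation composing cleanly; all names and the toy perturbation of 0.1 are illustrative:

```python
import numpy as np

def merge_task_vectors(theta_0, finetuned, alphas):
    """theta_0 + sum_i alpha_i * (theta_i - theta_0), on flattened checkpoints."""
    merged = theta_0.copy()
    for theta_i, alpha in zip(finetuned, alphas):
        merged += alpha * (theta_i - theta_0)
    return merged

rng = np.random.default_rng(2)
d = 1000
theta_0 = rng.standard_normal(d)

# Two hypothetical fine-tunes, each perturbing a disjoint block of parameters,
# giving exactly orthogonal task vectors.
theta_a = theta_0.copy()
theta_a[:500] += 0.1
theta_b = theta_0.copy()
theta_b[500:] += 0.1

# Addition composes both tasks; negation subtracts task A's direction.
merged = merge_task_vectors(theta_0, [theta_a, theta_b], alphas=[1.0, 1.0])
negated = merge_task_vectors(theta_0, [theta_a], alphas=[-1.0])
```

With orthogonal task vectors the merge recovers both fine-tunes exactly; with correlated vectors the shared components interfere, which motivates the methods of Sections 3.4 and 3.5.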

3.4 TIES-Merging: Resolving Sign Conflicts

TIES-merging proceeds in three stages for each scalar parameter $\theta^k$:

  1. Trim: Set $\tau_i^k = 0$ if $|\tau_i^k| < \epsilon_i$ (retain only the top-$p$ fraction by magnitude per task vector)
  2. Elect: Determine the dominant sign $\gamma^k = \text{sign}\left(\sum_i \tau_i^k\right)$
  3. Disjoint merge: Average only those task vectors that agree with the elected sign:
    $$\tau_{merged}^k = \frac{1}{|\mathcal{A}^k|} \sum_{i \in \mathcal{A}^k} \tau_i^k, \quad \mathcal{A}^k = \{i : \text{sign}(\tau_i^k) = \gamma^k\}$$

The trim step eliminates parameters with small task signal—which disproportionately contribute noise during merging—while retaining the high-magnitude directions that encode core task knowledge. The elect-and-merge steps ensure that destructive cancellation between opposite-sign parameters does not neutralize task-specific adaptations.

Empirically, trimming 80-90% of task vector parameters by magnitude retains most task performance while dramatically reducing interference, supporting the hypothesis that task knowledge is concentrated in a sparse subset of parameters—consistent with the lottery ticket hypothesis literature.
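The three stages above can be sketched directly in NumPy. This is an illustrative simplification of the published algorithm (per-parameter trimming by top-$k$ magnitude, sign election by the sign of the sum, disjoint mean over agreeing tasks), not a drop-in reimplementation of the authors' code:

```python
import numpy as np

def ties_merge(task_vectors, keep_frac=0.2):
    """Sketch of TIES-merging over a list of flattened task vectors."""
    tau = np.stack(task_vectors).astype(float)  # shape (n_tasks, d)
    # 1. Trim: per task vector, keep only the top-k entries by magnitude.
    k = max(1, int(keep_frac * tau.shape[1]))
    for i in range(tau.shape[0]):
        cutoff = np.sort(np.abs(tau[i]))[-k]  # k-th largest magnitude
        tau[i][np.abs(tau[i]) < cutoff] = 0.0
    # 2. Elect: dominant sign per parameter (sign of the summed trimmed values).
    gamma = np.sign(tau.sum(axis=0))
    # 3. Disjoint merge: mean over task vectors agreeing with the elected sign.
    agree = (np.sign(tau) == gamma) & (tau != 0)
    counts = np.maximum(agree.sum(axis=0), 1)  # avoid division by zero
    return (tau * agree).sum(axis=0) / counts
```

The merged task vector is then scaled and added to $\theta_0$ as in Section 3.3.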

3.5 DARE: Stochastic Pruning and Rescaling

DARE applies Bernoulli dropout to task vectors prior to merging. For drop probability $p$:

$$\hat{\tau}_i^k = \frac{m_i^k \cdot \tau_i^k}{1-p}, \quad m_i^k \sim \text{Bernoulli}(1-p)$$

The rescaling by $1/(1-p)$ is an unbiasedness condition analogous to inverted dropout in training, ensuring $\mathbb{E}[\hat{\tau}_i^k] = \tau_i^k$. The stochasticity introduces variance but reduces expected interference by randomly zeroing conflicting parameters. In expectation over the random mask, DARE approximates a rank-reduction of the task vectors by projecting onto a random sparse support.
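The drop-and-rescale step is a one-liner; the sketch below also checks the unbiasedness property empirically on a toy constant task vector (variable names are illustrative):

```python
import numpy as np

def dare(tau, p, rng):
    """Drop-And-REscale: zero each entry w.p. p, rescale survivors by 1/(1-p)."""
    mask = rng.random(tau.shape) >= p  # keep each entry with probability 1 - p
    return (mask * tau) / (1.0 - p)

rng = np.random.default_rng(3)
tau = np.ones(100_000)  # toy task vector
tau_hat = dare(tau, p=0.9, rng=rng)

# Unbiasedness: E[tau_hat] = tau, so the empirical mean should be close to 1
# even though 90% of the entries have been zeroed out.
empirical_mean = tau_hat.mean()
```

Note that high drop rates (the paper reports $p$ up to 0.9 and beyond working well) leave a very sparse rescaled vector, which is exactly what reduces overlap between task vectors at merge time.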

The theoretical connection to random projections is instructive: by the Johnson-Lindenstrauss lemma, random projection approximately preserves pairwise distances between $n$ points in $O(\log n / \epsilon^2)$ dimensions. DARE exploits this to preserve individual task vectors in expectation while reducing their mutual overlap in the projected (sparse) space.

3.6 Evolutionary and Learned Merging Recipes

Recent work has extended beyond fixed merging recipes toward optimization over merging hyperparameters. Evolutionary model merging (Akiba et al., 2024) treats the merging coefficients and layer-assignment decisions as a search problem, using evolutionary algorithms to find configurations that maximize performance on target tasks without gradient-based fine-tuning. This approach discovers non-obvious merging configurations, including cross-layer merging where attention weights from one model are combined with feed-forward weights from another.

4. Discussion

4.1 When Does Merging Work?

The empirical success of model merging is not universal, and understanding its failure conditions is as important as characterizing its successes. Three primary conditions govern mergeability:

Shared initialization: Models must share a common pre-trained base. Merging independently trained models with different architectures or random initializations fails because their weight spaces have no common frame of reference. Fine-tuned models from the same base occupy a shared loss basin precisely because they inherit the same coordinate system from pre-training.

Task diversity and orthogonality: Merging works best when task vectors are approximately orthogonal—i.e., when tasks require non-overlapping parameter modifications. Closely related tasks (e.g., sentiment classification on two different domains) produce highly correlated task vectors with significant interference. Distantly related tasks (e.g., mathematical reasoning and code generation) tend to have more orthogonal task vectors, making merging more effective.

Magnitude balance: Task vectors with vastly different norms produce imbalanced merges where the larger task dominates. Normalizing task vectors to unit norm before scaling by $\alpha_i$ is a common mitigation, though it discards information about fine-tuning intensity.
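The normalization mitigation mentioned above is a one-line transform; this hypothetical helper rescales each task vector to a common target norm before the $\alpha_i$ scaling is applied:

```python
import numpy as np

def normalize_task_vectors(task_vectors, target_norm=1.0):
    """Rescale each task vector to a common norm so that no single fine-tune
    dominates the merge by magnitude alone (discards fine-tuning intensity)."""
    return [target_norm * t / np.linalg.norm(t) for t in task_vectors]

balanced = normalize_task_vectors([np.ones(4), np.full(4, 3.0)])
```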

4.2 The Role of Loss Landscape Geometry

The theoretical underpinning of model merging is intimately tied to the geometry of the loss landscape around the pre-trained basin. Neyshabur et al. (2020) showed that models fine-tuned from a common pre-trained checkpoint remain close in parameter space and inhabit the same flat basin of the loss landscape. Combined with the related observation that fine-tuning updates have low intrinsic dimensionality relative to the ambient space, this suggests why merging works: if each task vector effectively occupies a low-dimensional subspace $V_i \subset \mathbb{R}^d$ with $\dim(V_i) \ll d$, then independently chosen subspaces are nearly orthogonal with high probability, and destructive interference is unlikely simply because $d$ is enormous relative to the information content of each task vector.

For GPT-class models with $d \sim 7 \times 10^9$ parameters and task vectors that span intrinsic dimensionalities of $\sim 10^4$–$10^5$, the ambient space is sufficiently high-dimensional to accommodate hundreds of approximately orthogonal task vectors—a form of parameter-space superposition directly analogous to the superposition hypothesis in mechanistic interpretability.
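The near-orthogonality of independent directions in high dimension can be demonstrated directly: random Gaussian vectors, used here as a proxy for independent task vectors, have pairwise cosine similarities concentrating around zero at rate $\sim 1/\sqrt{d}$. The dimensions below are scaled down from realistic LLM sizes for tractability:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 100_000   # stand-in ambient dimension (real LLMs: billions)
n_tasks = 50

# Random Gaussian directions as a proxy for independent task vectors.
tvs = rng.standard_normal((n_tasks, d))
tvs /= np.linalg.norm(tvs, axis=1, keepdims=True)

# Pairwise cosine similarities: off-diagonal entries concentrate near 0
# with standard deviation ~ 1/sqrt(d) ~ 0.003 here.
cos = tvs @ tvs.T
off_diag = cos[~np.eye(n_tasks, dtype=bool)]
max_abs_cos = float(np.abs(off_diag).max())
```

Real task vectors are not independent Gaussians—shared pre-training structure induces correlations—so this is an upper bound on how benign interference can be, not a guarantee.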

4.3 Task Negation and Safety Applications

Task negation—subtracting a task vector—has attracted attention for safety applications. Ilharco et al. demonstrated that subtracting a toxicity task vector (obtained by fine-tuning on toxic text) from a target model reduces toxic outputs without explicit safety training. More generally, subtracting a "harmful behavior" task vector (if one can be isolated) offers a post-hoc safety intervention. However, this approach faces the challenge of precisely isolating the task vector: fine-tuning for a specific behavior modifies parameters that also serve other functions, so negation may degrade unrelated capabilities.

This limitation reveals a fundamental tension in task arithmetic: task vectors are not perfectly modular. The pre-training loss landscape induces correlations between behavioral dimensions that prevent clean surgical removal of individual capabilities. Future work on disentangled fine-tuning representations may be necessary to fully realize negation as a safety tool.

4.4 Scaling Behavior and Large Models

An important empirical observation is that merging effectiveness appears to improve with model scale. Larger models seem to have higher effective intrinsic dimensionality, allowing more tasks to be merged before interference becomes significant. This may reflect the broader capability of large models to find lower-interference fine-tuning solutions, consistent with the observation that larger models achieve better fine-tuning performance with fewer parameter modifications (as evidenced by LoRA rank requirements scaling sublinearly with model size).

This scale-dependence creates an interesting practical dynamic: model merging is most valuable precisely where it works best—at scale—where the cost of multi-task fine-tuning is prohibitive.

4.5 LoRA Merging: A Special Case

When fine-tuning is performed with LoRA (Hu et al., 2022), task vectors have additional structure: $\tau = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$. Merging in the full parameter space (after materializing the product $BA$) discards this low-rank structure. Recent work on LoRA merging in the factored space (e.g., LoraHub; Huang et al., 2023) preserves the low-rank structure during composition, enabling merging without full weight materialization and reducing memory costs during the merge computation itself.
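One way to compose LoRA updates without ever forming a $d \times d$ matrix is to stack the factors: $\sum_i \alpha_i B_i A_i$ equals the product of a concatenated $B$ and a concatenated $A$ of rank $\sum_i r_i$. This sketch illustrates that identity; it is not LoraHub's actual algorithm, which additionally learns the coefficients via gradient-free optimization:

```python
import numpy as np

def merge_lora_factored(B_list, A_list, alphas):
    """Combine sum_i alpha_i * B_i A_i as one low-rank factor pair,
    without materializing any full-size weight delta."""
    B_merged = np.hstack([a * B for a, B in zip(alphas, B_list)])  # (d_out, sum r_i)
    A_merged = np.vstack(A_list)                                   # (sum r_i, d_in)
    return B_merged, A_merged

rng = np.random.default_rng(5)
d_out, d_in, r = 64, 48, 4  # toy dimensions
B1, A1 = rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))
B2, A2 = rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))

Bm, Am = merge_lora_factored([B1, B2], [A1, A2], alphas=[0.7, 0.3])
```

The merged adapter has rank at most $r_1 + r_2$, so memory and compute stay proportional to the adapters, not to the full weight matrices.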

5. Conclusion

Model merging has established itself as a legitimate and practically valuable paradigm for neural network composition. The theoretical framework of task arithmetic—treating fine-tuning as vector operations in weight space—provides a clean conceptual language for capability composition, negation, and scaling. SLERP addresses the geometric deficiencies of linear interpolation for pairwise merging; TIES and DARE address the interference problem for multi-task composition at scale.

Several open questions merit further investigation. First, the theoretical conditions for successful merging—approximate orthogonality, shared basin structure, magnitude balance—are well-characterized empirically but lack tight theoretical guarantees. A formal analysis connecting loss landscape curvature to merge-induced performance degradation would be valuable. Second, the interaction between merging and training methodology is poorly understood: does RLHF fine-tuning produce task vectors with different orthogonality properties than supervised fine-tuning? Third, the extension to non-language modalities (vision, audio, multimodal) has been explored but not systematically analyzed.

Perhaps most provocatively, model merging challenges the assumption that neural network capabilities are monolithic properties of a training run. If capabilities can be cleanly separated, combined, and subtracted as vectors, then the space of trained models has a compositional structure that current training paradigms do not explicitly exploit. Future work on training models for mergeability—optimizing for low-interference task vectors—may unlock a qualitatively more modular approach to capability development.

References

  • Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., & Farhadi, A. (2023). Editing models with task arithmetic. International Conference on Learning Representations (ICLR).
  • Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., … & Schmidt, L. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. International Conference on Machine Learning (ICML).
  • Yadav, P., Tam, D., Choshen, L., Raffel, C., & Bansal, M. (2023). TIES-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems (NeurIPS).
  • Yu, L., Yu, B., Yu, H., Huang, F., & Li, Y. (2024). Language models are Super Mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099.
  • Frankle, J., Dziugaite, G. K., Roy, D., & Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. International Conference on Machine Learning (ICML).
  • Akiba, T., Sano, M., Yanase, T., Ohta, T., & Koyama, M. (2024). Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187.
  • Neyshabur, B., Sedghi, H., & Zhang, C. (2020). What is being transferred in transfer learning? Advances in Neural Information Processing Systems (NeurIPS).
  • Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR).
  • Goddard, C., Siriwardhana, S., Ehghaghi, M., Meyers, L., Karpukhin, V., Benedict, B., … & Labrak, Y. (2024). Arcee’s MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257.
  • Huang, C., Liu, Q., Lin, B. Y., Pang, T., Du, C., & Lin, M. (2023). LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv preprint arXiv:2307.13269.
