Abstract
The transformer architecture has dominated sequence modeling for half a decade, but its $O(n^2)$ attention complexity imposes a hard ceiling on context length and throughput for long-sequence tasks. State space models (SSMs) offer a principled alternative rooted in control theory: they model sequences through a latent state that is updated recurrently, enabling $O(n)$ inference and $O(n \log n)$ parallel training via convolutional representations. Recent work, culminating in the Mamba architecture (Gu & Dao, 2023), introduced input-dependent (selective) state transitions that allow SSMs to perform content-based filtering analogous to attention — without quadratic cost. This paper provides a rigorous treatment of the SSM framework: we derive the continuous-time and discretized forms, analyze the HiPPO initialization for long-range memory, examine the selectivity mechanism in Mamba, and characterize empirical tradeoffs between SSMs and transformers on language modeling benchmarks. We survey S4, H3, Hyena, and Mamba, and discuss open questions regarding SSM expressiveness, hybrid architectures, and the theoretical relationship to attention.
1. Introduction
Sequence modeling lies at the core of natural language processing, time-series forecasting, audio synthesis, and biological sequence analysis. The transformer (Vaswani et al., 2017) addresses this problem through self-attention: each position attends to all others with learned query-key-value projections, capturing arbitrary pairwise dependencies in a single layer. This expressiveness has proven enormously powerful — essentially every state-of-the-art language model today is transformer-based — but it comes with a fundamental computational cost: the attention matrix is $O(n^2)$ in sequence length $n$, both in memory and in multiply-accumulate operations for each forward pass.
For sequences of length $n \leq 4096$, this cost is manageable. For $n = 100\text{k}$ or $n = 1\text{M}$ — increasingly relevant for document-level reasoning, genomics, and audio — quadratic attention becomes a critical bottleneck. Sparse attention (Beltagy et al., 2020; Zaheer et al., 2020), linear attention (Katharopoulos et al., 2020), and sliding window approaches (Jiang et al., 2023) trade recall of distant context for efficiency. State space models offer a different path: they maintain linear time complexity while preserving a principled mechanism for propagating information across arbitrarily long sequences via a recurrent latent state.
The modern SSM line of research begins with S4 (Gu et al., 2022), which was the first to show that structured state space layers initialized with HiPPO matrices can deliver competitive language modeling performance. Subsequent work (H3, Hyena, Mamba) progressively addressed the expressiveness gap with attention, culminating in Mamba’s selective state space mechanism that conditions state transition matrices on the input — effectively giving the model the ability to decide, at each step, how much to remember and what to forget.
This paper is organized as follows. Section 2 reviews related work on efficient sequence models. Section 3 derives the SSM formalism from continuous-time linear dynamical systems through discretization and the convolution perspective. Section 4 analyzes the HiPPO initialization and its role in enabling long-range memory. Section 5 presents the Mamba selective SSM mechanism in detail. Section 6 discusses architectural considerations and empirical tradeoffs. Section 7 identifies open problems.
2. Related Work
Vaswani et al. (2017) introduced the transformer with multi-head self-attention and positional encodings. Despite its quadratic complexity, the architecture’s hardware parallelism and expressiveness enabled unprecedented scaling, establishing the foundation of modern LLMs.
Katharopoulos et al. (2020) proposed linear transformers by rewriting the softmax attention kernel as a feature-map decomposition $\phi(Q_i)^\top \phi(K_j)$, enabling the KV product to be accumulated as a running outer product sum. This achieves $O(n)$ time but at the cost of expressiveness: linear attention cannot represent sharp, peaked attention distributions as effectively as softmax attention.
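To make the reformulation concrete, the following is a minimal NumPy sketch (not code from the cited paper; the feature map used here is an arbitrary positive function standing in for $\phi$, and the function name is illustrative) of causal linear attention computed with a single running outer-product accumulator:

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi=lambda z: np.maximum(z, 0.0) + 1e-6):
    """Causal linear attention as a running outer-product sum.

    Q, K: (n, d_k); V: (n, d_v). phi is a positive feature map standing in
    for the softmax kernel (an illustrative choice, not the paper's exact map).
    """
    n, d_v = V.shape
    d_k = Q.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k_j) v_j^T
    z = np.zeros(d_k)          # running sum of phi(k_j), for normalization
    out = np.zeros((n, d_v))
    for t in range(n):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-9)
    return out
```

Because the accumulators `S` and `z` have fixed size, each step costs $O(d_k d_v)$ regardless of how far back the context extends, which is the source of the overall $O(n)$ cost.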
Gu et al. (2022) introduced the Structured State Space sequence model (S4), demonstrating that a class of linear recurrent models initialized with HiPPO theory can achieve competitive accuracy on long-range dependency benchmarks (Long Range Arena) while running in linear time. S4’s HiPPO-LegS initialization was the key insight enabling stable gradients over thousands of timesteps.
Fu et al. (2023) proposed H3, a hybrid architecture interleaving SSM layers with a small number of attention layers, showing that combining the long-range recall of SSMs with the local pattern-matching of attention yields models competitive with pure transformers on language tasks.
Poli et al. (2023) introduced Hyena, which replaces attention with a hierarchy of long convolutions parameterized by implicit neural networks, achieving sub-quadratic complexity with competitive language modeling results and demonstrating that SSM-like recurrences are not the only viable alternative to attention.
Gu & Dao (2023) introduced Mamba, the most prominent SSM to date, which incorporates input-dependent state transition matrices (selective SSM) and a hardware-aware parallel scan algorithm. Mamba-3B outperforms Transformer++ at the same parameter count on several language benchmarks and achieves linear-time inference with constant-memory recurrent mode.
3. Technical Analysis
3.1 Continuous-Time State Space Models
A continuous-time linear state space model is defined by the system of differential equations:
$$\dot{h}(t) = Ah(t) + Bx(t)$$
$$y(t) = Ch(t) + Dx(t)$$
where $x(t) \in \mathbb{R}$ is the scalar input, $h(t) \in \mathbb{R}^N$ is the latent state, $y(t) \in \mathbb{R}$ is the output, and $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, $D \in \mathbb{R}$ are learnable parameters. The $D$ term (skip connection) is often set to zero or treated separately. The matrix $A$ governs the dynamics of the latent state: its eigenvalue structure determines the timescales of memory decay.
For discrete sequences, we must discretize this system. Using the zero-order hold (ZOH) method with step size $\Delta$ (the timescale parameter):
$$\bar{A} = e^{\Delta A}, \quad \bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I) \cdot \Delta B$$
The bilinear approximation, used in S4, gives the simpler:
$$\bar{A} = \left(I - \frac{\Delta}{2}A\right)^{-1}\left(I + \frac{\Delta}{2}A\right), \quad \bar{B} = \left(I - \frac{\Delta}{2}A\right)^{-1} \Delta B$$
The resulting discrete recurrence is:
$$h_t = \bar{A}h_{t-1} + \bar{B}x_t, \quad y_t = Ch_t$$
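As a concrete reference, here is a minimal NumPy/SciPy sketch of both discretizations and the resulting recurrence (helper names such as `discretize_zoh` are illustrative, and $A$ is assumed invertible for the ZOH formula):

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^{-1}(exp(dA) - I) dB.
    Assumes A is invertible (true for HiPPO-style initializations)."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def discretize_bilinear(A, B, delta):
    """Bilinear (Tustin) transform, as used in S4."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (delta / 2.0) * A)
    return inv @ (I + (delta / 2.0) * A), inv @ (delta * B)

def ssm_recurrent(A_bar, B_bar, C, x):
    """Run h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t over a scalar sequence x."""
    h = np.zeros((A_bar.shape[0], 1))      # B_bar: (N, 1), C: (1, N)
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append(float(C @ h))
    return np.array(ys)
```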
3.2 The Convolutional Perspective and Efficient Training
Unrolling the discrete recurrence from $h_0 = 0$ reveals that the output $y_t$ is a linear function of all inputs $x_0, \ldots, x_t$:
$$y_t = \sum_{k=0}^{t} C\bar{A}^k \bar{B} x_{t-k} = (K * x)_t$$
where $K \in \mathbb{R}^L$ is the SSM kernel: $K_k = C\bar{A}^k\bar{B}$ for $k = 0, 1, \ldots, L-1$. The entire output sequence can therefore be computed as a single discrete convolution $y = K * x$, evaluated in $O(L \log L)$ time via the Fast Fourier Transform. This is the key insight that enables parallel training: despite being a recurrent model at inference time, an SSM with time-invariant parameters trains as a 1D convolutional model.
The memory cost of storing the kernel $K$ grows as $O(L)$, and the FFT-based convolution requires $O(L \log L)$ operations and $O(L)$ memory — both far superior to the $O(L^2)$ scaling of attention.
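The kernel materialization and the FFT convolution can be sketched as follows (NumPy, illustrative only; S4 itself never forms $\bar{A}^k$ by naive powering but computes the kernel through structure imposed on $A$):

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """Materialize K_k = C A_bar^k B_bar for k = 0..L-1 (naive O(L N^2) loop)."""
    K = np.empty(L)
    v = B_bar.copy()                 # holds A_bar^k B_bar
    for k in range(L):
        K[k] = float(C @ v)
        v = A_bar @ v
    return K

def causal_fft_conv(K, x):
    """Causal convolution y_t = sum_k K_k x_{t-k}, via FFT with zero-padding
    to avoid circular wrap-around."""
    L = len(x)
    n_fft = 2 * L
    y = np.fft.irfft(np.fft.rfft(K, n_fft) * np.fft.rfft(x, n_fft), n_fft)
    return y[:L]
```

Up to floating-point error, `causal_fft_conv(ssm_kernel(A_bar, B_bar, C, L), x)` matches the step-by-step recurrence from the previous sketch; this equivalence is exactly the train-as-convolution, infer-as-recurrence duality.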
3.3 HiPPO Initialization for Long-Range Memory
Naively initializing $A$ at random leads to poor gradient behavior over long sequences: when the eigenvalues of the discretized $\bar{A}$ lie well inside the unit circle the state forgets exponentially fast, and when they lie outside it activations and gradients explode. The HiPPO (High-order Polynomial Projection Operators) framework (Gu et al., 2020) provides a principled initialization for $A$, derived from the requirement that $h(t)$ optimally represents the history of $x(s)$ for $s \leq t$ as a projection onto a basis of orthogonal polynomials.
For the Legendre polynomial basis (HiPPO-LegS), the continuous-time matrix $A$ has the closed form:
$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & n > k \\ n+1 & n = k \\ 0 & n < k \end{cases}$$
This initialization ensures that the SSM kernel $K_k = C\bar{A}^k\bar{B}$ decays polynomially rather than exponentially with $k$, enabling gradient flow and memory retention over hundreds to thousands of timesteps. The HiPPO-LegS initialization is the theoretical foundation that made S4’s empirical success on Long Range Arena possible.
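The closed form above translates directly into code (a small NumPy sketch; in practice S4 does not use this dense matrix as-is but a structured decomposition of it):

```python
import numpy as np

def hippo_legs(N):
    """Dense HiPPO-LegS state matrix A (N x N), following the closed form above."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = -(n + 1)
            # n < k: entry stays zero (lower-triangular structure)
    return A
```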
3.4 Mamba’s Selective State Space Mechanism
The core limitation of time-invariant SSMs (S4 and its predecessors) is that the matrices $\bar{A}$ and $\bar{B}$ do not depend on the input $x_t$. This means the model applies the same dynamics regardless of content, limiting its ability to selectively attend to or ignore specific tokens — a capability that attention achieves through its $QK$ dot product. Mamba introduces selectivity by making $B$, $C$, and $\Delta$ functions of the input:
$$B_t = s_B(x_t), \quad C_t = s_C(x_t), \quad \Delta_t = \text{Softplus}(s_\Delta(x_t) + b_\Delta)$$
where $s_B$, $s_C$, $s_\Delta$ are linear projections of the input and $b_\Delta$ is a learnable bias. The resulting model has input-dependent transition matrices $\bar{A}_t = e^{\Delta_t A}$ and $\bar{B}_t = (e^{\Delta_t A} - I)A^{-1}B_t$. This breaks the convolutional interpretation: a selective SSM cannot be computed as a single global convolution, because the kernel $K_k$ now varies with position.
Mamba addresses this computational challenge with a parallel selective scan (a parallel prefix scan over the recurrence). For a sequence of state updates $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$, all states can be computed via the associative binary operation $(a_1, b_1) \oplus (a_2, b_2) = (a_2 a_1,\ a_2 b_1 + b_2)$ over the $(\bar{A}_t, \bar{B}_t x_t)$ pairs; a work-efficient parallel scan evaluates every prefix in $O(\log n)$ depth with $O(n)$ total work. Mamba implements this with a hardware-aware algorithm that avoids materializing the full state sequence in HBM (GPU high-bandwidth memory), instead computing the scan in SRAM and achieving memory efficiency comparable to FlashAttention.
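A toy illustration of the scan's algebra (NumPy; $\bar{A}_t$ is taken to be diagonal, as in Mamba, so products are elementwise, and the reduction shown here is sequential rather than the logarithmic-depth parallel version):

```python
import numpy as np
from functools import reduce

def combine(e1, e2):
    """Apply e1 then e2: h -> A2*(A1*h + b1) + b2 = (A2*A1)*h + (A2*b1 + b2)."""
    A1, b1 = e1
    A2, b2 = e2
    return (A2 * A1, A2 * b1 + b2)

def states_sequential(As, bs):
    """Reference: h_t = A_t * h_{t-1} + b_t computed left to right (h_{-1} = 0)."""
    h = np.zeros_like(bs[0])
    out = []
    for A_t, b_t in zip(As, bs):
        h = A_t * h + b_t
        out.append(h)
    return out

def state_via_reduce(t, As, bs):
    """h_t obtained by one associative reduction of the first t+1 pairs.

    Because `combine` is associative, a real implementation evaluates all
    prefixes at once with a logarithmic-depth parallel prefix scan instead
    of this O(t) left fold."""
    _, b_tot = reduce(combine, zip(As[: t + 1], bs[: t + 1]))
    return b_tot                     # h_{-1} = 0, so h_t = A_tot * 0 + b_tot
```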
The selectivity mechanism provides an intuitive analog to attention: when $\Delta_t$ is large, $\bar{A}_t \approx 0$ and the current input $x_t$ dominates the state update (the model “focuses” on the current token, discarding history). When $\Delta_t$ is small, the previous state $h_{t-1}$ is preserved nearly unchanged (the model “ignores” the current token). This binary behavior — attend or pass — is reminiscent of forget and input gates in LSTMs, but derived from a continuous-time dynamical systems perspective and extended to an $N$-dimensional state space.
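A sequential reference of the full selective recurrence is sketched below (NumPy, under simplifying assumptions: diagonal negative $A$, exact ZOH discretization for $\bar{B}$, and illustrative parameter names that do not mirror Mamba's actual layer layout):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_B, W_C, W_delta, b_delta):
    """Sequential reference for a selective SSM over an (L, d) input sequence.

    A: (d, N) negative reals (diagonal dynamics per channel);
    W_B, W_C: (d, N) projections producing B_t, C_t from x_t;
    W_delta: (d, d) projection and b_delta: (d,) bias for the step size.
    All names are illustrative placeholders, not Mamba's parameter layout.
    """
    L, d = x.shape
    N = A.shape[1]
    h = np.zeros((d, N))
    y = np.empty((L, d))
    for t in range(L):
        xt = x[t]                                   # (d,)
        delta = softplus(xt @ W_delta + b_delta)    # (d,)  input-dependent step size
        B_t = xt @ W_B                              # (N,)  input-dependent input projection
        C_t = xt @ W_C                              # (N,)  input-dependent output projection
        A_bar = np.exp(delta[:, None] * A)          # (d, N); large delta -> A_bar ~ 0
        B_bar = (A_bar - 1.0) / A * B_t[None, :]    # (exp(dA) - I) A^{-1} B_t, elementwise
        h = A_bar * h + B_bar * xt[:, None]         # small delta keeps h nearly unchanged
        y[t] = h @ C_t
    return y
```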
4. Discussion
4.1 Expressiveness: Can SSMs Replace Attention?
Despite Mamba’s strong empirical results, theoretical analysis suggests that selective SSMs and attention are not equivalent in expressive power. Attention handles associative recall exactly: given a key-value store embedded in the context, a single attention head can retrieve the value associated with any query in one forward pass. SSMs, as linear recurrences, must compress all relevant history into a fixed $N$-dimensional state vector. For tasks requiring precise recall of distant tokens — such as copying a specific string from thousands of tokens ago — the SSM state must have sufficient capacity, and empirical results suggest this is a genuine limitation at moderate state sizes.
Recent work on Mamba-2 (Dao & Gu, 2024) formalizes this connection, showing that the selective SSM can be written as a form of attention with a structured mask, unifying SSMs and attention under the framework of structured state space duality. This equivalence enables hardware-efficient implementations and clarifies that Mamba-2 is, in a precise sense, a restricted form of masked attention — one whose attention matrix is constrained to be semiseparable, with entries generated by cumulative products of the input-dependent decay terms.
4.2 Hybrid Architectures
The H3 and Jamba (Lieber et al., 2024) architectures suggest that hybrid models — combining a majority of SSM layers with a small number of full attention layers — may achieve the best of both worlds: the linear-time throughput of SSMs for the bulk of computation, with the unrestricted associative recall of attention at strategic positions. Jamba, a 52B parameter hybrid Mamba-Transformer model, demonstrates that this design achieves competitive language modeling quality while fitting in GPU memory at context lengths that would be infeasible for pure transformers of equivalent size.
4.3 Practical Performance
At matched parameter counts and training budgets, Mamba achieves lower perplexity than standard transformers on certain benchmarks (The Pile, SlimPajama) and competitive performance on others. The primary practical advantages are (a) constant-memory inference (the recurrent state $h_t$ is $O(Nd)$ independent of sequence length, enabling unbounded context windows without KV-cache growth), and (b) higher throughput at long sequence lengths. The primary disadvantage is that the selective scan is less hardware-optimized than FlashAttention on current accelerators, and SSMs are harder to scale effectively — the optimal state dimension $N$, number of heads, and $\Delta$ parameterization remain active research questions.
5. Conclusion
State space models represent a theoretically grounded and practically competitive alternative to attention-based transformers for sequence modeling. The progression from S4’s HiPPO-initialized time-invariant recurrences to Mamba’s selective input-dependent dynamics demonstrates that the expressiveness gap with attention can be substantially closed while preserving linear-time complexity. The structured state space duality perspective emerging from Mamba-2 suggests that SSMs and attention are not fundamentally distinct classes but rather points in a broader design space of structured linear recurrences.
Key open problems include: (1) a complete theoretical characterization of the tasks where SSMs are provably weaker than attention (and vice versa); (2) optimal co-design of state dimension, selectivity parameterization, and training curriculum; (3) long-context understanding benchmarks that genuinely require recall beyond the effective memory capacity of SSM state vectors; (4) hardware-level co-design (custom kernels, memory hierarchy optimization) to bring SSM inference throughput closer to the ceiling set by attention-optimized hardware. As context length requirements grow with new applications, SSMs — and hybrid SSM-attention designs — are likely to occupy an increasingly central role in the practical LLM deployment landscape.
References
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
- Gu, A., Goel, K., & Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022.
- Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020). HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS 2020.
- Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., & Ré, C. (2023). Hungry Hungry Hippos: Towards Language Modeling with State Space Models. ICLR 2023.
- Poli, M., Massaroli, S., Nguyen, E., Fu, D. Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., & Ré, C. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML 2023.
- Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024.
- Lieber, O., Lenz, B., Bata, H., et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887.