Abstract: Autoregressive language models have achieved remarkable performance across natural language tasks, yet their sequential generation process imposes fundamental latency…
Abstract: Transformer-based language models exhibit a fundamental limitation: degraded performance on sequences longer than those encountered during training. This constraint…
Abstract: Model merging has emerged as a surprisingly effective technique for combining the capabilities of independently fine-tuned neural networks without…
Abstract: Large language models equipped with extended context windows exhibit systematic, position-dependent performance degradation that is not explained by perplexity…
Abstract: Layer normalization (LayerNorm) is a cornerstone component of modern transformer architectures, yet its placement relative to attention and feed-forward…
Abstract: The Lottery Ticket Hypothesis (LTH), introduced by Frankle and Carbin (2019), proposes that dense neural networks contain sparse subnetworks—termed…