Abstract: Autoregressive language models have achieved remarkable performance across natural language tasks, yet their sequential generation process imposes fundamental latency…
Abstract: Transformer-based language models exhibit a fundamental limitation: degraded performance on sequences longer than those encountered during training. This constraint…
Abstract: Model merging has emerged as a surprisingly effective technique for combining the capabilities of independently fine-tuned neural networks without…
Abstract: Large language models equipped with extended context windows exhibit systematic, position-dependent performance degradation that is not explained by perplexity…
Abstract: Layer normalization (LayerNorm) is a cornerstone component of modern transformer architectures, yet its placement relative to attention and feed-forward…
Abstract: The Lottery Ticket Hypothesis (LTH), introduced by Frankle and Carbin (2019), proposes that dense neural networks contain sparse subnetworks—termed…