Scaled Dot-Product Attention as Entropic Optimal Transport
TL;DR: The design of Softmax Attention has largely been intuitive and engineering-driven, lacking a firm mathematical basis. This post presents a functional-analysis perspective, framing the Softmax function as the exact solution to an entropy-regularized optimal transport problem. This opens up a new theoretical angle for designing attention mechanisms.

Derivation: Scaled Dot-Product Attention as Entropic Optimal Transport

The Softmax function inherent in modern attention mechanisms is not an arbitrary heuristic. It is the unique solution to an entropy-regularized optimization problem, balancing the maximization of similarity against the maximization of distributional uncertainty (entropy). ...
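To make that claim concrete before the full derivation, here is a minimal sketch of the single-query case. The notation is my own (scores s_j, attention weights p over the probability simplex, regularization strength ε), not the post's: maximizing similarity plus ε times the Shannon entropy yields exactly the Softmax.

```latex
% Minimal sketch (single query): entropy-regularized problem over the simplex.
\begin{aligned}
% Objective: similarity plus \varepsilon-weighted Shannon entropy.
p^\star &= \arg\max_{p \in \Delta^{n}} \;
  \sum_{j=1}^{n} p_j s_j
  + \varepsilon \underbrace{\Big(-\sum_{j=1}^{n} p_j \log p_j\Big)}_{\text{entropy } H(p)} \\[4pt]
% Stationarity of the Lagrangian, with multiplier \lambda for \sum_j p_j = 1:
0 &= s_j - \varepsilon\,(\log p_j^\star + 1) - \lambda
  \;\;\Longrightarrow\;\; p_j^\star \propto \exp(s_j / \varepsilon) \\[4pt]
% Normalizing over j recovers the Softmax with temperature \varepsilon:
p_j^\star &= \frac{\exp(s_j / \varepsilon)}{\sum_{k=1}^{n} \exp(s_k / \varepsilon)}
          = \operatorname{softmax}(s / \varepsilon)_j .
\end{aligned}
```

With ε = 1 and scores s_j = qᵀk_j / √d_k, this recovers the standard scaled dot-product attention weights; strict concavity of the entropy term is what makes the maximizer unique.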