TL;DR: Linear Attention fails when attention distributions are “peaky” (high dynamic range) because it acts as a low-pass filter. Sparse Attention instead approximates Softmax by keeping an Active Set ($\mathcal{A}_i$) of high-scoring keys and discarding a Pruned Set ($\mathcal{A}_i^c$). The approximation error is governed by the Residual Weight ($\epsilon_i$), the probability mass lost to the pruned set. We model this with a Signal-to-Noise Ratio (SNR): high SNR (a peaky distribution) yields low error, while low SNR (a flat distribution) leads to significant bias. The method is exponentially sensitive to Recall Failure: missing even a single high-scoring token (so that $s_{\max}^c \approx m_i$) collapses the SNR and makes the error spike. ...
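
To make these quantities concrete, here is a small NumPy sketch (my own illustration, not code from the post): the function `sparse_softmax_error`, the top-$k$ selection rule in `top_k_mask`, and the use of total-variation distance as the error metric are all assumptions made for the demo, not part of the original method.

```python
import numpy as np

def sparse_softmax_error(scores, active):
    """Compare the exact softmax weights for one query with a sparse version
    that keeps only the active set and renormalizes over it.  Returns the
    residual weight (softmax mass falling in the pruned set) and the
    total-variation distance between the two distributions."""
    w = np.exp(scores - scores.max())          # numerically stable weights
    exact = w / w.sum()                        # full softmax
    kept = np.where(active, w, 0.0)
    sparse = kept / kept.sum()                 # renormalize over the active set
    residual = exact[~active].sum()            # eps_i: mass lost to pruning
    return residual, 0.5 * np.abs(exact - sparse).sum()

def top_k_mask(scores, k=8):
    """Toy selection rule for the active set: keep the k highest scores."""
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

rng = np.random.default_rng(0)

# Peaky scores (high SNR): two keys dominate, so top-k catches nearly all of
# the softmax mass and both the residual weight and the error stay tiny.
peaky = np.concatenate(([8.0, 7.5], rng.normal(0.0, 0.5, 62)))
print("peaky         :", sparse_softmax_error(peaky, top_k_mask(peaky)))

# Flat scores (low SNR): the pruned set still holds real probability mass,
# so the sparse distribution is visibly biased.
flat = rng.normal(0.0, 0.5, 64)
print("flat          :", sparse_softmax_error(flat, top_k_mask(flat)))

# Recall failure: knock the single highest-scoring key out of the active set
# (s_max^c ~ m_i); the residual weight, and with it the error, spikes.
miss = top_k_mask(peaky)
miss[np.argmax(peaky)] = False
print("recall failure:", sparse_softmax_error(peaky, miss))
```

For this renormalize-over-the-active-set form, the total-variation distance works out to exactly the residual weight, which is one concrete sense in which $\epsilon_i$ alone controls the quality of the approximation.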
Analyzing Linear Attention as a First-Order Approximation of Softmax
TL;DR: This post explores why linear attention can be viewed as a first-order Taylor approximation of the exponential inside the standard Softmax ($e^x \approx 1+x$). We show that this approximation is accurate only when the attention scores for a given query are tightly clustered near their maximum value. The approximation error grows quadratically with the range of the scores, which explains why linear attention often fails to match Softmax performance. We propose that filtering out low-scoring keys before applying the linear approximation yields a hybrid mechanism that constrains the score range, improving accuracy while retaining some of the efficiency gains. ...
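
As a rough illustration of the claim, the NumPy sketch below (a toy example under stated assumptions, not the post's actual mechanism) compares exact softmax weights with the first-order expansion around the maximum score as the score range widens. The threshold filter `filtered_linear_weights` and its parameter `tau` are my own stand-ins for "filtering out low-scoring keys".

```python
import numpy as np

def softmax_weights(scores):
    """Exact softmax attention weights for one query."""
    w = np.exp(scores - scores.max())
    return w / w.sum()

def filtered_linear_weights(scores, tau=1.0):
    """Hybrid sketch: drop keys scoring more than `tau` below the max, then
    apply e^x ~= 1 + x to the survivors.  The expansion only ever sees
    x in [-tau, 0], so each key's unnormalized Taylor error is at most
    tau**2 / 2; the other error source is the softmax mass the filter drops."""
    keep = scores >= scores.max() - tau
    w = np.where(keep, 1.0 + scores - scores.max(), 0.0)
    return w / w.sum()

rng = np.random.default_rng(0)
for std in (0.1, 0.3, 1.0):                      # widen the score range
    s = rng.normal(0.0, std, size=64)            # raw scores for one query
    x = s - s.max()                              # expansion variable, x <= 0
    exact = softmax_weights(s)

    taylor_err = np.abs(np.exp(x) - (1.0 + x)).max()   # worst per-key error
    quad_bound = (s.max() - s.min()) ** 2 / 2          # x**2/2 remainder bound
    negatives = int((1.0 + x < 0).sum())               # where 1+x stops being a weight
    hybrid_tv = 0.5 * np.abs(exact - filtered_linear_weights(s)).sum()

    print(f"std={std:>4}: range={s.max() - s.min():5.2f}  "
          f"Taylor err={taylor_err:.3f} (bound {quad_bound:.2f})  "
          f"negative weights={negatives:2d}  hybrid TV err={hybrid_tv:.3f}")
```

With a fixed $\tau$, the per-key linearization error stays bounded no matter how wide the raw score range gets; the trade-off is that a tighter filter prunes more softmax mass, which is exactly the residual-weight issue discussed in the sparse-attention summary above.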