Analyzing Sparse Attention Approximation With SNR
TL;DR: Linear Attention fails when attention distributions are "peaky" (high dynamic range) because it acts as a low-pass filter. Sparse Attention approximates Softmax by keeping an Active Set ($\mathcal{A}_i$) of high-scoring tokens and discarding a Pruned Set ($\mathcal{A}_i^c$). The approximation error is governed by the Residual Weight ($\epsilon_i$), the probability mass lost to the pruned set. We model this with a Signal-to-Noise Ratio (SNR): high SNR (a peaky distribution) yields low error, while low SNR (a flat distribution) introduces significant bias. Sparse Attention is exponentially sensitive to Recall Failure: missing even a single high-scoring token (where $s_{\max}^c \approx m_i$) causes the SNR to collapse and the error to spike. ...
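To make these quantities concrete, here is a minimal NumPy sketch under illustrative assumptions (the helpers `residual_weight` and `sparse_output` are hypothetical names, not from any library): it compares a full-softmax attention output against a sparse approximation on a peaky score vector, once with the top scores retained (high recall) and once after a simulated recall failure that prunes the maximum-scoring token.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over raw attention scores.
    z = s - s.max()
    e = np.exp(z)
    return e / e.sum()

def residual_weight(scores, active_idx):
    # epsilon_i: softmax mass falling in the pruned set A_i^c.
    p = softmax(scores)
    keep = np.zeros(scores.shape, dtype=bool)
    keep[active_idx] = True
    return p[~keep].sum()

def sparse_output(scores, values, active_idx):
    # Softmax renormalized over the active set A_i only.
    w = softmax(scores[active_idx])
    return w @ values[active_idx]

rng = np.random.default_rng(0)
n, d = 64, 8
values = rng.normal(size=(n, d))

# Peaky (high-SNR) scores: a few large "signal" logits over a flat tail.
scores = rng.normal(scale=0.1, size=n)
scores[:4] += 6.0

full = softmax(scores) @ values

# Case 1: high recall -- the active set keeps the top-8 scores.
top_k = np.argsort(scores)[-8:]
err_hi = np.linalg.norm(sparse_output(scores, values, top_k) - full)

# Case 2: recall failure -- prune the max-score token (s_max^c ~ m_i).
miss = top_k[top_k != np.argmax(scores)]
err_lo = np.linalg.norm(sparse_output(scores, values, miss) - full)

print(f"eps={residual_weight(scores, top_k):.2e}  err={err_hi:.4f}  (high recall)")
print(f"eps={residual_weight(scores, miss):.2e}  err={err_lo:.4f}  (recall failure)")
```

In this toy setup, pruning the flat tail loses almost no mass ($\epsilon_i \approx 0$), while pruning one of the four near-equal signal tokens discards roughly a quarter of the softmax mass, so the renormalized output is visibly biased toward the remaining keys, matching the recall-failure behavior described above.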