Scaled Dot-Product Attention as Entropic Optimal Transport
TL;DR: The design of Softmax Attention has largely been intuitive and engineering-driven, lacking a firm mathematical basis. This post presents a functional-analysis perspective, framing the Softmax function as the exact solution to an entropy-regularized optimal transport problem. This opens up a new theoretical angle for designing attention mechanisms.

Derivation: Scaled Dot-Product Attention as Entropic Optimal Transport

The Softmax function inherent in modern attention mechanisms is not an arbitrary heuristic. It is the unique solution to an entropy-regularized optimization problem, balancing the maximization of similarity against the maximization of distributional uncertainty (entropy). ...
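To make that claim concrete before the full derivation, here is a minimal sketch of the single-query case. The notation is my own (scores s_j, attention weights p over the probability simplex, regularization strength ε), not the post's: maximizing similarity plus ε times the Shannon entropy yields exactly the Softmax.

```latex
% Minimal sketch (single query): entropy-regularized problem over the simplex.
\begin{aligned}
% Objective: similarity plus \varepsilon-weighted Shannon entropy.
p^\star &= \arg\max_{p \in \Delta^{n}} \;
  \sum_{j=1}^{n} p_j s_j
  + \varepsilon \underbrace{\Big(-\sum_{j=1}^{n} p_j \log p_j\Big)}_{\text{entropy } H(p)} \\[4pt]
% Stationarity of the Lagrangian, with multiplier \lambda for \sum_j p_j = 1:
0 &= s_j - \varepsilon\,(\log p_j^\star + 1) - \lambda
  \;\;\Longrightarrow\;\; p_j^\star \propto \exp(s_j / \varepsilon) \\[4pt]
% Normalizing over j recovers the Softmax with temperature \varepsilon:
p_j^\star &= \frac{\exp(s_j / \varepsilon)}{\sum_{k=1}^{n} \exp(s_k / \varepsilon)}
          = \operatorname{softmax}(s / \varepsilon)_j .
\end{aligned}
```

With ε = 1 and scores s_j = qᵀk_j / √d_k, this recovers the standard scaled dot-product attention weights; strict concavity of the entropy term is what makes the maximizer unique.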