Self-Attention for Convolutional Neural Networks


Self-attention can help neural networks capture long-range dependencies. With plain convolutions this is difficult, as the receptive field is small.

Self-Attention GAN (SAGAN) by Zhang et al. (2019) introduces a self-attention mechanism into convolutional GANs, leading to clearer features. The issue is that convolutions capture local information well, but their receptive fields are not large enough to cover larger structures. Instead of increasing the filter sizes, we compute self-attention.

[Figure: Structure of self-attention in convolutional networks. The feature maps $x$ are fed through 1×1 convolutions to produce $f(x)$, $g(x)$, $h(x)$ and $v(x)$; the transposed $f(x)$ is multiplied with $g(x)$ and passed through a softmax to form the attention map, which is combined with $h(x)$ to yield the self-attention feature maps $o$. Adapted from Zhang et al. (2019).]

$f$, also called the query layer (produced by a 1×1 convolution), is transposed and then combined with $g$, the key layer, via a dot product to form the attention map. The attention map holds the attention score between two pixels, i.e. how much attention we pay to region $i$ when synthesizing region $j$. By combining $h$ (the value layer) and the attention map (again via a dot product), one obtains the attention feature maps (or attention weights).
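
The following sketch illustrates this computation in PyTorch (a minimal example of my own, not the reference SAGAN code; the 1×1 convolution layers `f_conv`, `g_conv` and `h_conv` are assumed to be passed in):

```python
import torch
import torch.nn.functional as F

def self_attention_features(x, f_conv, g_conv, h_conv):
    """Compute the attention map and attention feature maps for x of shape (B, C, H, W)."""
    b, c, hgt, wdt = x.shape
    n = hgt * wdt  # number of spatial positions (pixels)

    # 1x1 convolutions produce the query (f), key (g) and value (h) layers.
    f = f_conv(x).view(b, -1, n)  # (b, c', n) query
    g = g_conv(x).view(b, -1, n)  # (b, c', n) key
    h = h_conv(x).view(b, -1, n)  # (b, c,  n) value

    # Attention map: entry (i, j) scores how much attention we pay to
    # pixel i when synthesizing pixel j (softmax-normalized over i).
    attn = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # (b, n, n)

    # Attention feature maps: attention-weighted sum over the value layer.
    o = torch.bmm(h, attn).view(b, c, hgt, wdt)
    return o, attn
```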

To calculate the output, we add the attention weights (scaled by a factor $\gamma$) to the input feature: $\boldsymbol{y}_i = \gamma \boldsymbol{o}_i + \boldsymbol{x}_i$. By starting with $\gamma = 0$, the network first focuses on local neighborhoods and then, as $\gamma$ grows during training, gradually shifts attention to larger contexts.
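
A minimal module wrapping this, with the learnable scale $\gamma$ initialized to zero (again a sketch under my own naming, reusing `self_attention_features` from the snippet above; the channel reduction factor of 8 for the query/key layers follows the SAGAN paper):

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Sketch of a self-attention block for convolutional feature maps."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # 1x1 convolutions for the query (f), key (g) and value (h) layers.
        self.f_conv = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.g_conv = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.h_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # gamma starts at 0, so the block initially behaves like an identity
        # and only gradually mixes in the non-local attention output.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        o, _ = self_attention_features(x, self.f_conv, self.g_conv, self.h_conv)
        return self.gamma * o + x  # y_i = gamma * o_i + x_i

# Usage: insert the block between convolutional layers, e.g.
# block = SelfAttention2d(64); y = block(torch.randn(2, 64, 16, 16))
```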

Check out the fastai code to get an idea of the implementation.


Bibliography

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, 7354–7363. PMLR, 2019.