(Originally written on June 5th 2023)
Squeeze-and-Excitation Networks, SE-Nets for short, are built around Squeeze-and-Excitation (SE) blocks: lightweight architectural units that can be added to existing models, like ResNet or VGG.
Link to paper: https://arxiv.org/pdf/1709.01507.pdf
Link to my code: https://github.com/boosungkim/Paper-Implementations
The key problem the authors of the paper want to address is that channel dependencies in standard convolutions are implicit and local. They do so by adding the Squeeze-and-Excitation block, which relays channel-wise information explicitly.
By incorporating the SE blocks into the model, the network can adaptively recalibrate its feature maps to capture more discriminative information, leading to improved performance.
Implicit local channel information
Figure 2: A normal convolution
In a normal convolution like the image above, channel dependencies are implicitly included in the outputs of the convolutional layers. In other words, each layer computes the convolution over all the channels of a local region at every step.
Due to the locality of the convolutions, each channel in the output contains implicit channel embeddings entangled with local spatial correlations. To simplify further, each pixel in a channel contains the channel embeddings of the local region the convolution was computed over.
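To make this concrete, here is a minimal sketch (the channel counts and shapes are my own example, not from the post) showing that every output channel of a standard convolution sums over all input channels, but only within a small spatial window:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
print(conv.weight.shape)   # torch.Size([128, 64, 3, 3]): each output channel mixes all 64 input channels,
                           # but only over a 3x3 neighborhood

x = torch.randn(1, 64, 32, 32)
print(conv(x).shape)       # torch.Size([1, 128, 32, 32])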
The SE-Net
The SE-Net sets out to resolve this issue by introducing an explicit mechanism to model channel-wise relationships through the "squeeze" and "excitation" layers.
Squeeze
The network first "squeezes" the output of the previous convolutional layer into a C×1×1 shape using Global Average Pooling, producing a single summary value per channel.
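As a quick sketch (the shapes here are just illustrative), the squeeze step collapses each H×W feature map into a single value per channel:

import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)           # b x C x H x W
squeeze = nn.AdaptiveAvgPool2d((1, 1))   # global average pooling
print(squeeze(x).shape)                  # torch.Size([8, 64, 1, 1]), i.e. b x C x 1 x 1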
Excitation
The network then performs "excitation" by applying two Fully Connected (FC) layers. The first FC layer reduces the number of channels by a reduction ratio, which lowers the computational cost of the SE block. The second FC layer then expands the number of channels back to the original count, and a sigmoid maps each channel's score to a weight between 0 and 1. These FC layers capture channel dependencies and learn channel-wise relationships from the aggregated information produced by the squeeze operation.
FYI, because the FC layers are "fully connected," every node is connected to every other node. This is how the network captures channel-wise relationships and dependencies.
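For illustration (the channel count and ratio are my own example), with C = 64 and a reduction ratio of 16 the excitation step is just a 64 → 4 → 64 bottleneck of FC layers ending in a sigmoid:

import torch
import torch.nn as nn

C, r = 64, 16
excitation = nn.Sequential(
    nn.Linear(C, C // r, bias=False),   # 64 -> 4: compress the channel descriptor
    nn.ReLU(),
    nn.Linear(C // r, C),               # 4 -> 64: one score per channel
    nn.Sigmoid()                        # squash scores into (0, 1) weights
)
s = torch.randn(8, C)                   # squeezed descriptor, b x C
print(excitation(s).shape)              # torch.Size([8, 64])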
Rescale
Finally, the SE block reshapes the channel weights back to C×1×1 using the Unflatten operation and applies them to the original input via channel-wise multiplication, allowing the network to selectively amplify or suppress channel activations based on their importance.
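A small sketch of the rescale step (shapes again illustrative): once the weights are reshaped to C×1×1, PyTorch broadcasting applies one weight per channel across the whole H×W map.

import torch

x = torch.randn(8, 64, 32, 32)   # original input, b x C x H x W
w = torch.rand(8, 64, 1, 1)      # per-channel weights after Unflatten
print((x * w).shape)             # torch.Size([8, 64, 32, 32]): broadcast over H and W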
SE-Net implementation
Implementation
import torch
import torch.nn as nn

class se_block(nn.Module):
    def __init__(self, input_channels, reduction_ratio):
        super(se_block, self).__init__()
        self.sequence = nn.Sequential(
            # 1. Squeeze
            nn.AdaptiveAvgPool2d((1,1)),  # output: bxCx1x1
            nn.Flatten(),  # output: bxC
            # 2. Excitation
            nn.Linear(input_channels, input_channels // reduction_ratio, bias=False),  # output: bxC/r
            nn.ReLU(),  # output: bxC/r
            nn.Linear(input_channels // reduction_ratio, input_channels),  # output: bxC
            nn.Sigmoid(),  # output: bxC
            nn.Unflatten(1, (input_channels, 1, 1))  # output: bxCx1x1
        )

    def forward(self, x):
        z = self.sequence(x)
        # 3. Rescale
        z = x * z  # output: bxCxHxW
        return z
There is not much to add here, as the code follows the description one-to-one. One note: I originally used PyTorch's AvgPool2d with a manually computed kernel size matching the feature map's height and width, but PyTorch's AdaptiveAvgPool2d handles the dimensions for you.
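For comparison, a quick sketch of the two options (the 32×32 feature map size is just an example): AvgPool2d needs the spatial size spelled out, while AdaptiveAvgPool2d only needs the desired output size.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
manual = nn.AvgPool2d(kernel_size=(32, 32))   # kernel must match the feature map size
adaptive = nn.AdaptiveAvgPool2d((1, 1))       # works for any input H x W
print(manual(x).shape, adaptive(x).shape)     # both torch.Size([1, 64, 1, 1])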
Network summary
>>> from torchinfo import summary   # assuming torchinfo, based on the summary format below
>>> model = se_block(64, 16)
>>> summary(model, input_size=(1,64,32,32), col_names=["input_size","output_size","num_params"])
===================================================================================================================
Layer (type:depth-idx) Input Shape Output Shape Param #
===================================================================================================================
se_block [1, 64, 32, 32] [1, 64, 32, 32] --
├─Sequential: 1-1 [1, 64, 32, 32] [1, 64, 1, 1] --
│ └─AdaptiveAvgPool2d: 2-1 [1, 64, 32, 32] [1, 64, 1, 1] --
│ └─Flatten: 2-2 [1, 64, 1, 1] [1, 64] --
│ └─Linear: 2-3 [1, 64] [1, 4] 256
│ └─ReLU: 2-4 [1, 4] [1, 4] --
│ └─Linear: 2-5 [1, 4] [1, 64] 320
│ └─Sigmoid: 2-6 [1, 64] [1, 64] --
│ └─Unflatten: 2-7 [1, 64] [1, 64, 1, 1] --
===================================================================================================================
Total params: 576
Trainable params: 576
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.00
===================================================================================================================
Input size (MB): 0.26
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.27
===================================================================================================================
You may notice that the number of parameters is relatively low due to the use of the reduction ratio. Thus, including SE blocks in other models will not significantly increase the total parameter count.
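As a quick sanity check on the arithmetic (this snippet is mine, not from the original post): the first FC layer contributes 64 × 4 = 256 weights (no bias) and the second 4 × 64 + 64 = 320, matching the 576 total above.

model = se_block(64, 16)   # assumes the se_block class defined above
print(sum(p.numel() for p in model.parameters()))   # 576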
SE-ResNet
With the SE block coded, it is trivial to add it to our previous models. I decided to test it on ResNet.
class SE_ResidualBlockBottleneck(nn.Module):
    expansion = 4

    def __init__(self, input_channels, in_channels, reduction_ratio, stride=1):
        super(SE_ResidualBlockBottleneck, self).__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels=input_channels, out_channels=in_channels, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(in_channels=in_channels, out_channels=in_channels*4, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(in_channels*4)
        )
        self.relu = nn.ReLU()
        # SE block applied to the bottleneck output (in_channels*4 channels)
        self.se_block = se_block(in_channels*4, reduction_ratio)
        # Projection shortcut when the spatial size or channel count changes
        if stride != 1 or input_channels != self.expansion*in_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels=input_channels, out_channels=in_channels*4, kernel_size=1, stride=stride, padding=0, bias=False),
                nn.BatchNorm2d(in_channels*4)
            )
        else:
            self.shortcut = nn.Sequential()

    def forward(self, x):
        z = self.block(x)
        z = self.se_block(z)
        z += self.shortcut(x)
        z = self.relu(z)
        return z
Notice that the only additions are self.se_block = se_block(in_channels*4, reduction_ratio) and z = self.se_block(z). The number of channels is in_channels*4, as that is the output of the residual (bottleneck) block.
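As a quick usage sketch (the channel counts and input size are my own example, not from the post), the SE residual block can be dropped in wherever a plain bottleneck block was used:

import torch

# assumes SE_ResidualBlockBottleneck and se_block are defined as above
block = SE_ResidualBlockBottleneck(input_channels=64, in_channels=64, reduction_ratio=16, stride=1)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)   # torch.Size([1, 256, 56, 56]): the bottleneck expands channels by 4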
Experimentation
When run under the same hyperparameters as ResNet, SE-ResNet produces a training accuracy of 95.7% and a testing accuracy of 88.6%. Not quite as good as DenseNet, but I only ran SE-ResNet for about 400 epochs, significantly fewer than my runs for ResNet and DenseNet.
Conclusion
The results fall in line with those of the paper: accuracy improved and the network converged faster than the original ResNet.
While I only tested the SE block on ResNet, the squeeze-and-excitation method can be incorporated into many other models as well.