Deep Residual Learning for Image Recognition; ResNet (2015)

1) Overview

    - very deep networks using residual connections

    - 152 layers

    - ILSVRC 2015 winner

    - In 'plain' convolutional networks, going from 20 to 56 layers makes things worse: the deeper network has both higher training error and higher test error.

    - Goal: resolve this degradation problem in deeper networks. It is not caused by overfitting, since the training error also rises.



Residual Block

    - Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping

    - The earlier approach tried to learn the desired mapping H(x) directly with a few stacked weight layers (two here). A residual block instead targets H(x) - x, the difference between the desired output and the input -> the weights are trained to produce H(x) - x

       if F(x) = H(x) - x -> H(x) = F(x) + x

    - If an identity mapping is optimal (F(x) = 0), the solver can simply drive the weights of the stacked nonlinear layers toward 0.

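    - Identity shortcuts add no extra parameters and no meaningful extra computation (only an element-wise addition)

    - Makes it possible to optimize deeper networks

    - The dimensions of x and F(x) must be equal -> if not, apply a linear projection Ws to x on the shortcut so that it matches F(x).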

       (In the bottleneck design, replacing the identity shortcuts with projections roughly doubles time complexity and model size)
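The bullets above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: dense layers stand in for the conv layers, batch normalization is left out, and the names `residual_block`, `w1`, `w2`, `w_s` and all shapes are made up for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2, w_s=None):
    """Return relu(F(x) + shortcut(x)) for a two-layer residual branch F."""
    f = relu(x @ w1) @ w2                      # F(x): two weight layers, ReLU in between
    shortcut = x if w_s is None else x @ w_s   # identity, or linear projection Ws
    return relu(f + shortcut)                  # H(x) = F(x) + x, then the final ReLU

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 8)))   # non-negative input, like post-ReLU activations

# With the branch weights at zero, F(x) = 0 and the block is an identity mapping.
zeros = np.zeros((8, 8))
identity_out = residual_block(x, zeros, zeros)

# If F changes the width (8 -> 16), the shortcut needs a projection Ws to match.
w1 = rng.normal(size=(8, 16)) * 0.1
w2 = rng.normal(size=(16, 16)) * 0.1
w_s = rng.normal(size=(8, 16)) * 0.1
projected_out = residual_block(x, w1, w2, w_s)
```

The identity case returns `x` unchanged only because the input is non-negative, so the final ReLU has no effect on it.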



    - stack residual blocks

    - every residual block has two 3x3 conv layers

    - periodically, double the number of filters and downsample spatially using stride 2 (halving each spatial dimension) -> activation volume is reduced by half

    - additional conv layer at the beginning (stem)

    - No hidden FC layers at the end (only global average pooling followed by FC 1000 to output classes)

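    - For deeper networks, use a 'bottleneck' block to improve efficiency: a 1x1 conv layer reduces the number of channels before the 3x3 conv, and another 1x1 conv layer restores it. This design was motivated by concerns about training time.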
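Two of the bullets above come down to simple arithmetic, sketched here. The sizes are illustrative assumptions: a 56x56 feature map with 64 channels for the downsampling step, and a 256-channel bottleneck with 64 internal channels compared against a 64-channel block of two 3x3 convs. Biases and BN parameters are ignored.

```python
def activation_volume(height, width, channels):
    """Number of activations in one feature map stack."""
    return height * width * channels

def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution, ignoring biases and BN parameters."""
    return k * k * c_in * c_out

# Stride 2 halves height and width while the number of filters doubles.
before = activation_volume(56, 56, 64)
after = activation_volume(28, 28, 128)
print(before, after)   # prints: 200704 100352

# Basic block: two 3x3 convs at 64 channels.
basic_64 = 2 * conv_params(3, 64, 64)

# Bottleneck block on 256 channels: 1x1 reduce -> 3x3 -> 1x1 restore.
bottleneck_256 = (conv_params(1, 256, 64)
                  + conv_params(3, 64, 64)
                  + conv_params(1, 64, 256))

# For comparison: two 3x3 convs applied directly at 256 channels.
basic_256 = 2 * conv_params(3, 256, 256)
print(basic_64, bottleneck_256, basic_256)   # prints: 73728 69632 1179648
```

Under these assumptions the bottleneck block handles 4x wider features for roughly the cost of the 64-channel basic block, and it is about 17x cheaper than running two 3x3 convs at 256 channels.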


Training ResNet

1) Hyperparameters

    - BN after every CONV layer

    - 3x3 conv layer

    - He initialization ("Xavier/2") from He et al.

    - SGD + Momentum (0.9)

    - LR : 0.1, divided by 10 when the validation error plateaus

    - mini-batch size = 256

    - weight decay 0.0001 (1e-4, as stated in the paper)

    - no dropout used 

    - only one max-pooling layer (after the stem); elsewhere, stride-2 convolutions reduce the feature-map size instead of pooling
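A rough sketch of three items from the list above: the initialization from He et al. (taken here as a standard deviation of sqrt(2 / fan_in)), one SGD + momentum step, and the divide-by-10 learning-rate rule. The `patience` parameter is an assumption, since the notes do not say how long the error must stall before the rate is cut, and the class and function names are hypothetical.

```python
import math

def he_std(fan_in):
    """Standard deviation of He initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One SGD + momentum update on a scalar weight (weight decay omitted)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

class PlateauLR:
    """Divide the learning rate by 10 when validation error stops improving."""

    def __init__(self, lr=0.1, patience=2, factor=0.1):
        # patience (epochs without improvement before a cut) is an assumed value
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_error):
        if val_error < self.best:
            self.best, self.bad_epochs = val_error, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Made-up validation errors: the error stalls for two epochs, so the LR is cut once.
sched = PlateauLR()
history = [sched.step(err) for err in [0.40, 0.35, 0.35, 0.36, 0.30]]

# A 3x3 conv with 64 input channels has fan_in = 3 * 3 * 64 = 576.
std_3x3_64 = he_std(3 * 3 * 64)
```

With these inputs the learning rate stays at 0.1 for three epochs and drops to about 0.01 on the fourth.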


2) Experiment

    - VGG-19

    - residual net : 34 layers, with a shortcut connection around every 2 conv layers

    - plain net with 34 layers
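As a quick check of where "34 layers" comes from, the count can be reproduced from the block layout. The per-stage block counts (3, 4, 6, 3) are taken from the paper's architecture table and do not appear in these notes, so treat them as an assumption of this sketch.

```python
# Residual blocks per stage in the 34-layer network (assumed from the paper's table).
blocks_per_stage = [3, 4, 6, 3]
convs_per_block = 2    # every residual block has two 3x3 conv layers

stem_conv = 1          # the single conv layer at the beginning
final_fc = 1           # the FC-1000 classifier

conv_in_blocks = convs_per_block * sum(blocks_per_stage)
total_weighted_layers = stem_conv + conv_in_blocks + final_fc
shortcuts = sum(blocks_per_stage)   # one shortcut per pair of conv layers

print(total_weighted_layers, shortcuts)   # prints: 34 16
```

The 34-layer plain net has the same 34 weighted layers, just without the 16 shortcuts.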


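ResNets converge faster than their plain counterparts, and the deeper (34-layer) ResNet does not show the degradation seen in the plain net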


3) More than 1000 layers

    - The 1202-layer network has worse test error than the 110-layer network although both reach similar training error (the paper attributes this to overfitting).
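The depths 110 and 1202 follow a simple pattern. The sketch below assumes the paper's CIFAR-10 layout, which these notes do not spell out: one initial conv layer, three stages of n residual blocks with two conv layers each, and one final FC layer, giving 6n + 2 layers. The function name is hypothetical.

```python
def cifar_resnet_depth(n):
    """Depth of a ResNet with 3 stages of n two-layer residual blocks (assumed layout)."""
    first_conv = 1
    final_fc = 1
    return first_conv + 3 * n * 2 + final_fc   # = 6n + 2

depths = {n: cifar_resnet_depth(n) for n in (3, 9, 18, 200)}
print(depths)   # prints: {3: 20, 9: 56, 18: 110, 200: 1202}
```

Under this assumption n = 3 and n = 9 give the 20- and 56-layer networks from the overview, while n = 18 and n = 200 give the 110- and 1202-layer networks compared here.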

