본문으로 바로가기

VGGNet

1) 개요

    - Investigate the effect of the convolutional network depth on its accuracy.

    - 2nd place at ILSVRC (1st : GoogLeNet). But, it is used alot more than GoogLeNet due to simple structure and generalize well to other dataset.

    - Similar training procedure as AlexNet.

 

2) Depth (small filters, deeper networks)

    - The depth of networks were shallow before 2014 ILSVRC. (less than 8 layers)

    - VGGNet only used 3x3 kernel filter which is the smallest size to capture the notion of left/right, up/down, center)

http://cs231n.stanford.edu/slides/2020/lecture_9.pdf

    - Stack of three 3x3 conv (stride 1) layers has same effective receptive filed as one 7x7 conv layer with fewer parameters.

    - Deeper layers increase non-linearity -> easy to extract useful feature.

   

Architecture

1) Hyper-parameters 

    - Only preprocessing of image is subtracting the mean RGB value computed on the training set.

    - 3x3 kernel filter stride 1, 2x2 max-pooling stride 2.

    - ReLU for activation function.

    - No use of LRN (Local Response Normalisation) -> doesn't improve the performance but the waste of computing power.

    - mini-batch(batch size : 256) gradient descent with momentum 0.8.

    - L2 regularisation 5*10^-4, learning rate 0.01 (decreased by 1/10 when the val set accuracy stopped improving.

 

2) Weight initialization

    - Use VGG-A(11 layers) parameters to resolve vanishing/exploding gradient and reduce training time.

    - First 4 conv layers and the last 3 fully-connected layers of net A weights are used.

 

3) Training

    - 'S' be the training scale.

    - Single scale (fixed S) : VGGNet used 2 single fixed scale (256, 384). Train the network using S = 256 and trained S = 384 with pre-trained weights with S = 256.

    - Multi scale (min = 256, max = 512) : Each training image is indiviually rescaled by randomly sampling S.

    - Objects in images can be of different size. -> scale jittering.

 

4) Testing

    - Use ensembles for best results (Voting).

    - Rescaled to a pre-defined smallest image side 'Q'

    - Q isn't necessarily equal to the training scale S.

 

Results

1) Single scale evaluation (no scale jittering for test)

    - Using LRN  doesn't improve on the model A without any normalization layers.

    - Smaller filters with deeper nets get better results. (additional non-linearity does help)

    - Scale jittering leads to significantly better results.

 

2) Multi scale evaluation

    - scale jittering at test time leads to better performance.

    - Large discrepancy between training and testing scales leads to a drop in performance.

 

3) Multi crop evaluation

    - Using multiple crops performs slightly better than dense evaluation.

    - Dense evaluation and muliple crops are complementary. -> better performance.


Ref.

cs231n.stanford.edu/slides/2020/lecture_9.pdf

bskyvision.com/504

arxiv.org/abs/1409.1556

blog.naver.com/PostView.nhn?blogId=laonple&logNo=220749876381&parentCategoryNo=&categoryNo=22&viewDate=&isShowPopularPosts=false&from=postView