SSD: Single Shot Detector

There is a class of models for localization and object detection, called Single Shot Detectors, which are faster and generally require less computation.

These models are called Single Shot Detectors because they process the image only once and output the predictions immediately.


Single Shot Detectors (SSD):


Instead of having a dedicated system to propose ROIs, we use a set of predefined boxes to look for objects, and a series of convolutional layers predicts class scores and bounding box offsets for them.

Then, for each predefined box, we predict a number of bounding boxes with a confidence score assigned to each one, detect one object centered in that box, and output a set of probabilities over the possible classes.

Once we have all of that, we simply, and maybe naively, keep only the boxes with a high confidence score.
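A minimal post-processing sketch in PyTorch, assuming hypothetical tensors boxes (N×4, corner format) and class_probs (N×C); the thresholds are illustrative, not values from the paper, and the non-maximum suppression step is one common way to keep only the high-confidence boxes:

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, class_probs, score_thresh=0.5, iou_thresh=0.45):
    """Keep high-confidence boxes, then suppress near-duplicates with NMS.

    boxes:       (N, 4) tensor in (x1, y1, x2, y2) format
    class_probs: (N, C) tensor of per-class probabilities
    """
    scores, labels = class_probs.max(dim=1)           # best class per box
    keep = scores > score_thresh                      # confidence filter
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = nms(boxes, scores, iou_thresh)             # drop heavily overlapping boxes
    return boxes[kept], scores[kept], labels[kept]
```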

Details below are adapted from: https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11

MultiBox Detector:


SSD only needs an input image and ground-truth boxes for each object during training.

After going through a number of convolutions for feature extraction, we obtain a feature layer of size m×n (the number of locations) with p channels, such as 8×8 or 4×4. A 3×3 convolution is then applied on this m×n×p feature layer.

For each location, we have k default bounding boxes. These k bounding boxes have different sizes and aspect ratios (e.g., a vertical rectangle fits a person, a horizontal rectangle fits a car).

For each of these bounding boxes, we compute c class scores and 4 offsets relative to the original default bounding box shape.

Finally, we have (c+4)mnk outputs.
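A rough sketch of one such prediction layer in PyTorch; the values of m, n, p, k, c are hypothetical and only serve to check the output count:

```python
import torch
import torch.nn as nn

m, n, p = 8, 8, 512      # feature layer of size m x n with p channels
k, c = 6, 21             # k default boxes per location, c class scores each

# One 3x3 conv predicts c class scores and 4 offsets for each of the k boxes.
head = nn.Conv2d(p, k * (c + 4), kernel_size=3, padding=1)

feature_map = torch.randn(1, p, m, n)      # dummy feature layer
out = head(feature_map)                    # shape: (1, k*(c+4), m, n)
print(out.numel(), "==", (c + 4) * k * m * n)   # (c+4)kmn outputs
```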

SSD Network Architecture:




To obtain more accurate detections, feature maps from several different layers are also passed through a small 3×3 convolution for object detection.

SSD evaluates many more bounding boxes than YOLO, as the quick count below shows.
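A quick sanity check of the box count, assuming the SSD300 configuration reported in the paper (feature maps of sizes 38, 19, 10, 5, 3, 1 with 4, 6, 6, 6, 4, 4 default boxes per location):

```python
# SSD300: feature map sizes and default boxes per location, per the paper
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]

total = sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_location))
print(total)   # 8732 default boxes, versus 98 boxes in YOLO v1
```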


Loss Function:

The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf).

N is the number of matched default boxes.
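As written in the SSD paper (α is a weighting term, set to 1 by cross-validation):

L(x, c, l, g) = (1/N) * ( L_conf(x, c) + α * L_loc(x, l, g) )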

Localization loss:
The localization loss is a Smooth L1 loss between the predicted box (l) and the ground-truth box (g) parameters.

Confidence loss:
The confidence loss is the softmax loss over multiple class confidences.
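A minimal sketch of the two terms in PyTorch, assuming hypothetical tensors loc_preds/loc_targets (offsets for the N matched default boxes) and conf_logits/labels (class logits and target class indices); in the full model the confidence term also includes hard-mined negatives (see Hard Negative Mining below):

```python
import torch
import torch.nn.functional as F

def ssd_loss(loc_preds, loc_targets, conf_logits, labels, alpha=1.0):
    """Weighted sum of Smooth L1 localization loss and softmax confidence loss.

    loc_preds, loc_targets: (N, 4) offsets for the N matched default boxes
    conf_logits:            (N, num_classes) raw class scores
    labels:                 (N,) target class indices
    """
    N = max(loc_preds.size(0), 1)                               # avoid divide-by-zero
    loc_loss = F.smooth_l1_loss(loc_preds, loc_targets, reduction="sum")
    conf_loss = F.cross_entropy(conf_logits, labels, reduction="sum")
    return (conf_loss + alpha * loc_loss) / N
```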

Scales and Aspect Ratios of Default Boxes:


Suppose we want to use m feature maps for prediction. The scale of the default boxes for the k-th feature map is computed as:

s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1),   k ∈ [1, m]

s_min = 0.2, s_max = 0.9

which means the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced.

We impose different aspect ratios for the default boxes, denoted as a_r ∈ {1, 2, 3, 1/2, 1/3}.

The width and height of each default box are w_k = s_k·√(a_r) and h_k = s_k/√(a_r).

For the aspect ratio of 1:1, an additional default box with scale s'_k = √(s_k·s_(k+1)) is also added.

Therefore, we have 6 default boxes per feature map location, each with a different aspect ratio.


For layers with only 4 default boxes per location, the aspect ratios 1/3 and 3 are omitted.
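A sketch of how the default box shapes could be generated from these formulas; the helper default_box_shapes is hypothetical and simply follows the scale and aspect-ratio rules above:

```python
import math

def default_box_shapes(k, m, aspect_ratios=(1, 2, 3, 1/2, 1/3),
                       s_min=0.2, s_max=0.9):
    """Return (w, h) pairs for the k-th of m feature maps (k = 1..m)."""
    scale = lambda i: s_min + (s_max - s_min) * (i - 1) / (m - 1)
    s_k = scale(k)
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra box for aspect ratio 1 with scale s'_k = sqrt(s_k * s_{k+1})
    s_prime = math.sqrt(s_k * scale(k + 1))
    boxes.append((s_prime, s_prime))
    return boxes            # 6 boxes; 4 when ratios 3 and 1/3 are dropped

print(len(default_box_shapes(1, m=6)))   # 6 (w, h) pairs per location
```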

Hard Negative Mining:


Instead of using all the negative examples, we sort them by the confidence loss of each default box and pick the top ones so that the ratio between negatives and positives is at most 3:1.
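A sketch of the selection step in PyTorch, assuming conf_loss holds the per-default-box confidence loss and is_pos marks the boxes matched to ground truth:

```python
import torch

def hard_negative_mask(conf_loss, is_pos, neg_pos_ratio=3):
    """Keep all positives plus the highest-loss negatives, at most 3 negatives per positive."""
    neg_loss = conf_loss.clone()
    neg_loss[is_pos] = 0.0                              # ignore positives when ranking
    num_neg = min(neg_pos_ratio * int(is_pos.sum()), int((~is_pos).sum()))
    _, idx = neg_loss.sort(descending=True)             # hardest negatives first
    neg_mask = torch.zeros_like(is_pos)
    neg_mask[idx[:num_neg]] = True
    return is_pos | neg_mask                            # boxes used in the confidence loss
```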

Atrous Convolution: 


Atrous (dilated) convolution increases the receptive field while keeping the number of parameters relatively small compared with a conventional convolution covering the same receptive field.
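A quick comparison in PyTorch: a 3×3 convolution with dilation 2 covers a 5×5 receptive field yet has exactly the same number of parameters as a regular 3×3 convolution; the channel sizes here are arbitrary.

```python
import torch.nn as nn

regular = nn.Conv2d(256, 256, kernel_size=3, padding=1)               # 3x3 receptive field
atrous  = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)   # 5x5 receptive field

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(regular) == n_params(atrous))   # True: same parameters, larger field
```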
