Posts

Showing posts with the label Object detection

Loss Function for Object Detection Regression

In object detection, loss = classification loss + bounding box regression loss. Bounding box regression loss development: Smooth L1 loss --> IoU loss --> GIoU loss --> DIoU loss --> CIoU loss.

1. Smooth L1 loss (proposed in Fast RCNN):
x: difference between the predicted bounding box and the ground truth.

L1(x) = |x|
L2(x) = x^2
smoothL1(x) = 0.5x^2    if |x| < 1
            = |x| - 0.5 otherwise

Derivatives of the above three loss functions:
dL1(x)/dx = 1   if x >= 0
          = -1  otherwise
dL2(x)/dx = 2x
dsmoothL1(x)/dx = x   if |x| < 1
                = ±1  otherwise

From the above, the derivative of the L1 loss is constant; when x becomes small late in training and the learning rate stays the same, the loss will fluctuate around a certain value and...
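A minimal NumPy sketch of the smooth L1 loss and its gradient exactly as written above, applied element-wise to the box-coordinate differences x (the function names are just for illustration, not from any particular library):

import numpy as np

def smooth_l1(x):
    # 0.5 * x^2 where |x| < 1, |x| - 0.5 otherwise
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def smooth_l1_grad(x):
    # gradient is x in the quadratic region, ±1 in the linear region
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, x, np.sign(x))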

SSD-Single Shot Detector

There is one class of models for localization and object detection, called Single Shot Detectors, which are generally faster and require less computation. Because they process the image only once and output the predictions immediately, these models are called Single Shot Detectors. Single Shot Detector (SSD): Instead of having a dedicated system to propose ROIs, we have a set of predefined boxes in which to look for objects, and these are forwarded through a set of convolutional layers that predict class scores and bounding box offsets. For each predefined box we predict a number of bounding boxes, each with a confidence score; we detect one object centered in that box and output a set of probabilities over the possible classes. Once we have all that, we simply, and maybe naively, keep only the boxes with a high confidence score. Details below are adapted from: https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a9460...
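A toy NumPy sketch of that naive keep-only-the-confident-boxes step, assuming class index 0 is background and the per-box class scores are already probabilities (names and threshold are illustrative, not SSD's exact post-processing):

import numpy as np

def keep_confident_boxes(boxes, class_scores, score_thresh=0.5):
    # boxes: (N, 4) predicted box coordinates; class_scores: (N, C), class 0 = background
    best_class = class_scores[:, 1:].argmax(axis=1) + 1   # best non-background class per box
    best_score = class_scores[:, 1:].max(axis=1)
    keep = best_score > score_thresh                      # drop low-confidence boxes
    return boxes[keep], best_class[keep], best_score[keep]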

Fast RCNN

The idea is straightforward.
1) Instead of passing all regions through the convolutional layers one by one, we pass the entire image once and produce a feature map.
2) Then we take the region proposals as before (using some external method) and project them onto the feature map.
3) Now we have the regions on the feature map instead of the original image, and we can forward them through some fully connected layers to output the classification decision and the bounding box correction.
Note that the projection of the region proposals is implemented with a special layer (the ROI pooling layer), which is essentially a type of max-pooling with a pool size dependent on the input, so that the output always has the same size.
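A minimal PyTorch sketch of that idea: each projected region, whatever its size on the feature map, is max-pooled down to a fixed output size. This naive version assumes integer region coordinates in feature-map space and uses adaptive max pooling to stand in for the real layer (torchvision.ops.roi_pool is the library implementation):

import torch
import torch.nn.functional as F

def roi_pool_naive(feature_map, rois, output_size=(7, 7)):
    # feature_map: (C, H, W); rois: list of (x1, y1, x2, y2) in feature-map coordinates
    pooled = []
    for x1, y1, x2, y2 in rois:
        region = feature_map[:, y1:y2 + 1, x1:x2 + 1]          # variable-sized crop
        pooled.append(F.adaptive_max_pool2d(region, output_size))
    return torch.stack(pooled)                                  # (num_rois, C, 7, 7)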

R-CNN

Given an image with multiple objects, we generate some ROIs using a proposal method (Selective Search) and warp the regions to a fixed size. Each region is then forwarded to a CNN (such as AlexNet), whose features are used by an SVM to make a classification decision for each region and by a regressor to predict a correction for each bounding box. This prediction comes as a correction of the proposed region, which may be in the right position but not at the exact size and orientation. Although the model produces good results, it suffers from a main issue: it is quite slow and computationally expensive. Imagine that in an average case we produce 2000 regions, which we need to store on disk, and we forward each of them through the CNN for multiple passes until it is trained.
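A rough sketch of the per-region preprocessing described above: crop each proposed region from the image and warp it to the fixed input size the CNN expects. Proposal generation itself is not shown, and the use of OpenCV's resize and the 227x227 AlexNet-style input size are assumptions for illustration:

import numpy as np
import cv2

def warp_proposals(image, proposals, size=(227, 227)):
    # image: (H, W, 3) uint8 array; proposals: list of (x1, y1, x2, y2) boxes
    crops = []
    for x1, y1, x2, y2 in proposals:
        crop = image[y1:y2, x1:x2]                 # crop the proposed region
        crops.append(cv2.resize(crop, size))       # warp it to the fixed CNN input size
    return np.stack(crops)                         # (num_proposals, 227, 227, 3)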

DeepSnake

Traditional snake algorithms: Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. Active contour models can thus deform the contour toward the object boundary. However, the energy functional is typically nonconvex, so the deformation process tends to find locally optimal solutions. Network architecture: Deep snake consists of three parts: a backbone, a fusion block and a prediction head. The backbone comprises 8 "CirConvBn-ReLU" layers and uses residual skip connections across all layers. The fusion block aims to fuse the information across all contour points at multiple scales. Detail: Deep snake is added to an object detection model. The detector first produces detected boxes that are used to construct diamond contours. Then deep snake deforms the diamond vertices to obje...
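A minimal PyTorch sketch of the circular-convolution idea behind a "CirConvBn-ReLU" layer: the contour vertices form a closed loop, so a 1-D convolution over the vertex features uses circular padding instead of zero padding. The kernel size and channel sizes here are illustrative, not the paper's exact configuration:

import torch
import torch.nn as nn

class CirConvBnReLU(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=9):
        super().__init__()
        # circular padding wraps around the contour, treating it as a closed curve
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, padding_mode="circular")
        self.bn = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):          # x: (batch, channels, num_contour_vertices)
        return self.relu(self.bn(self.conv(x)))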

Faster-RCNN

Contribution: Proposed the RPN (Region Proposal Network) and anchor boxes, so that a neural network can be used to generate proposals. The advantages are:
1) It is almost cost-free, because the RPN shares parameters with the feature-extraction CNN;
2) The RPN predicts bounding boxes and objectness scores at the same time;
3) Anchor boxes are introduced, so proposals can be predicted at multiple aspect ratios and scales;
4) The whole object detection network can be trained end-to-end.
Architecture: An RPN (region proposal network) is added on top of Fast-RCNN. Faster RCNN mainly consists of four parts: conv layers, RPN, RoI pooling, and the classifier. RoI pooling and the classifier belong to Fast RCNN, while the conv layers are shared between the RPN and Fast RCNN.
1. Shared conv layers: 13 convolutional layers and 4 pooling layers extract the image feature map; this part is shared by Fast RCNN and the RPN;
2. Region Proposal Network: the RPN generates region proposals. A softmax decides whether each anchor is positive or negative, and bounding box regression then refines the anchors to obtain accurate proposals;
3. ROI pooling: extracts the feature of each region from the region proposals and the feature map, and sends it to the subsequent fully connected layers to decide the object class;
4. Classification and regression: similar to Fast RCNN, this part does two jobs: (1) an FC layer + softmax classifies each object proposal into one of k+1 classes, including the background class; (2) an FC layer + bbox regressor outputs 4 values for each of the k classes. ...
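A small NumPy sketch of the anchor-box idea from point 3): at every feature-map location, place boxes of several scales and aspect ratios, centered on that location in image coordinates. The stride, scales, and ratios below follow common Faster RCNN settings but are meant as an illustration, not the exact configuration used here:

import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # ratio r is interpreted as h/w; each anchor keeps roughly the area scale**2
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # center in image coordinates
            for s in scales:
                for r in ratios:
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)   # (feat_h * feat_w * 9, 4) boxes as (x1, y1, x2, y2)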