Showing posts with the label DeepLearning

Batch Size, Learning Rate, Data Augmentation, Internal Covariate Shift

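As the batch size increases, more epochs are needed to reach the same accuracy, because the standard deviation of the gradient estimate does not shrink linearly with the number of samples: it falls only as 1/sqrt(n). A large batch size is also more prone to getting stuck in a local optimum, and for the same amount of computation it converges more slowly because of this standard-deviation effect. Increasing the batch size up to a certain point gives the best wall-clock training time.

Data Augmentation:
Color Jittering: perturb image brightness, saturation, and contrast.
PCA Jittering: compute the mean and standard deviation per RGB channel, then compute the covariance matrix over the whole training set and eigendecompose it; the resulting eigenvectors and eigenvalues are used for PCA jittering.
Random Scale
Random Crop
Horizontal/Vertical Flip: flip the image horizontally or vertically.
Shift: translation.
Rotation/Reflection: rotation and affine transformations.
Noise: Gaussian noise, blurring.
Label Shuffle

Internal Covariate Shift:
A deep neural network stacks many layers, and every parameter update in one layer changes the distribution of the inputs to the layers above it. As these changes accumulate through the stack, the input distribution of the higher layers can change drastically.
Covariate shift means that the conditional probabilities of the source domain and the target domain are the same, but their marginal probabilities differ.
Problems caused by ICS: the inputs to each neuron are no longer independent and identically distributed.
1) The upper layers must keep adapting to a new input distribution, which slows learning.
2) Changes in the lower-layer inputs may tend to grow or shrink, pushing the upper layers into the saturation region so that learning stops prematurely.
3) Each layer's update affects the other layers, so the parameter update strategy for every layer has to be as cautious as possible.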
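The batch-size remark at the start of this post rests on the fact that the standard deviation of a batch mean shrinks as 1/sqrt(n) rather than linearly in n. The following is a minimal simulation sketch of that effect, assuming unit-variance Gaussian "gradients"; the function name `std_of_batch_mean` and the trial count are illustrative choices, not part of the original post.

```python
import random
import statistics


def std_of_batch_mean(batch_size, trials=2000, seed=0):
    """Empirical standard deviation of the mean of `batch_size` N(0, 1) samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)


# Quadrupling the batch size only halves the noise of the estimate.
for n in (1, 4, 16, 64):
    print(n, round(std_of_batch_mean(n), 3), "theory:", round(n ** -0.5, 3))
```

Under this assumption, a 64x larger batch costs 64x more computation per step but reduces the noise only 8x, which is one way to read the claim that large batches converge more slowly per unit of computation.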

Common Activation Functions

1. Sigmoid function, defined as: sigmoid(x) = 1 / (1 + e^(-x)). Its output lies between 0 and 1, so it is especially useful for models that must predict a probability as output. The function is differentiable and monotonic, but its derivative is not monotonic; the derivative also approaches zero for inputs of large magnitude, which can cause a neural network to get stuck during training. The softmax function is a generalization of the logistic activation function and is used for multiclass classification.
2. tanh function (hyperbolic tangent activation function). The range of tanh is -1 to 1. Its advantage is that negative inputs are mapped strongly negative and inputs near zero are mapped near zero. The function is also differentiable and monotonic, while its derivative is not monotonic. tanh is mainly used for classification between two classes. Both the tanh and logistic sigmoid activation functions are used in feed-forward nets.
3. ReLU (Rectified Linear Unit) activation function: f(x) = max(0, x) ...
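The three activation functions named in this excerpt can be sketched in a few lines. This is only an illustrative sketch using the Python standard library; the helper names (`sigmoid_grad`, `tanh_grad`) are my own, and the derivative formulas are the standard closed forms rather than anything stated in the post.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative s(x) * (1 - s(x)); peaks at x = 0 and decays on both sides."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative 1 - tanh(x)^2; also peaks at x = 0."""
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid(x), sigmoid_grad(x), math.tanh(x), tanh_grad(x), relu(x))
```

Both derivatives rise up to x = 0 and fall afterwards, which is what "the derivative is not monotonic" means here, and both are close to zero at x = -4 and x = 4.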

Pooling: Purpose, Advantages, and Disadvantages

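Purpose of the pooling layer:
1. Invariance (translation, rotation, scale).
2. Reducing the number of parameters and the amount of computation (dimensionality reduction) while retaining the main features, which improves the model's ability to generalize.
In a neural network, the pooling function usually comes right after the convolution function. The pooling operation scans a matrix window over the tensor and reduces the number of elements by replacing the values in each window with their maximum or their average.
Common pooling functions: avg_pool, max_pool, max_pool_with_argmax.
Scale invariance enlarges the receptive field and lets the convolution see more information, but the receptive field grows only at the price of lower resolution. Important information is lost, which hurts the precise localization that segmentation requires.
Then why not use a convolution kernel as large as the image from the very start, since that would also enlarge the receptive field? That is wrong: the deeper the stack of convolutional layers, the stronger the model's representational power, but reducing dimensionality at that point would lose important information.
Adapted from https://blog.csdn.net/zxyhhjs2017/article/details/78607469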
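The window-scanning description of pooling can be illustrated with a small pure-Python sketch. The function name `pool2d`, the 2x2 window, and the stride of 2 are assumptions chosen for the example; this is not the implementation behind avg_pool or max_pool in any library.

```python
def pool2d(image, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D list and reduce each window."""
    reduce_fn = max if mode == "max" else (lambda w: sum(w) / len(w))
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values covered by the current window.
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
max_pooled = pool2d(feature_map, mode="max")  # [[6, 2], [2, 9]]
avg_pooled = pool2d(feature_map, mode="avg")  # [[3.5, 1.0], [1.0, 6.0]]
print(max_pooled, avg_pooled)
```

The 4x4 input becomes 2x2: three quarters of the values are discarded, which is both the computational saving and the loss of spatial detail discussed above.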

Bilinear Pooling

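Paper: Revisiting Bilinear Pooling: A Coding Perspective
Common feature fusion methods: BoW (Bag of Words), VLAD (Vector of Locally Aggregated Descriptors), FV (Fisher Vector). Recent research indicates that bilinear pooling is a more effective feature fusion method. Bilinear pooling captures the relationships between features by modeling their higher-order statistics, and thereby produces an expressive global representation.
Problems:
1. The representation produced by bilinear pooling contains a large amount of information redundancy.
2. Bilinear pooling suffers from burstiness, which reduces the discriminative power of the representation.
Method:
1. The coding-pooling framework of bilinear pooling
The bilinear pooling method has the form: [equation image not preserved] where Z is the matrix representation of bilinear pooling, and vectorizing Z gives z as the global representation. In this paper, the authors prove that bilinear pooling is a similarity-based coding-pooling framework. The global representation z can be written as: [equation image not preserved] where B is the dictionary, and bilinear pooling computes the inner-product similarity between the bilinear feature fi and the dictionary atom bl. The codes formed from these similarities are aggregated into the global representation z by sum pooling.
Under this coding-pooling framework, three properties affect the performance of bilinear pooling: (1) the bilinear feature is a rank-1 matrix and contains a large amount of redundant information; (2) the dictionary B is determined by the input bilinear features, so different inputs are encoded with different dictionaries; (3) when bilinear pooling is used for multimodal tasks, the dictionary atoms are collinear, which hurts the discriminative power of the representation z.
2. Factorized bilinear coding
From the coding perspective, the authors propose factorized bilinear coding (FBC) to fuse features. They replace similarity-based coding with sparse coding, which activates as few dictionary atoms as possible while retaining as much information as possible. Compared with the original bilinear pooling, factorized bilinear coding learns a global dictionary for encoding, which improves the discriminative power of z. Encoding the high-dimensional bilinear features directly would easily introduce a large number of parameters. To avoid this, the authors factorize the dictionary atoms: each atom is decomposed into the product of two matrices whose rank is much smaller than the dimension of the bilinear feature. Compared with the original bilinear pooling, factorized bilinear coding greatly reduces memory consumption. For example, in visual question answering, the text features...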
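The bilinear pooling formula itself did not survive in this excerpt. As a rough sketch, the code below assumes the commonly used form in which the matrix Z is the sum of outer products of paired local features, Z = sum_i x_i y_i^T, and z is Z flattened; the names `outer` and `bilinear_pool` and the toy inputs are illustrative only and are not taken from the paper.

```python
def outer(x, y):
    """Outer product x y^T: a rank-1 matrix (the 'bilinear feature')."""
    return [[xi * yj for yj in y] for xi in x]


def bilinear_pool(xs, ys):
    """Sum the outer products over all locations, then vectorize (assumed form)."""
    d1, d2 = len(xs[0]), len(ys[0])
    Z = [[0.0] * d2 for _ in range(d1)]
    for x, y in zip(xs, ys):
        f = outer(x, y)
        for a in range(d1):
            for b in range(d2):
                Z[a][b] += f[a][b]
    z = [v for row in Z for v in row]  # global representation
    return Z, z


# Two locations, a 2-dim feature stream and a 3-dim feature stream.
xs = [[1.0, 2.0], [0.0, 1.0]]
ys = [[1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
Z, z = bilinear_pool(xs, ys)
print(Z)  # [[1.0, 0.0, 1.0], [4.0, 1.0, 2.0]]
print(len(z))  # 6 = 2 * 3
```

The length of z is the product of the two feature dimensions, so realistic feature sizes (hundreds or thousands per stream) make z very large, which is consistent with the redundancy and memory concerns raised in the excerpt.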

Fast RCNN

The idea is straightforward.
1) Instead of passing every region through the convolutional layers one by one, we pass the entire image once and produce a feature map.
2) We then take the region proposals as before (using some external method) and project them onto the feature map.
3) The regions now live in the feature map instead of the original image, and we can forward them through some fully connected layers to output the classification decision and the bounding-box correction.
Note that the projection of region proposals is implemented with a special layer (the ROI pooling layer), which is essentially a type of max pooling whose pool size depends on the input, so that the output always has the same size.
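To make the ROI pooling idea concrete, here is a minimal sketch of max pooling whose bin size adapts to the region so that the output is always the same size. It assumes the region is already given in feature-map coordinates as `(r0, c0, r1, c1)` with exclusive end indices and uses a 2x2 output; the function names and the exact bin-boundary rounding are my own choices, not the layer's reference implementation.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (r0, c0, r1, c1) into an out_h x out_w grid."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Floor the start and ceil the end so the bins cover the whole region.
        rs, re = r0 + (i * h) // out_h, r0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            cs, ce = c0 + (j * w) // out_w, c0 + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
big = roi_max_pool(fm, (0, 0, 4, 6))    # 4x6 region -> [[9, 12], [21, 24]]
small = roi_max_pool(fm, (1, 1, 4, 4))  # 3x3 region -> [[15, 16], [21, 22]]
print(big, small)
```

Regions of different sizes both come out as 2x2 grids, which is what lets the following fully connected layers accept them.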

R-CNN

Given an image with multiple objects, we generate some ROIs using a proposal method (Selective Search) and warp the regions to a fixed size. Each region is then forwarded through a CNN (such as AlexNet); an SVM makes a classification decision for each region, and a regressor predicts a correction for each bounding box. This correction refines the proposed region, which may be in the right position but not have the exact size and orientation.
Although the model produces good results, it suffers from one main issue: it is quite slow and computationally expensive. Imagine that in an average case we produce 2000 regions, which we need to store on disk, and we forward each of them through the CNN for multiple passes until it is trained.
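The bounding-box correction mentioned above can be sketched as follows. The excerpt does not give the parameterization, so this assumes the widely used one in which the regressor outputs offsets `(dx, dy, dw, dh)`: the center shifts are scaled by the box size and the size changes are applied in log space. The function name and the sample numbers are illustrative.

```python
import math


def apply_box_deltas(box, deltas):
    """Correct a proposal box = (cx, cy, w, h) with deltas = (dx, dy, dw, dh)."""
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (
        cx + w * dx,       # shift the center by a fraction of the width
        cy + h * dy,       # shift the center by a fraction of the height
        w * math.exp(dw),  # rescale the width (log-space delta)
        h * math.exp(dh),  # rescale the height (log-space delta)
    )


proposal = (50.0, 40.0, 20.0, 10.0)
refined = apply_box_deltas(proposal, (0.1, -0.2, math.log(2.0), 0.0))
print(refined)  # center moved to about (52, 38), width doubled, height unchanged
```

With all deltas equal to zero the proposal is returned unchanged, so the regressor only has to learn small adjustments.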

DeepSnake

Traditional snake algorithms: Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. In this way, active contour models can deform the contour toward the object boundary. Because the energy functional is typically nonconvex, the deformation process tends to find locally optimal solutions.
Network Architecture: Deep snake consists of three parts: a backbone, a fusion block, and a prediction head. The backbone comprises 8 “CirConvBn-ReLU” layers and uses residual skip connections for all layers. The fusion block aims to fuse the information across all contour points at multiple scales.
Detail: Deep snake is added to an object detection model. The detector first produces detected boxes, which are used to construct diamond contours. Then deep snake deforms the diamond vertices to obje...

Batch Normalization

BatchNorm addresses the internal covariate shift problem by normalizing layer inputs, which makes it feasible to use a large learning rate to accelerate network training.
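A minimal sketch of the normalization step, assuming the usual training-mode forward pass: each feature is normalized with the mean and variance of the current mini-batch, then scaled and shifted by learnable parameters. The names `gamma`, `beta`, and `eps` follow common convention and are assumptions here; running statistics for inference are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply scale and shift.

    batch: list of samples, each a list of d features.
    gamma, beta: per-feature scale and shift (lists of length d).
    """
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n  # biased batch variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (col[i] - mean) * inv_std + beta[j]
    return out


# Two features on very different scales end up on the same scale.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normed = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normed)
```

After normalization both columns have approximately zero mean and unit variance regardless of their original scale, which is the property that keeps the input distribution of the next layer stable.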

Class Imbalance

When negative samples far outnumber positive samples, two approaches can be tried to deal with the large class imbalance: 1) Focal loss: use the focal loss as the loss on the output of the classification subnet; 2) Hard example mining: add hard negative samples gradually.
Focal loss: addresses class imbalance by reshaping the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. Focal loss is designed for the one-stage object detection scenario, in which there is an extreme imbalance between foreground and background classes during training.
How are the hard negatives to be included in the loss chosen? First, N negative samples are randomly selected as a candidate pool. Second, the negative samples in this pool are sorted in descending order of their classification confidence scores, and the top n samples are selected as the hard negatives.
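Both techniques can be sketched briefly. The focal loss below uses the standard binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); the default values gamma = 2 and alpha = 0.25 are commonly quoted settings and are assumptions here, as are the function names and the interpretation of "confidence score" as the predicted probability that a negative sample is positive.

```python
import math
import random


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss; p is the predicted probability of the positive class."""
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def select_hard_negatives(neg_scores, pool_size, top_n, seed=0):
    """Randomly draw a candidate pool, then keep the top_n highest-scoring negatives."""
    rng = random.Random(seed)
    pool = rng.sample(range(len(neg_scores)), min(pool_size, len(neg_scores)))
    pool.sort(key=lambda i: neg_scores[i], reverse=True)
    return pool[:top_n]


easy = focal_loss(0.9, 1)  # well-classified positive: small loss
hard = focal_loss(0.1, 1)  # misclassified positive: much larger loss
scores = [0.1, 0.9, 0.3, 0.8, 0.05]
hard_ids = select_hard_negatives(scores, pool_size=5, top_n=2)
print(easy, hard, hard_ids)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross entropy, which makes the role of gamma easy to check.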

Logistic vs Softmax

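The logistic function, i.e. the sigmoid function, is also one of the most commonly used activation functions in neural networks and is applied to classification and regression. Logistic regression specifically targets binary classification. The softmax function is often used in the last layer of a neural network, as the output layer, to solve multiclass problems; from this angle, the logistic function can be understood as a special case of the softmax function.
Softmax models a multinomial distribution, whereas logistic is based on the Bernoulli distribution.
Multiclass classification with softmax regression treats the classes as mutually exclusive, i.e. an input can be assigned to only one class. Multiclass classification with multiple logistic regressions yields classes that are not mutually exclusive; for example, the word "apple" belongs to both the "fruit" class and the "food" class.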
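The "special case" relationship can be checked numerically: a two-class softmax over the logits [z, 0] gives the same probability as sigmoid(z). The sketch below also contrasts mutually exclusive softmax outputs with independent sigmoid outputs; the logit values are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(zs):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Two-class softmax with the second logit fixed at 0 equals the sigmoid.
for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert abs(softmax([z, 0.0])[0] - sigmoid(z)) < 1e-12

logits = [2.0, 1.5]                          # e.g. scores for "fruit" and "food"
exclusive = softmax(logits)                  # probabilities compete and sum to 1
independent = [sigmoid(z) for z in logits]   # each label is decided on its own
print(exclusive, independent)
```

In the independent case both labels can exceed 0.5 at once, which matches the "apple" example, while the softmax outputs always sum to 1.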

CNN Layers

Four main layers:
Convolutional layer: output neurons are connected to local regions in the input
ReLU layer: elementwise activation function
Pooling layer: performs a downsampling operation along the spatial dimensions
Fully-connected layer: same as in regular neural networks
Filters act as feature detectors on the original image. The network learns filters that activate when they see some type of visual feature.
ReLU converges much faster than sigmoid/tanh in practice.
The pooling layer makes representations smaller and more manageable, and helps control overfitting.
CNNs have far fewer connections and parameters, which makes them easier to train; a traditional fully-connected neural network is almost impossible to train when initialized randomly.
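The claim about parameter counts can be made concrete with a small worked calculation. The layer sizes below (a 32x32 RGB input, 16 filters of size 3x3) are arbitrary illustrative choices, and the helper names are my own.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a conv layer with square kernels (shared across positions)."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


# Map a 32x32x3 input to a 32x32x16 output in two different ways.
conv = conv_params(kernel=3, c_in=3, c_out=16)  # 448
fc = fc_params(32 * 32 * 3, 32 * 32 * 16)       # 50348032
print(conv, fc, fc // conv)
```

Because the convolutional filters are shared across all spatial positions, the convolutional layer needs 448 parameters where a fully-connected layer producing an output of the same size would need about 50 million.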