Image Similarity using Deep Ranking

Image similarity: the measure of how similar two images are. In other words, it quantifies the degree of similarity between intensity patterns in two images.

Triplet: a triplet contains a query image, a positive image and a negative image.

How to measure the similarity of two images?

L1-norm: Manhattan distance
L2-norm: Euclidean distance
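
As a minimal sketch of the two distances, assuming each image has already been flattened into a list of numbers of equal length (the helper names and pixel values here are illustrative only):

```python
import math

def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two tiny "images" flattened to intensity vectors (made-up values).
img_a = [0.0, 3.0, 1.0]
img_b = [4.0, 0.0, 1.0]
print(l1_distance(img_a, img_b))  # 7.0
print(l2_distance(img_a, img_b))  # 5.0
```

A smaller distance means the two vectors, and so the two images, are more alike under that norm.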

Loss function:
The whole deep ranking architecture can be thought of as a function that maps an image to a point in Euclidean space. The goal is to learn an embedding function that assigns a smaller distance to more similar images:
D(f(p_i), f(p_i+)) < D(f(p_i), f(p_i-))
for all triplets <p_i, p_i+, p_i-> such that r(p_i, p_i+) > r(p_i, p_i-)

f is the embedding function that maps an image to a vector. p_i is the query image, p_i+ is the positive image, p_i- is the negative image, and r is the pairwise relevance score, which measures how similar two images are (a higher score means more similar).

The hinge loss for the triplet is defined as:
l(p_i, p_i+, p_i-) = max{0, g + D(f(p_i), f(p_i+)) - D(f(p_i), f(p_i-))}

l is the hinge loss for the triplet, g is a gap parameter that regularizes the gap between the distances of the two image pairs (p_i, p_i+) and (p_i, p_i-), and D is the Euclidean distance between two points in the embedding space.
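
The loss can be sketched in a few lines. This is only an illustration: the embeddings are hard-coded lists standing in for the outputs of f, and the function names and the default gap value are assumptions, not taken from the original.

```python
import math

def euclidean(u, v):
    """Euclidean distance D between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_hinge_loss(f_query, f_pos, f_neg, gap=1.0):
    """max{0, g + D(query, positive) - D(query, negative)}."""
    return max(0.0, gap + euclidean(f_query, f_pos) - euclidean(f_query, f_neg))

query = [0.0, 0.0]
positive = [3.0, 4.0]   # distance 5 from the query
negative = [6.0, 8.0]   # distance 10 from the query

# The positive is closer than the negative by more than the gap: no loss.
print(triplet_hinge_loss(query, positive, negative))  # 0.0
# Swapping the roles violates the ranking, so the loss is positive.
print(triplet_hinge_loss(query, negative, positive))  # 6.0
```

The loss is zero once the negative is farther from the query than the positive by at least g, so only triplets that violate the ranking contribute to training.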

Network Architecture:
The most crucial component is to learn an image embedding function f. A deep learning technique is employed to learn image similarity models directly from images.



ConvNet architecture:

The network has three parts. The ConvNet encodes strong invariance and captures the image semantics. The other two parts take down-sampled images and use a shallow network architecture; they have less invariance and capture the visual appearance. Finally, the embeddings from the three parts are normalized and combined with a linear embedding layer.
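
The final combination step might look roughly like the sketch below. It assumes the three part outputs are plain vectors, that "combine" means concatenation followed by a linear layer, and that the weights are given; the function names, shapes, and values are all hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def linear(weights, bias, x):
    """Linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def combine_parts(convnet_out, shallow_out_1, shallow_out_2, weights, bias):
    """Normalize each part, concatenate (an assumption), then embed linearly."""
    parts = (convnet_out, shallow_out_1, shallow_out_2)
    joint = [x for part in parts for x in l2_normalize(part)]
    return linear(weights, bias, joint)

# Toy 2-dimensional outputs for each part and a made-up 2 x 6 weight matrix.
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.0]
embedding = combine_parts([3.0, 4.0], [0.0, 2.0], [5.0, 0.0], weights, bias)
print(embedding)  # approximately [0.6, 2.0]
```

Normalizing each part first keeps one part from dominating the combined embedding just because its activations are larger in magnitude.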

A convolutional layer takes an image or the feature maps of another layer as input, convolves it with a set of k learnable kernels, and passes the result through an activation function to generate k feature maps. The convolutional layer can be considered as a set of local feature detectors.
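
A toy version of such a layer, for a single-channel image stored as nested lists. This is a sketch under assumptions: it uses "valid" cross-correlation (what most deep learning libraries call convolution), ReLU as the activation, and fixed kernels instead of learned ones.

```python
def relu(x):
    """Activation function: keep positive responses, zero out the rest."""
    return max(0.0, x)

def conv2d(image, kernels):
    """Slide each of the k kernels over the image; return k feature maps."""
    h, w = len(image), len(image[0])
    feature_maps = []
    for kernel in kernels:
        kh, kw = len(kernel), len(kernel[0])
        fmap = [[relu(sum(image[i + di][j + dj] * kernel[di][dj]
                          for di in range(kh) for dj in range(kw)))
                 for j in range(w - kw + 1)]
                for i in range(h - kh + 1)]
        feature_maps.append(fmap)
    return feature_maps

# An image that is dark on the left and bright on the right.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
kernels = [[[-1.0, 1.0]],   # detects dark-to-bright edges
           [[1.0, -1.0]]]   # detects bright-to-dark edges
maps = conv2d(image, kernels)
print(maps[0])  # [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
print(maps[1])  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

Each kernel fires only where its local pattern occurs, which is the sense in which the layer acts as a set of local feature detectors.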

A max pooling layer takes the maximum over a local neighborhood around each pixel. This makes the feature maps robust to small translations.
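
The translation robustness can be seen in a small sketch. The 2x2 window and stride of 2 are assumed values chosen for illustration.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Take the maximum over each size x size window of a 2D feature map."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]

# A single strong activation, and the same activation shifted by one pixel.
original = [[0, 9, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0],
           [9, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
print(max_pool2d(original))  # [[9, 0], [0, 0]]
print(max_pool2d(shifted))   # [[9, 0], [0, 0]]
```

Both inputs pool to the same output because the shift stays inside one pooling window; a shift that crosses a window boundary would change the result.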

A local normalization layer normalizes the feature map around a local neighborhood to have unit norm and zero mean. It leads to feature maps that are robust to the differences in illumination and contrast.
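
To illustrate why this helps, the sketch below normalizes a single neighborhood, flattened to a list. It is a simplification: a real layer would apply this around every position of the feature map, and the exact normalization in the original network may differ.

```python
import math

def normalize_neighborhood(values, eps=1e-12):
    """Shift a neighborhood to zero mean, then scale it to unit L2 norm."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / max(norm, eps) for c in centered]

patch = [1.0, 2.0, 3.0, 6.0]
# Same patch with higher contrast (x2) and brighter illumination (+10).
altered = [2.0 * v + 10.0 for v in patch]

print(normalize_neighborhood(patch))
print(normalize_neighborhood(altered))  # same values up to rounding
```

Subtracting the mean removes the constant brightness offset and dividing by the norm removes the contrast scaling, so both patches end up with the same representation.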

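Triplet Sampling:
The positive image p_i+ is sampled from the images that share a category with the query image p_i, with probability:
P(p_i+) = min{T_p, r_i,i+} / Z_i
T_p is a threshold parameter.
Z_i is the normalization constant, computed over all candidate positive images p_i+ that share a category with p_i, so that the probabilities sum to 1.

A triplet is kept only if the margin between its two relevance scores is at least a threshold T_r:
r_i,i+ - r_i,i- >= T_r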
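
A small sketch of both rules, assuming the relevance scores to the query are already available in a dictionary and that Z_i is the sum of the clipped scores. The image ids, scores, and thresholds are made up for illustration.

```python
def positive_sampling_probs(relevances, t_p):
    """Map each candidate positive id to min{T_p, r} / Z_i."""
    clipped = {img: min(t_p, r) for img, r in relevances.items()}
    z = sum(clipped.values())  # normalization constant Z_i (assumed form)
    return {img: c / z for img, c in clipped.items()}

def satisfies_margin(r_pos, r_neg, t_r):
    """Keep a triplet only if r_i,i+ - r_i,i- >= T_r."""
    return r_pos - r_neg >= t_r

# Relevance of three same-category candidates to the query image.
relevances = {"a": 0.2, "b": 0.9, "c": 0.5}
probs = positive_sampling_probs(relevances, t_p=0.5)
print(probs)

print(satisfies_margin(0.9, 0.2, t_r=0.5))  # True
print(satisfies_margin(0.5, 0.4, t_r=0.5))  # False
```

Clipping at T_p means every image whose relevance exceeds the threshold is equally likely to be picked, so "b" and "c" get the same probability here.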


Online Triplet Sampling:
1. Create a set of buffers to store images. Each buffer has a fixed capacity and stores images from the same category.
2. For a new image p_j, compute its key k_j = u_j^(1/r_j), where r_j is its total relevance score and u_j is a number sampled uniformly from 0 to 1.
3. Find the buffer corresponding to p_j according to its category c_j. If the buffer is not full, insert p_j into the buffer with key k_j. Otherwise, select the image p_j' with the smallest key k_j' in the buffer.
4. If k_j > k_j', replace p_j' with p_j in the buffer. Otherwise, discard p_j.
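
The four steps above can be sketched as a small class. This is an illustrative version only: the class and method names are hypothetical, images are represented by ids, and the total relevance score is assumed to be positive and supplied by the caller.

```python
import random

class CategoryBuffers:
    """One fixed-capacity buffer per category, filled by keyed sampling."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.rng = rng if rng is not None else random.Random()
        self.buffers = {}  # category -> list of (key, image_id)

    def offer(self, image_id, category, relevance):
        """Try to store an image; return True if it ends up in a buffer."""
        u = self.rng.random()              # u_j, uniform in [0, 1)
        key = u ** (1.0 / relevance)       # k_j = u_j^(1/r_j)
        buf = self.buffers.setdefault(category, [])
        if len(buf) < self.capacity:       # buffer not full: insert
            buf.append((key, image_id))
            return True
        # Buffer full: find the stored image with the smallest key.
        smallest = min(range(len(buf)), key=lambda i: buf[i][0])
        if key > buf[smallest][0]:         # new key wins: replace
            buf[smallest] = (key, image_id)
            return True
        return False                       # otherwise discard the new image

buffers = CategoryBuffers(capacity=3, rng=random.Random(0))
for j in range(100):
    buffers.offer(image_id=j, category="cat", relevance=1.0 + j % 5)
print(len(buffers.buffers["cat"]))  # 3
```

Because u^(1/r) moves closer to 1 as r grows, images with a higher total relevance score tend to get larger keys and are more likely to stay in the buffer.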


Adapted from https://medium.com/@akarshzingade/image-similarity-using-deep-ranking-c1bd83855978


Composing Text and Image for Image Retrieval--An Empirical Odyssey

Feature composition between text and images has been extensively studied in the field of vision and language, especially in visual question answering (VQA).

The goal is to learn an embedding space for text+image query and for target images, such that matching (query, image) pairs are close.

The query/reference image x is encoded using a ResNet-17 CNN to get a 2d spatial feature vector.

The query text t is encoded using a standard LSTM. 

Finally, the image feature and the text feature are composed into a single feature that represents the text+image query.
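
The notes do not say how the two features are composed, so the sketch below uses the simplest baseline as a stand-in: average-pool the spatial image feature into a vector, concatenate it with the text feature, and apply a linear map. The function names, shapes, and weights are hypothetical, and this is not claimed to be the composition used in the paper.

```python
def global_average_pool(feature_map):
    """Collapse a W x H x C spatial feature map into a length-C vector."""
    cells = [cell for row in feature_map for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells)
            for c in range(channels)]

def compose(image_feat, text_feat, weights):
    """Placeholder composition: concatenate, then project linearly."""
    joint = image_feat + text_feat
    return [sum(w * x for w, x in zip(row, joint)) for row in weights]

# Toy 2 x 2 spatial map with 2 channels, standing in for the CNN output.
feature_map = [[[1.0, 0.0], [3.0, 0.0]],
               [[1.0, 4.0], [3.0, 4.0]]]
text_feat = [1.0, -1.0]          # stand-in for the LSTM output
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]

image_feat = global_average_pool(feature_map)
print(image_feat)                               # [2.0, 2.0]
print(compose(image_feat, text_feat, weights))  # [3.0, 1.0]
```

The composed vector lives in the same space as the target image embeddings, so a matching (query, image) pair can be scored by the distance between them.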


SMILY:
The first step is to create the SMILY database for searching.

The core algorithm of this step is a CNN that condenses image information into a numerical feature vector, termed an embedding.

When embeddings are computed across image patches cropped from slides, the result is a database of patches together with a numerical summary of each patch's information content.

When a region of interest (ROI) is selected for searching, the embedding for the query image is computed and compared with those in the database to retrieve the most similar patches.
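
The search step can be sketched as a nearest-neighbor lookup. This is only an illustration: the patch ids and embeddings are made up, the database is a plain list, and Euclidean distance is an assumed choice of comparison.

```python
import heapq
import math

def l2(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def search(query_embedding, database, k=2):
    """Return the ids of the k patches whose embeddings are closest."""
    nearest = heapq.nsmallest(
        k, database, key=lambda item: l2(query_embedding, item[1]))
    return [patch_id for patch_id, _ in nearest]

# Database of (patch id, embedding) pairs built ahead of time.
database = [("patch_a", [0.0, 0.0]),
            ("patch_b", [1.0, 1.0]),
            ("patch_c", [5.0, 5.0])]

# Embedding of the selected region, as the CNN would produce it.
query_embedding = [0.9, 1.2]
print(search(query_embedding, database))  # ['patch_b', 'patch_a']
```

A linear scan like this is fine for a toy database; a large collection of patches would normally need an approximate nearest-neighbor index instead.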
