Reading TransFG

TransFG:

    1) verifies the effectiveness of the vision transformer on fine-grained visual classification (FGVC), offering an alternative to the dominant CNN-backbone-with-RPN model designs

    2) naturally focuses on the most discriminative regions of objects and achieves SOTA performance

    3) visualizations illustrate the model's ability to capture discriminative image regions


Methods:

    1) vision transformer as feature extractor

        image sequentialization: first preprocess the input image into a sequence of flattened patches, generated with a sliding window so that neighboring patches overlap and local discriminative regions are not split apart

    2) TransFG architecture: proposes the Part Selection Module (PSM), which uses accumulated attention weights to pick the most discriminative tokens as input to the last transformer layer, and applies contrastive feature learning to enlarge the distance between representations of similar sub-categories

    3) contrastive feature learning: minimizes the similarity of classification tokens from samples with different labels and maximizes the similarity of classification tokens from samples with the same label
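
The image sequentialization in step 1 can be sketched as follows (a minimal numpy sketch; patch size 16 and stride 12 are illustrative values, not necessarily the paper's exact configuration):

```python
import numpy as np

def sequentialize(image, patch_size=16, stride=12):
    """Split an (H, W, C) image into flattened overlapping patches
    using a sliding window; stride < patch_size makes patches overlap."""
    H, W, C = image.shape
    patches = []
    for top in range(0, H - patch_size + 1, stride):
        for left in range(0, W - patch_size + 1, stride):
            patch = image[top:top + patch_size, left:left + patch_size]
            patches.append(patch.ravel())
    return np.stack(patches)  # (N, patch_size * patch_size * C)

img = np.random.rand(224, 224, 3)
seq = sequentialize(img)
print(seq.shape)  # (324, 768): ((224-16)//12 + 1)**2 patches of 16*16*3 values
```

Each row of the output is one flattened patch, ready for the linear projection into token embeddings.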
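
The Part Selection Module in step 2 can be sketched roughly like this, assuming attention weights are accumulated by matrix-multiplying them across layers and the patch token most attended by the classification token is kept per head (a simplified sketch, not the paper's full implementation):

```python
import numpy as np

def part_selection(attn_per_layer, tokens):
    """Select discriminative tokens via accumulated attention.

    attn_per_layer: list of (heads, N+1, N+1) attention matrices,
                    with the classification token at index 0.
    tokens: (N+1, D) hidden states entering the last layer.
    Returns the CLS token plus one selected patch token per head.
    """
    total = attn_per_layer[0]
    for a in attn_per_layer[1:]:
        total = a @ total                # accumulate attention across layers
    cls_attn = total[:, 0, 1:]           # (heads, N): CLS attention to patches
    idx = cls_attn.argmax(axis=1) + 1    # top patch token index per head
    return np.concatenate([tokens[:1], tokens[idx]], axis=0)

rng = np.random.default_rng(0)
attn = [rng.random((4, 9, 9)) for _ in range(3)]  # 3 layers, 4 heads, 8 patches
tokens = rng.random((9, 16))
out = part_selection(attn, tokens)
print(out.shape)  # (5, 16): CLS token + 4 selected tokens
```

The selected tokens, together with the CLS token, form the input sequence to the final transformer layer, discarding the less informative patches.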
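
The contrastive objective in step 3 can be sketched as below, assuming cosine similarity on L2-normalized classification tokens and a hinge on negative pairs (the margin value is illustrative):

```python
import numpy as np

def contrastive_loss(z, labels, margin=0.4):
    """Pull CLS tokens with the same label together and push those with
    different labels apart beyond a margin (margin is an assumed value)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T                                     # pairwise cosine sims
    same = labels[:, None] == labels[None, :]
    pos = (1.0 - sim)[same].sum()                     # same-label: want sim -> 1
    neg = np.maximum(sim - margin, 0.0)[~same].sum()  # diff-label: want sim < margin
    return (pos + neg) / (len(z) ** 2)

# Identical same-class tokens and orthogonal cross-class tokens give zero loss.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1])
print(contrastive_loss(z, labels))  # 0.0
```

Averaging over all pairs keeps the loss scale independent of batch size.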

    Paper source: https://arxiv.org/pdf/2103.07976.pdf

He, J., Chen, J.N., Liu, S., Kortylewski, A., Yang, C., Bai, Y., Wang, C. and Yuille, A., 2021. TransFG: A Transformer Architecture for Fine-grained Recognition. arXiv preprint arXiv:2103.07976.
