Reading TransFG
TransFG:
1) verifies the effectiveness of the vision transformer on fine-grained visual classification, offering an alternative to the dominant CNN backbone with RPN model design
2) naturally focuses on the most discriminative regions of the objects and achieves SOTA performance
3) visualizations show its ability to capture discriminative image regions
Methods:
1) vision transformer as feature extractor
image sequentialization: the input image is first preprocessed into a sequence of flattened patches; the patches are generated with a sliding window so that neighboring patches overlap
2) TransFG architecture: proposes the Part Selection Module (PSM) and applies contrastive feature learning to enlarge the distance between representations of similar sub-categories
3) contrastive feature learning: minimizes the similarity of classification tokens with different labels and maximizes the similarity of classification tokens of samples with the same label
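
To make the image sequentialization step concrete, here is a minimal sketch in plain Python (no dependencies). The function names and the list-of-rows image format are my own choices for illustration, not the paper's code. It assumes a square patch of side `patch` slid with step `stride`, so the patch count is floor((H - P + S) / S) * floor((W - P + S) / S); setting `stride` equal to `patch` gives the usual non-overlapping split.

```python
def num_patches(height, width, patch, stride):
    """Number of patches produced by a sliding window (hypothetical helper)."""
    rows = (height - patch + stride) // stride
    cols = (width - patch + stride) // stride
    return rows * cols


def split_overlapping(image, patch, stride):
    """Cut a 2-D image (list of rows) into flattened patches.

    Patches are visited top-to-bottom, left-to-right, and each patch is
    flattened row by row. Neighboring patches overlap when stride < patch.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            flat = [image[top + r][left + c]
                    for r in range(patch)
                    for c in range(patch)]
            patches.append(flat)
    return patches


# Toy 4x4 single-channel "image" with pixel values 0..15.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = split_overlapping(toy, patch=2, stride=1)
```

With illustrative settings of a 448x448 input and 16x16 patches, `num_patches(448, 448, 16, 12)` gives a longer sequence than the non-overlapping `num_patches(448, 448, 16, 16)`, which is the cost of the overlap.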
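
The contrastive feature learning step can be sketched the same way. This is a rough illustration under my own assumptions, not the authors' implementation: similarity is cosine similarity between classification tokens, same-label pairs are penalized by (1 - similarity), different-label pairs are penalized only when their similarity exceeds a margin, and the sum is averaged over all N*N pairs in the batch. The `margin=0.4` default and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(tokens, labels, margin=0.4):
    """Batch contrastive loss over classification tokens (sketch).

    tokens: list of N feature vectors, one classification token per sample
    labels: list of N class labels
    margin: assumed threshold; different-label pairs that are already
            less similar than this contribute nothing to the loss
    """
    n = len(tokens)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = cosine(tokens[i], tokens[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim              # pull same-label tokens together
            else:
                total += max(sim - margin, 0.0)  # push different-label tokens apart
    return total / (n * n)


# Two identical tokens with different labels are penalized;
# orthogonal tokens with different labels are not.
loss_confused = contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [0, 1])
loss_separated = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1])
```

The margin keeps easy negatives (different-label pairs that are already dissimilar) from dominating the loss, so the gradient concentrates on confusable sub-categories.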
Paper source: https://arxiv.org/pdf/2103.07976.pdf
He, J., Chen, J.N., Liu, S., Kortylewski, A., Yang, C., Bai, Y., Wang, C. and Yuille, A., 2021. TransFG: A Transformer Architecture for Fine-grained Recognition. arXiv preprint arXiv:2103.07976.