Reading CLIP
Learning directly from raw text about images is promising because it leverages a much broader source of supervision than fixed sets of predetermined object categories.
Task: predicting which caption goes with which image. This simple contrastive pre-training task is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
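A minimal PyTorch sketch of this contrastive objective, loosely following the numpy-style pseudocode in the paper's Figure 3. The batch size, embedding dimension, and fixed temperature below are illustrative placeholders (CLIP learns the temperature), and random tensors stand in for the encoder outputs:

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # image_emb, text_emb: [n, d]; row i of each side is a matched pair.
        # L2-normalize so a dot product equals cosine similarity.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # n x n matrix of scaled pairwise similarities.
        logits = image_emb @ text_emb.t() / temperature

        # The n correct pairings lie on the diagonal.
        labels = torch.arange(logits.size(0), device=logits.device)

        # Symmetric cross-entropy: image->text and text->image, averaged.
        loss_i = F.cross_entropy(logits, labels)
        loss_t = F.cross_entropy(logits.t(), labels)
        return (loss_i + loss_t) / 2

    # Random embeddings standing in for image/text encoder outputs.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))

Within a batch of n pairs, the model must pick out the n real pairings among the n x n possible ones, which is what makes the objective scale with batch size.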
After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.
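As a usage-level illustration of zero-shot transfer, here is a classification sketch using the openai/CLIP reference implementation (installable from github.com/openai/CLIP); the image file name and candidate prompts are made-up placeholders:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Placeholder inputs: any image plus natural-language class prompts.
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize(["a photo of a dog",
                           "a photo of a cat",
                           "a photo of a car"]).to(device)

    with torch.no_grad():
        # Scaled cosine similarities between the image and each prompt.
        logits_per_image, logits_per_text = model(image, texts)
        probs = logits_per_image.softmax(dim=-1)

    print(probs)  # the highest-probability prompt is the predicted class

No task-specific training data is needed; swapping in a different prompt list re-targets the classifier to a new label set.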
[Paper Source]
https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G. and Sutskever, I., 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.