Abstract
Sketch-based Image Retrieval (SBIR) is a widely studied task in computer vision in which a model must produce a bimodal sketch-photo feature space. Training such a model normally requires sketch-photo pairs to fit a bimodal neural network. However, paired data is often unavailable in real scenarios, such as eCommerce search engines. To address this problem, we can leverage self-supervised learning strategies to learn the sketch-photo space, a direction that has yet to be explored. This work therefore presents a study of diverse self-supervised methodologies adapted to the SBIR domain. Here, self-supervised means the model never accesses real sketch-photo pairs; instead, training relies only on pseudo-sketches generated on the fly. To the best of our knowledge, our study is the first to explore diverse self-supervised mechanisms for SBIR (S3BIR). Our results show the outstanding performance of contrastive models such as SimCLR and CLIP adapted to SBIR under a self-supervised regime. S3BIR-CLIP is the most effective model, achieving a mAP of 54.03% on Flickr15K, 45.38% on eCommerce, and 13.80% on QD. On the eCommerce dataset, we improve performance by around 20% w.r.t. previously published results. In terms of resource consumption, however, S3BIR-SimCLR is the most competitive.
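To make the self-supervised setup concrete, below is a minimal illustrative sketch of the kind of training loop the abstract describes: photos are paired with pseudo-sketches generated from themselves, and a contrastive objective aligns the two modalities. The specifics here are assumptions, not the paper's method: Sobel edge maps stand in for the pseudo-sketch generator, a tiny CNN stands in for the encoders, and a SimCLR-style InfoNCE loss is used.

```python
# Illustrative S3BIR-style training step (assumptions: Sobel pseudo-sketches,
# toy CNN encoders, InfoNCE loss). Not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sobel_pseudo_sketch(photos: torch.Tensor) -> torch.Tensor:
    """Approximate a pseudo-sketch as the gradient magnitude of the grayscale photo."""
    gray = photos.mean(dim=1, keepdim=True)  # (B, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=photos.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    return edges / (edges.amax(dim=(2, 3), keepdim=True) + 1e-8)

class Encoder(nn.Module):
    """Tiny encoder mapping images to L2-normalized embeddings (one per modality)."""
    def __init__(self, in_ch: int, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def info_nce(z_sketch: torch.Tensor, z_photo: torch.Tensor, tau: float = 0.07):
    """InfoNCE: the i-th pseudo-sketch must match the i-th photo within the batch."""
    logits = z_sketch @ z_photo.t() / tau                # (B, B) similarities
    targets = torch.arange(z_sketch.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One training step over a batch of unlabeled photos: no human-drawn sketches needed.
photo_enc, sketch_enc = Encoder(3), Encoder(1)
opt = torch.optim.Adam(
    list(photo_enc.parameters()) + list(sketch_enc.parameters()), lr=1e-3)
photos = torch.rand(8, 3, 64, 64)                        # stand-in for real photos
loss = info_nce(sketch_enc(sobel_pseudo_sketch(photos)), photo_enc(photos))
opt.zero_grad(); loss.backward(); opt.step()
print(f"contrastive loss: {loss.item():.4f}")
```

Two separate encoder branches are used to match the bimodal network described in the abstract; a shared-weight (Siamese) encoder would be an equally plausible variant.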
| Original language | English |
|---|---|
| Journal | Neural Computing and Applications |
| DOIs | |
| State | Accepted/In press - 2025 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2025.
Keywords
- Representation learning
- Self-supervised learning
- Sketch-based image retrieval