A study on self-supervised sketch-based image retrieval on unpaired data (S3BIR)

Waldo Campos, Jose M. Saavedra*, Christopher Stears

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Sketch-based Image Retrieval (SBIR) is a prevalent task in computer vision in which a model must produce a bimodal sketch-photo feature space. Training such a model typically requires sketch-photo pairs to fit a bimodal neural network. However, accessing paired data is impractical in real scenarios, such as eCommerce search engines. To address this problem, we can leverage self-supervised learning strategies to learn the sketch-photo space, a direction that has yet to be explored. This work therefore presents a study of the performance of diverse self-supervised methodologies adapted to the SBIR domain. Here, self-supervised means the model never accesses sketch-photo pairs and is trained only on pseudo-sketches generated during training. To our knowledge, our study is the first to explore diverse self-supervised mechanisms for SBIR (S3BIR). Our results show the outstanding performance of contrastive models such as SimCLR and CLIP adapted to SBIR under a self-supervised regime. S3BIR-CLIP is the most effective model, achieving a mAP of 54.03% on Flickr15K, 45.38% on eCommerce, and 13.80% on QD. On the eCommerce dataset, it improves performance by around 20% with respect to previously published results. In terms of resource consumption, however, S3BIR-SimCLR is the most competitive.
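The abstract describes contrastive training on photos paired only with pseudo-sketches generated on the fly. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation: it approximates pseudo-sketches with a Sobel edge map, uses a small placeholder encoder, and applies a symmetric InfoNCE loss; the paper instead adapts CLIP and SimCLR backbones, and names such as pseudo_sketch, TinyEncoder, and the temperature value are illustrative assumptions.

    # Minimal sketch (not the authors' code): a CLIP/SimCLR-style contrastive step
    # for self-supervised SBIR on unpaired data. Pseudo-sketches are approximated
    # here with a simple Sobel edge map; the paper's actual generator may differ.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def pseudo_sketch(photos: torch.Tensor) -> torch.Tensor:
        """Crude edge-map 'pseudo-sketch' from RGB photos (B, 3, H, W) -- an assumption."""
        gray = photos.mean(dim=1, keepdim=True)
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3).to(gray.device)
        ky = kx.transpose(2, 3)
        gx = F.conv2d(gray, kx, padding=1)
        gy = F.conv2d(gray, ky, padding=1)
        edges = torch.sqrt(gx ** 2 + gy ** 2)
        return edges / (edges.amax(dim=(2, 3), keepdim=True) + 1e-8)

    class TinyEncoder(nn.Module):
        """Placeholder encoder; the paper adapts CLIP / SimCLR backbones instead."""
        def __init__(self, in_ch: int, dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
            )
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    def contrastive_step(photos, photo_enc, sketch_enc, temperature=0.07):
        """InfoNCE over photo/pseudo-sketch pairs: each photo's positive is its own edge map."""
        sketches = pseudo_sketch(photos)
        zp = photo_enc(photos)     # (B, dim) photo embeddings
        zs = sketch_enc(sketches)  # (B, dim) pseudo-sketch embeddings
        logits = zp @ zs.t() / temperature
        targets = torch.arange(photos.size(0), device=logits.device)
        # symmetric loss over both retrieval directions, as in CLIP-style training
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Usage (toy data): photos = torch.rand(8, 3, 64, 64)
    # loss = contrastive_step(photos, TinyEncoder(3), TinyEncoder(1))

At retrieval time, the same encoders would embed a real query sketch and the photo gallery into the shared space, and nearest neighbours by cosine similarity give the ranking.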

Original language: English
Journal: Neural Computing and Applications
DOIs
State: Accepted/In press - 2025

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2025.

Keywords

  • Representation learning
  • Self-supervised learning
  • Sketch-based image retrieval
