Research Article
SAM-CLIP Search: Faster Region-Based Image Similarity Matching Using Lightweight Segmentation & Contrastive Learning
Author(s): Dr. Umar M Mulani1, Dr. Mahavir A. Devmane2, Dr. Satpalsing Devising Rajput3, Pramod A. Kharade4, Sagar Baburao Patil5, Dr. Amol Rajmane6, Yogesh Kadam7, Dr. Anindita A Khade8, Yogesh Bodhe9, and Kuldeep Vayadande10*
Published In : International Journal of Electrical and Electronics Research (IJEER) Volume 13, Issue 4
Publisher : FOREX Publication
Published : 30 November 2025
e-ISSN : 2347-470X
Page(s) : 638-649
Abstract
Imagine a designer browsing an enormous image database for photos that contain both "a chair and a table," or a wildlife scientist trying to find every photo of "brown bears near water." Locating such specific combinations manually is time-consuming and cumbersome. To address this problem, SAM-CLIP Search is a region-based vision-language image search platform that combines the Segment Anything Model (SAM) with CLIP vision-language embeddings to offer flexible prompt-based segmentation. Unlike typical CBIR approaches, which often struggle with multi-object queries and cross-modal alignment, our approach makes precise image search with point, box, or text prompts feasible. We propose a Ranking Optimization Layer (ROL) that produces context-specific relevance scores by aggregating spatial overlap (IoU) with semantic embedding distance, and we substitute conventional FAISS indexing with a lightweight cosine-similarity approach to improve efficiency. Our method yields semantically and visually coherent matches on the COCO (val2017), Flickr30K, and Fashion200K benchmarks. While maintaining fast inference, SAM-CLIP Search outperforms baselines such as ViT+KNN, Deep Image Retrieval (DIR), and CLIP-only models on key metrics including Recall@K, mAP, and NDCG. A user study verifying its effectiveness on difficult scene searches demonstrates its suitability for high-impact applications such as content curation, surveillance, and medical image analysis. The model achieves a retrieval accuracy of 92%, with Recall@5 of 89.3%, Precision@5 of 87.1%, and an average top-1 similarity score of 0.78, making SAM-CLIP Search a more accurate and efficient approach to region-based image retrieval.
Keywords: Image Retrieval, Segment Anything Model, CLIP (Contrastive Language-Image Pretraining), Prompt-based Segmentation, Vision-Language Embeddings, Box Prompt, Point Prompt, Object Detection, Image Segmentation, Visual Search.
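The abstract names two mechanisms worth making concrete: the ROL, which fuses spatial overlap with embedding similarity, and the replacement of FAISS with a plain cosine-similarity scan. Below is a minimal Python sketch of one way these could fit together; the weighted-sum form, the `alpha` weight, and all function names are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product over the product of L2 norms; epsilon guards zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def iou(box_a, box_b) -> float:
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def rol_score(query_emb, region_emb, query_box, region_box, alpha=0.6):
    # Hypothetical ROL relevance: a weighted sum of semantic similarity
    # (CLIP embedding space) and spatial overlap between the SAM region
    # and the user's box prompt. alpha is an assumed tuning knob.
    return alpha * cosine_similarity(query_emb, region_emb) \
        + (1.0 - alpha) * iou(query_box, region_box)

def rank_regions(query_emb, query_box, regions, top_k=5):
    # Brute-force scan in place of a FAISS index: score every candidate
    # region (an (embedding, box) pair) and return the top-k by ROL score.
    scored = [(rol_score(query_emb, emb, query_box, box), idx)
              for idx, (emb, box) in enumerate(regions)]
    scored.sort(reverse=True)
    return scored[:top_k]
```

For moderate collection sizes, a brute-force scan over normalized embeddings can be competitive with an approximate index while avoiding FAISS's build and memory overhead, which is consistent with the efficiency argument made in the abstract.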
Dr. Umar M Mulani, MIT Art, Design and Technology University, Pune, India; Email: umar.mulani@gmail.com
Dr. Mahavir A. Devmane, VPPCOE & VA, Mumbai, India; Email: dmahavir@gmail.com
Dr. Satpalsing Devising Rajput, Pimpri Chinchwad University, Pune, India; Email: rajputsatpal@gmail.com
Pramod A. Kharade, Bharati Vidyapeeth College of Engineering, Kolhapur, India; Email: pramod.kharade@bharatividyapeeth.edu
Sagar Baburao Patil, Bharati Vidyapeeth College of Engineering, Kolhapur, India; Email: someone.sagar@gmail.com
Dr. Amol Rajmane, JSPM University, Pune, India; Email: amolbrajmane@gmail.com
Yogesh Kadam, Bharati Vidyapeeth's College of Engineering, Lavale, Pune, India; Email: yogesh.kadam@bharatividyapeeth.edu
Dr. Anindita A Khade, SVKM's NMIMS Deemed to be University, Navi Mumbai, Maharashtra, India; Email: aninditaac1987@gmail.com
Yogesh Bodhe, Government Polytechnic, Pune, India; Email: bodheyog@gmail.com
Kuldeep Vayadande, Vishwakarma Institute of Technology, Pune, India; Email: kuldeep.vayadande@gmail.com
References
[1] Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481-2495.
[2] Lytvyn, V., Peleshchak, R., Rishnyak, I., Kopach, B., & Gal, Y. (2024). Detection of similarity between images based on Contrastive Language-Image Pre-Training neural network. In COLINS (1) (pp. 94-104).
[3] Maji, S., & Bose, S. (2021). CBIR using features derived by deep learning. ACM/IMS Transactions on Data Science (TDS), 2(3), 1-24.
[4] Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., & Terzopoulos, D. (2021). Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3523-3542.
[5] Schiavo, A., Minutella, F., Daole, M., & Gomez, M. G. (2021). Sketches image analysis: Web image search engine using LSH index and DNN InceptionV3. arXiv preprint arXiv:2105.01147.
[6] Su, P. T., Lin, C. C., Chen, C. H., & Lee, J. C. (2022, October). Automatic lung cancer segmentation based on deep learning. In 2022 IET International Conference on Engineering Technologies and Applications (IET-ICETA) (pp. 1-2). IEEE.
[7] Üzüm, E. (2021, June). Deep learning-based image segmentation and classification for fashion detection on smartphones. In 2021 29th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
[8] Chen, D., Ye, B., Zhao, Z., Wang, F., Xu, W., & Yin, W. (2022, July). Change detection converter: Using semantic segmentation models to tackle change detection task. In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE.
[9] Wang, Y., Ha, T., Aldridge, K., Duddu, H., Shirtliffe, S., & Stavness, I. (2023). Weed mapping with convolutional neural networks on high-resolution whole-field images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 505-514).
[10] Zhang, C., Zhao, J., & Feng, Y. (2023, May). Research on semantic segmentation based on improved PSPNet. In 2023 International Conference on Intelligent Perception and Computer Vision (CIPCV) (pp. 1-6). IEEE.
[11] Sun, Y., & Ochiai, H. (2021). Maximum-likelihood-based performance enhancement of clipped and filtered OFDM systems with clipping noise cancellation. IEEE Wireless Communications Letters, 11(3), 448-452.
[12] Zhang, X., Kang, H., Cai, Y., & Jia, T. (2023, September). CLIP model for images to textual prompts based on top-k neighbors. In 2023 3rd International Conference on Electronic Information Engineering and Computer Science (EIECS) (pp. 821-824). IEEE.
[13] Chatterjee, R., Chakrabarty, S., & Bishwas, P. (2025, February). ClipXpert: Automated clip mining from video data for high-demand content. In 2025 3rd International Conference on Intelligent Systems, Advanced Computing and Communication (ISACC) (pp. 13-18). IEEE.
[14] Han, Y., & Li, Q. (2024, April). DFLM: A dynamic facial-language model based on CLIP. In 2024 9th International Conference on Intelligent Computing and Signal Processing (ICSP) (pp. 1132-1137). IEEE.
[15] Adil, M., Akhtar, N., Khan, S. S., & Shoaib, H. (2025, March). Enhancing text-to-video retrieval using CLIP-based deep learning approach. In 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT) (pp. 159-163). IEEE.
[16] COCO 2017 dataset. Kaggle. https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset
[17] Flickr30k dataset. Kaggle. https://www.kaggle.com/datasets/adityajn105/flickr30k
[18] Fashion200k dataset. Kaggle. https://www.kaggle.com/datasets/mayukh18/fashion200k-dataset
