Leveraging semantic segmentation for hybrid image retrieval methods

Content-based image retrieval (CBIR) is the task of finding the images in a database that are the most similar to an input query based on its visual characteristics. Several state-of-the-art methods based on visual features (bag of visual words, VLAD, etc.) or on recent deep learning techniques attempt to solve the CBIR problem. In particular, deep learning is a rapidly evolving field used in many vision applications, including CBIR. But even with the increasing performance of deep learning algorithms, this problem remains a challenge in computer vision. In this work, we propose three different methodologies combining deep learning-based semantic segmentation and visual features. We show experimentally that exploiting semantic information in the CBIR context leads to an increase in retrieval accuracy. We study the performance of the proposed approach on eight different datasets (Wang, Corel-10k, Corel-5k, GHIM-10K, MSRC V1, MSRC V2, Linnaeus and NUS-WIDE).


Introduction
The exponential increase in image acquisition and information technology has enabled the creation of large image datasets. Therefore, it is important to create data frameworks to effectively deal with these collected images. In particular, content-based image retrieval (CBIR) systems offer a solution to quickly find an image in a large amount of data.
CBIR is a fundamental step in many applications and can be used to solve a large variety of tasks. For example, when searching on the web or in a large image database, a user may have difficulty expressing their need in words. With a CBIR algorithm, this problem, called the intention gap, can be solved by providing an example image instead of a textual description. CBIR can also be very useful in robotics, where an image from an on-board camera can be used for visual localization. The same applies to augmented reality systems and many other applications.
CBIR is the task of retrieving the images similar to an input query from a dataset based on their content. A CBIR system (see Fig. 1) is typically based on three main steps: (1) feature extraction, (2) signature construction and (3) image retrieval. The performance of any proposed approach depends on the way the image signature is constructed. Therefore, the construction of image signatures is a key step and the core of a CBIR system. The state of the art mentions two main approaches used to retrieve the closest images: BoVW [1] (bag of visual words) and CNN [2] (convolutional neural network) descriptors for image retrieval.
Those methods make use of information such as color, shape and texture. A few authors propose to explicitly take into consideration the semantic information that can be extracted from the images. For example, [3][4][5][6] use classical semantic segmentation based on K-means. We believe that, thanks to the development of modern CNN architectures and training datasets for semantic segmentation, this information can be incorporated effectively into CBIR algorithms. The output of a segmentation network is a 2D map that associates a semantic label (class) with each pixel. This is a high-level representation suitable for building an image signature invariant to viewpoint and illumination. Based on the semantic content given by the semantic segmentation output and the bag of visual words model [7], we propose in this work three different ways of constructing the image signature to improve the CBIR task and image classification. We show that the use of semantic information offers potential for improvement over standard approaches, with benefits in terms of accuracy and computation time. This work is an extension of the framework we initially proposed in [8], with a new image representation proposal and a complete study of the framework through extensive experiments. Three methodologies are proposed to build the image signature, as well as a semantic filter to obtain our final image representation. Our contributions are as follows:
• a signature combining interest points and semantic information;
• a signature combining visual features and semantic information;
• a signature depending only on semantic information;
• a semantic filter able to neutralize and penalize the images which are semantically different from an input query.
Our experimental results highlight the potential of our proposals, with considerably better retrieval results than current state-of-the-art techniques on eight retrieval datasets. The rest of the paper is structured as follows: we provide a brief overview of convolutional neural network descriptors and bag of visual words-related works in Sect. 2. We explain our proposals in Sect. 3. We present the experiments on different datasets and discuss the results of our work in Sect. 4. Section 5 concludes the paper.
State of the art
Many CBIR systems have been proposed during the last years [9][10][11][12]. In the literature, there are three main families of methods for retrieving images by similarity: (1) methods based on visual features extracted from the image using visual descriptors, (2) learning methods based on deep learning architectures that construct a global signature extracted from the feature layer and (3) end-to-end CNN-based methods. Let us start by describing the methods based on visual features. Bag of visual words (BoVW) or bag of visual features (BoF) [7] is the most popular model used for image classification and similarity (see Fig. 2). BoVW proceeds as follows. For each image, the visual features are detected and extracted using a visual descriptor such as SIFT [13].
This step is repeated on all the images until all the visual descriptors in the dataset have been collected. Then a clustering step using K-means [14] is applied to the descriptors to build the visual vocabulary (visual words) from the center of each cluster. To obtain the bag of visual words, each feature extracted from the query image is replaced by the index of its nearest visual word according to the Euclidean distance. Finally, the image is described as a histogram of the frequencies of the visual words. Inspired by BoVW, the vector of locally aggregated descriptors (VLAD) [15] improves over BoVW by assigning each visual feature to its nearest visual word and accumulating, for each visual word, the differences between the word and its assigned features. Fisher vector encoding [16] uses a GMM [17] to construct the visual word dictionary. VLAD and Fisher vectors are similar, but VLAD does not store second-order information about the features and uses K-means instead of a GMM. Many descriptors have been proposed to encode the local image features into a vector. Scale-invariant feature transform (SIFT) [18] and speeded up robust features (SURF) [19] are the most used descriptors in CBIR. Interesting work from Arandjelović and Zisserman [20] introduces an improvement by upgrading SIFT to RootSIFT. In [21], a novel multi-scale 2D feature detection and description algorithm is presented. Inspired by the LBP [22] descriptor, [23] proposes a novel method for image description with multi-channel decoded LBPs and [24] proposes a novel descriptor using local tetra patterns (LTrPs). [25] presents a new image feature description based on the local wavelet pattern (LWP) for medical image retrieval. The authors in [26] propose a descriptor robust and invariant to rotation and illumination. Another method inspired by BoVW is the bag of visual phrases (BoVP) [27][28][29]. BoVP describes the image as a matrix of visual phrase occurrences instead of a vector as in BoVW. The idea is to link two or more visual words by a criterion; the phrases can then be constructed in different ways (sliding windows, k-nearest neighbors, graphs). In [28], local regions are grouped by single-linkage clustering. [29] groups each key point with its closest spatial neighbors using the L2 distance. [30] proposes a framework between local and global histograms of visual words. The image features can be encoded and then extracted based on color [31][32][33][34], texture [35,36] or shape [37,38].
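To make the BoVW pipeline above concrete, the following sketch shows its two stages: vocabulary construction with K-means, then histogram encoding by nearest visual word. It is a minimal illustration, not the exact implementation of the cited works; the function names are ours, and it assumes the local descriptors (e.g., 128-D SIFT vectors) have already been extracted.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=500):
    """Cluster the pooled local descriptors of the whole dataset;
    the k cluster centers form the visual vocabulary."""
    kmeans = KMeans(n_clusters=k, n_init=10).fit(all_descriptors)
    return kmeans.cluster_centers_

def bovw_signature(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (Euclidean
    distance) and return the normalized frequency histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)
```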
In [39], a framework based on color features and texture analysis is presented. [40] introduces an effective image indexing technique where the features are extracted from discrete cosine transform (DCT) coefficients. [41] proposes a discriminative EODH descriptor with strong rotation-invariant and scale-invariant features. [42] presents a discriminative image descriptor based on both contour and color information. Inspired by the bag of visual features, [43] proposes an image signature using spatial information. [44] proposes a combination of HSV color moments and the gray-level co-occurrence matrix for a robust CBIR system. In [45], the proposed technique applies the texton layout to identify and extract the consistent zone of an image, then computes the dominant color descriptor feature on the pixels in this zone.
On the other hand, deep learning has proved very useful in computer vision applications. In particular, convolutional neural networks (CNN or ConvNet) are commonly applied to analyze image content. The architecture of a CNN is composed of a set of layers, the major ones being the input layer, the hidden layers and the output layer. In early CNN-based approaches, the CBIR problem was addressed through classification models. Many CNN architectures have been proposed, including AlexNet [46], VGGNet [47], GoogleNet [48] and ResNet [49]. The fully connected layer (feature layer) is usually found toward the end of the CNN architecture; it outputs a 4096-dimensional floating point vector that describes the image features (color, shape, texture, etc.). The works in [50][51][52] present CNNs for multi-label image categorization instead of assigning a single label per image. The similarity between two images is computed with the L2 metric between the feature vectors extracted from the feature layer, and the evaluation is based on the mean average precision (MAP). NetVLAD [53], inspired by VLAD, is a CNN architecture used for image retrieval. [54] reduces the training time and provides an improvement in accuracy. PCA is frequently used in CBIR applications thanks to its ability to reduce the descriptor dimension without losing accuracy. [55] uses a convolutional neural network (CNN) to train the network and a support vector machine (SVM) to train the hyperplane, then computes the distance between the image features and the trained hyperplane. [56] introduces a novel neural network which uses heterogeneous superpixels to facilitate image object relational analysis. Based on a neural network architecture for content-based image retrieval, [57] proposes an efficient feature extraction method. Recently, convolutional neural networks (CNNs) have become more efficient for image retrieval tasks. In [58], the introduced model uses a ResNet50 with co-occurrence matrix (RCM) model for CBIR. The authors in [59] propose an image signature based on the VGG16 model. Recently, several authors [60][61][62][63][64][65] have proposed new detectors and descriptors based on deep learning which can replace classical local features. Their utilization is becoming increasingly frequent in computer vision applications. Moreover, a CNN can provide a global descriptor of an image, in the same spirit as hand-crafted descriptors such as LBP [22]. The works [66][67][68] transform an input image into a global representation. Descriptors based on deep learning have been shown to be more robust to rotation and illumination changes than classical descriptors.
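As an illustration of how such a feature-layer descriptor can be used for retrieval, the sketch below extracts the 4096-D fully connected output of a pretrained VGG16 and ranks database images by L2 distance. It is a minimal example assuming a recent torchvision release and properly normalized 224×224 inputs; it is not the specific pipeline of any of the cited works.

```python
import torch
import torchvision.models as models

# Pretrained VGG16; drop the final classification layer so the
# output is the 4096-D fully connected (feature-layer) vector.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

@torch.no_grad()
def global_descriptor(batch):
    """batch: (B, 3, 224, 224) normalized images -> (B, 4096) descriptors."""
    return vgg(batch)

def rank_by_l2(query_vec, database_vecs):
    """Return database indices sorted by L2 distance to the query vector."""
    d = torch.cdist(query_vec.unsqueeze(0), database_vecs).squeeze(0)
    return torch.argsort(d)
```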

Contributions
The majority of CBIR systems describe the image as an N-dimensional vector. Bag of visual words [7] represents the image as a frequency histogram of the visual vocabulary present in the image. In deep learning approaches, the image signature is a vector of N floats extracted from the feature layer.
In this section, we present a signature construction framework. Our aim is to improve the image representation without prior knowledge of the images. The efficiency of any CBIR system depends on the robustness of the image signature. Figure 3 presents the different steps of our global framework. Motivated by the recent successes of deep learning, in particular convolutional neural networks (CNN), we propose three different methodologies to construct the image signature: (1) a signature combining interest points and semantic information, denoted bag of semantic visual words (BoSW), (2) a signature combining visual features and semantic information, denoted bag of semantic labels (BoSL) and (3) a signature depending only on semantic information, denoted bag of semantic proportions (BoSP). After building the image signatures, we improve the CBIR algorithm with a semantic filter. First, we classify the images based on their semantic content. We check that the candidate shares the same class labels as the query. If this is the case, we consider the candidate as a true candidate and proceed with the distance computation step. Otherwise, the images are semantically different and we can prune the candidate. This semantic filter decreases the CBIR computation time and increases the CBIR accuracy. Finally, we compute the distance between the query and the selected candidates using the L2 distance.

Bag of semantic visual words: BoSW
In this section, we present a new way to construct an image signature. We need two main elements: visual feature descriptors and the semantic segmentation output (2D map). By combining this information, we build a robust feature description for an image that takes into account both its semantic and visual content.
We define the signature as an M × N matrix, where the width N corresponds to the size of the descriptor (128 for SIFT) and the height M corresponds to the number of classes on which the network was trained. Figure 4 describes the different steps of our approach. The construction process is composed of three different steps: (i) detection and extraction of the visual features, (ii) extraction of semantic information and (iii) clustering of the key points by class label and computation of the cluster centers. To compute a center, for each class label present in the image we select the set of key points that belong to it and we apply the clustering algorithm (K-means) with K = 1 (i.e., the average of the key point descriptors). Consequently, each class label is represented by a vector of N floats, denoted semantic visual word Svw_i. Finally, the image signature is composed of M semantic visual words that represent the class labels present in the image. An image does not necessarily contain all classes; in this case, we assign a null vector to the missing classes.
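A minimal sketch of this construction is given below. It assumes the key points and their descriptors have already been extracted and that a per-pixel label map is available from the segmentation network; the function name and array layout are our own illustrative choices.

```python
import numpy as np

def bosw_signature(keypoints, descriptors, label_map, num_classes):
    """keypoints: (P, 2) array of (x, y) pixel coordinates,
    descriptors: (P, N) local descriptors (e.g., N = 128 for SIFT),
    label_map: (H, W) per-pixel class labels from the segmentation net.
    Returns an (num_classes, N) matrix with one mean descriptor
    ("semantic visual word") per class and zero rows for absent classes."""
    n = descriptors.shape[1]
    signature = np.zeros((num_classes, n))
    labels = label_map[keypoints[:, 1].astype(int), keypoints[:, 0].astype(int)]
    for c in np.unique(labels):
        # K-means with K = 1 is simply the mean of the class's descriptors.
        signature[c] = descriptors[labels == c].mean(axis=0)
    return signature
```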

Bag of semantic labels: BoSL
Inspired by the bag of visual words [7], we propose a bag of semantic labels that describes the occurrence of semantic labels within an image. Our approach involves two main steps. First, we detect the interest points of the input image. Then, we project the pixel coordinates (x, y) of the detected points onto the segmented image given by the semantic segmentation network (Fig. 5). As a result, we obtain for each image a frequency histogram of the semantic labels present in the image. The vector size corresponds to the number of classes on which the network was trained. An image does not necessarily contain all the semantic classes known by the network; in this case, we assign a null value to the cells of the missing classes.
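The following sketch, under the same assumptions as the previous one (pre-detected interest points and an integer per-pixel label map), illustrates how such a histogram might be computed; the names are illustrative.

```python
import numpy as np

def bosl_signature(keypoints, label_map, num_classes):
    """Project each detected interest point (x, y) onto the segmentation
    map and count the semantic labels; absent classes keep a zero bin."""
    labels = label_map[keypoints[:, 1].astype(int), keypoints[:, 0].astype(int)]
    return np.bincount(labels.astype(int), minlength=num_classes).astype(float)
```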

Bag of semantic proportions: BoSP
Deep learning-based semantic segmentation networks output a 2D map that associates a semantic label (class) with each pixel. From this output, we can know which objects are in the image and their proportions. Thus, relying only on the CNN output, we exploit the semantic segmentation information to build a purely semantic signature, the ''bag of semantic proportions,'' for image similarity. Since the term similar here means ''with the same semantic content,'' our signature compares the images according to their semantic content. The construction process needs only the 2D map produced by the deep semantic architecture. As shown in Fig. 7, given a segmented image we divide it into N subimages, each one representing a semantic object in the image. In the next step, we create a feature vector whose size corresponds to the number of classes on which the network was trained, where each element contains the proportion of a semantic object as a percentage. Also here, we assign a null value to the entries of the classes that are not present in the image.
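Since BoSP depends only on the label map, its computation reduces to counting pixels per class. A minimal sketch is shown below, assuming an integer label map and using illustrative names.

```python
import numpy as np

def bosp_signature(label_map, num_classes):
    """Proportion (in %) of the image covered by each semantic class,
    computed directly from the 2D segmentation output; classes absent
    from the image keep a value of zero."""
    counts = np.bincount(label_map.ravel().astype(int), minlength=num_classes)
    return 100.0 * counts / label_map.size
```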

Semantic filter
Semantic segmentation indicates which objects exist in the image. Using this information, we propose a semantic test to check the semantic similarity between two images. That is, we check whether two images (the candidate and the query) share the same semantic classes. If this is the case, we proceed with the distance computation step. Otherwise, the images are semantically different and we can prune the candidate. This checking phase decreases the CBIR time by keeping only the images that have the same semantic content as the query. This step reduces the computational complexity from O(n²) to O(n log n) and also increases the CBIR accuracy. According to the experiments, the semantic filter provides an increase in MAP score of between 4% and 6%. As an illustration, Fig. 7 shows how we exploit the semantic filter to neutralize and penalize the images considered semantically different. In our example, the predicted classes in the input query are grass and dog. We exploit this information to find the images that share the same semantic classes as the query. Once we get the list of images considered as true candidates, we compute the distance between them and the query. For the dissimilar images, we assign a negative score in order to neutralize them in the retrieval step.
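A possible reading of this filter is sketched below: candidates whose predicted label set differs from the query's are pruned before the L2 distances are computed. The exact-match test and the function names are our assumptions; a shared-subset test would be an equally plausible interpretation of the description above.

```python
import numpy as np

def retrieve(query_sig, query_classes, db_sigs, db_classes, top_k=10):
    """query_classes / db_classes: sets of class labels predicted by the
    segmentation network; db_sigs: image signatures (BoSW, BoSL or BoSP).
    Candidates that fail the semantic test are pruned (neutralized) and
    never enter the distance computation."""
    kept = [i for i, c in enumerate(db_classes) if c == query_classes]
    dists = [(np.linalg.norm(query_sig - db_sigs[i]), i) for i in kept]
    return [i for _, i in sorted(dists)[:top_k]]
```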

Benchmark datasets for retrieval
In this section, we present the potential of our approach on eight different datasets. Our goal is to increase the CBIR accuracy and reduce the execution time. To evaluate our proposition, we test on the following datasets:
• Corel 1K or Wang [69] is a dataset of 1000 images divided into 10 categories (see Fig. 8), each category containing 100 images. The evaluation is computed as the average precision of the first 100 nearest neighbors among 1000 (Fig. 9).
• Corel 10K [70] is a dataset of 10,000 images divided into 100 categories, each category containing 100 images. The evaluation is computed as the average precision of the first 100 nearest neighbors among 10,000.
• Corel 5K [70] is a dataset of 5,000 images divided into 50 categories, each category containing 100 images. The evaluation is computed as the average precision of the first 100 nearest neighbors among 5,000.

Training datasets for semantic segmentation
Many semantic segmentation datasets have been proposed in recent years, such as Cityscapes [73], Mapillary [74], CoCo [75], ADE20K [76], CoCo-Stuff [77], MSeg [78] and others. These datasets contain two main types of classes: things and stuff. Thing classes have characteristic shapes, like vehicle, dog or computer. Stuff classes describe amorphous regions, like sea, sky or tree.
To segment an image, we use the recent High-Resolution Network (HRNet) [79] with HRNetV2-W18 as backbone. We choose this network because of its superior results compared to older networks and its ability to produce a high-resolution representation of an input image. This architecture is trained on the collection of datasets listed in Table 1.
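For reference, the only thing the signatures above require from the segmentation network is the 2D label map. A generic way to obtain it from per-pixel class scores is sketched below; the interface is an assumption, and HRNet implementations may return multi-scale or lower-resolution outputs that need to be resized to the image resolution first.

```python
import torch

@torch.no_grad()
def label_map(seg_model, image):
    """image: (3, H, W) normalized tensor. Returns the (H, W) map of
    per-pixel class labels used by BoSW, BoSL and BoSP, assuming the
    model outputs a (1, C, H, W) tensor of class scores."""
    logits = seg_model(image.unsqueeze(0))
    return logits.argmax(dim=1).squeeze(0)
```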

Results on benchmark datasets for retrieval
In Table 2, we present the results obtained with the bag of semantic visual words approach (BoSW). We conducted our experiments by training the segmentation network on six semantic datasets (Table 1). We then tested BoSW on eight benchmark datasets. We extract the visual features with floating point-based descriptors (KAZE, HOG, SURF). Three different extensions of the KAZE descriptor have been used in this work (K_region, K_edge and K_sharpedge). Moreover, we have additionally tested our methodology using the deep learning-based local detector and descriptor SuperPoint [62].
In Table 3, we present the results obtained for the bag of semantic labels method (BoSL). With the same setup, we conducted our experiment by using six semantic datasets for training the segmentation network (Table 1). Then we tested on eight retrieval datasets. We detect the interest points using seven different detectors (SURF, KAZE, Harris, FAST, MinEigen, MSER and SuperPoint).
In Table 4, we present the results obtained with the fully semantic signature (BoSP). What distinguishes this method is that it can quickly classify an image based only on the semantic segmentation, without any additional information. Among the three proposed signatures, we obtained the best scores with the BoSW method. However, BoSL and BoSP have shown close results, between 1% and 3% below those of BoSW (and in a few cases better than BoSW). Moreover, these two methods are faster and easier to implement than BoSW. Table 5 summarizes the execution times of the different steps of our framework. We conclude that BoSW takes more time for signature construction than BoSP and BoSL because of the computation of the semantic visual words. The BoSW method depends on descriptors, so when the descriptor size increases, the signature creation time also increases. For each query from the Corel-5K/Corel-10K datasets, the evaluation is computed as the average precision of the first 100 nearest neighbors among all images in the dataset. In Fig. 10, the experiments are made for different numbers of retrieved images (10, 20, ..., 100).
In the top part of Fig. 10, we present the MAP score obtained by the bag of semantic visual words (BoSW) method with the segmentation network trained on CoCo-Stuff. The results are obtained using five different descriptors (SIFT, SURF, KAZE, HOG, SuperPoint). For the KAZE descriptor, we complete the evaluations with three extensions where the points are detected with three different detectors (edge, region and sharp edge). The best results were obtained with the SuperPoint descriptor. In the middle part of Fig. 10, we present the MAP score obtained by the bag of semantic labels (BoSL) method with the segmentation network trained on CoCo-Stuff. The results are shown for several detectors (SURF, KAZE, Harris, FAST, MinEigen, MSER, SuperPoint). We notice that the best results were obtained with the KAZE detector. In the bottom part of Fig. 10, we compare the results obtained with the different semantic training datasets. We notice that the best results were obtained with the CoCo-Stuff dataset. On the other hand, the worst results were obtained with the Pascal Context and Mapillary datasets. In Fig. 11, we present the average precision (AP) for each class of the NUS-WIDE dataset using BoSP. We have also run experiments with a support vector machine (SVM) with a linear kernel using the histograms computed with the BoSP method. For a detailed comparison, the confusion matrix and ROC curve for the Corel dataset with 10 categories are displayed in Fig. 12. As shown, the results of the experiment with the SVM confirm the robustness of our method.

Comparisons with the state of the art
In order to test the efficiency of our methods, we conducted experiments on eight retrieval datasets. We divided the state of the art into two main categories: (i) local visual feature approaches, i.e., methods based on local features (texture, color, shape), including derived methods such as BoVW, VLAD and Fisher vectors, and (ii) deep learning approaches, i.e., methods based on learning the features with deep learning algorithms. In Table 6, we compare our results with a large selection of state-of-the-art methods.
Following conventional settings [51,52] for NUS-WIDE, we report the following statistics: the average overall precision (P), the average overall recall (R) and the F1-measure (F1). For each query, the 3 closest images are retrieved; those with a distance to the query higher than 0.5 are eliminated. In Table 7, we present the quantitative results obtained by our method compared with state-of-the-art methods on the NUS-WIDE dataset.
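As an illustration of this protocol, the sketch below computes the per-query precision, recall and F1 over the (at most three) retrieved images that survive the 0.5 distance cut-off. The relevance criterion, counting a retrieved image as correct when it shares at least one label with the multi-label query, is our assumption; the names are illustrative.

```python
def precision_recall_f1(retrieved_labels, query_labels, relevant_total):
    """retrieved_labels: label sets of the images kept after the distance
    threshold; query_labels: label set of the query; relevant_total:
    number of relevant images in the database for this query."""
    correct = sum(1 for labels in retrieved_labels if labels & query_labels)
    p = correct / len(retrieved_labels) if retrieved_labels else 0.0
    r = correct / relevant_total if relevant_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```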
In Fig. 13, we compare the mean average precision (MAP) of the top 5 and top 10 retrieved images for all categories of the Wang [69] dataset between our method (BoSP) and a CNN-based method [59]. Our method presents good performance in almost all categories. In Table 8 and Fig. 14, similar comparisons for the top 20 retrieved images are shown with a wide selection of state-of-the-art methods [9-12, 30, 39-41, 57]. For these comparisons, our method is the BoSW method, whose signature was built on the basis of a CNN trained on CoCo-Stuff.
In Fig. 15, we compare our approach (BoSW) with methods based on color [31,32,34] or texture [35,36] histograms for different numbers of retrieved images (10, 20, ..., 100). We show experimentally on the Corel-5K and Corel-10K datasets that the BoSW method offers an improvement over these standard approaches in terms of accuracy.
It is important to compare the runtime of the proposed methods with both deep learning and local visual state-of-the-art methods. In Table 9, we compare only the time taken by each method for signature construction. Our BoSP and BoSL signatures are more than 135 times smaller than those built on the VGG-16 architecture and 90 times smaller than those extracted with the ResNet architecture. In addition, we obtain a vector length smaller than all the methods cited in Table 9. This is an advantage in terms of search time and memory. The vector length of the proposed methods is 64 when the descriptor used during signature construction is SURF or KAZE. For VLAD [15], N-BoVW [27] and BoVW [7], the length of the vector depends on the number K of visual words computed with the K-means algorithm.
Based on semantic content, we show some examples (Fig. 16) of bag of semantic proportions (BoSP) output. For queries from different categories of different datasets (Corel 1K, MSRC V1), we show the three nearest neighbors obtained with the segmentation network trained on the CoCo-Stuff dataset.
In Tables 10 and 11, we present the confusion matrices obtained by the methods of [7,33] on the Corel 1000 dataset. We then compare our confusion matrix obtained with BoSP (see Fig. 17) against four methods from the state of the art [5,7,15,33]. Our confusion matrix shows the best results in all categories except the dinosaur class for the method of [33]. In addition, we compare the ROC curves and confusion matrices between our method (BoSP) and [7,15] in Fig. 17. We show experimentally that our proposals give better outcomes than standard image retrieval approaches.

Conclusion
We have presented in this paper an efficient framework for CBIR tasks and image classification. We leverage the discriminative information provided by a semantic segmentation CNN in the retrieval context to propose three different methodologies. Based on semantic content combined with local visual features, our propositions have shown that the use of semantic content improves the retrieval accuracy. Another contribution of this paper is the proposed semantic filter. It allows the proposed framework to reduce the error rate and speed up the comparison between images. Using different descriptors, detectors and semantic datasets, our approach achieves better results in terms of accuracy and computation time than the state-of-the-art methods.