Bag-of-words-for-scene-classification

Compare various feature extraction techniques and classify scenes using Bag of Words (BoW)

Evaluating Bag of Features in Scene Classification

     The demand for automation in today's world has driven interest in unsupervised learning. Bag of Words is one such unsupervised technique; it dates back only about a decade, yet research in this area has been extensive. This project presents a detailed study of how feature detection and extraction techniques affect the Bag of Words application. Previous studies have not attempted a detailed analysis of dimensional, time and space complexity. We also present readers with different scenarios in which one particular method is preferable over another.

Introduction

     Scene classification is one of the fastest advancing areas in the field of image processing. Its uses in weather forecasting, browsing for a particular class of images, recognition of objects within a scene, and medical diagnosis drive the demand for efficient scene classification algorithms. Where scene classification deals with images, text mining deals with textual words. With large-scale collections of data, efficient and robust classification techniques are needed for easy search. Scene classification of images depends mainly on the texture, transformation and color of the image, and efficient local feature detection and extraction techniques are necessary to capture these details.

     Researchers have come up with various feature detection and extraction algorithms. In this paper, we implement some prominent local feature extraction methods. Feature extraction is the retrieval of important information that describes the system, often better than the system itself. We believe each feature extraction technique suits best the system it is designed for; therefore, this report attempts to provide this correlation between system and extraction technique. We then experiment with our features, providing dimensional and time analysis, and show a real-world application of feature extraction in scene classification.

     Bag of Words, also known as Bag of Visual Words when applied to images, is an unsupervised classification method. Bag of Words clusters histograms of image parts. The term dictionary in Bag of Words is analogous to a linguistic dictionary, in that it contains a histogram representation of image parts, and the term bag of words symbolizes clustering these histograms into classes; different classes then represent the classification of images. This project also presents the dimensional complexity of scene classification using Bag of Words, which most researchers have not dealt with. Our motivation for this project grew from the ImageNet scene classification challenge, and it is our starting attempt at classifying images with the support of Dr. Dapeng Wu. We believe this paper will benefit readers comparing various feature detection and extraction methods, unsupervised learning with Bag of Words, and scene classification. We hope the ample visual content aids understanding.

Related Work

     Scene classification is a rigorously researched area. Vailaya et al. [1] are regarded as among the forerunners in using global features to classify images. A few prominent research papers in the field of bag of words are worth mentioning:

First, Lazebnik et al. [3] segregated the image into grids and built a spatial pyramid model by taking histograms of these grids. Second, Monay et al. [4] used a probabilistic approach, building on the probability of classification using BoW to find the semantic difference between natural and man-made scenes. Third, Boutell et al. [5] configured pairwise relationships between different scenes. A paper by van Gemert et al. [6] presented a similar concept of occurrence histograms.

     Comparisons of local feature extraction and detection are found in abundance. The paper by Chao et al. [2] compared various feature detectors and descriptors for image retrieval with JPEG-encoded query images, and the paper by Patel et al. [7] compared various feature detectors and descriptors for real-time face tracking. Additionally, stand-alone research papers on local feature detectors and descriptors helped us better understand and apply them in our project. In 2004, David G. Lowe introduced SIFT as both a detector and a descriptor. As an improvement over SIFT, Bay et al. [8] introduced SURF. Later, in 2006, Rosten et al. [9] introduced the FAST algorithm for real-time applications.

     As we can see, the comparison of detectors and descriptors for BoW still has a long way to go. We also observe that scene classification using BoW produces better results with local feature extraction than with global features. Therefore, in our report, we explore various local feature extraction techniques and aim to provide the best design specific to the system.

Methods

     The first step in scene classification is to detect and extract features. We evaluate various feature detection and extraction algorithms on several databases: the UIUC sports database, the MIT indoor database, Scenes-15 and Mostajabi's workshop database. We further explore FLANN-based and brute-force matching using Hamming distance. With feature vectors as input, we implement unsupervised classification using Bag of Visual Words, evaluated for various combinations of feature extraction and detection algorithms. The machine learning classifiers used are K-means and SVM; K-means is an unsupervised classifier, whereas SVM is a supervised classifier.

A.   Feature Detection and Extraction

     Feature detection is corner or blob detection in an image. The detected points are called key regions of interest, as they can be located even under transformations of the image. A feature extractor then computes feature vectors on these key regions of interest. In our project we use well-known detection methods (SIFT, SURF, FAST, STAR, MSER, GFFT and DENSE) and extraction methods (SIFT, SURF, BRIEF).

compare

Right, from top: SIFT, SURF, BRIEF, FAST; left, from top: STAR, MSER, GFFT, DENSE feature detection/extraction methods.
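As a minimal sketch of how these detectors and extractors can be instantiated and fused (assuming Python with an OpenCV 3.x contrib build, where SIFT, SURF and STAR live under cv2.xfeatures2d; the image path is a placeholder):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# Detectors (OpenCV 3.x API; SIFT, SURF and STAR require the contrib modules).
sift = cv2.xfeatures2d.SIFT_create()
surf = cv2.xfeatures2d.SURF_create()
star = cv2.xfeatures2d.StarDetector_create()
fast = cv2.FastFeatureDetector_create()
mser = cv2.MSER_create()
gftt = cv2.GFTTDetector_create()
orb = cv2.ORB_create()
brisk = cv2.BRISK_create()
# BRIEF is available as an extractor via cv2.xfeatures2d.BriefDescriptorExtractor_create().

# Detect key points with one method and describe them with another (detector/extractor fusion).
keypoints = fast.detect(img, None)                      # FAST detector
keypoints, descriptors = sift.compute(img, keypoints)   # SIFT extractor
print(len(keypoints), descriptors.shape)                # n key points, n x 128 SIFT descriptors
```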

1.   Scale-invariant feature transform - SIFT

     The three general steps of key point detection are: first, find sampling points, called key regions or special regions; second, map the key points in order to achieve rotational invariance; and last, refine the key points, discarding unimportant ones.

     The SIFT feature detector, introduced by David G. Lowe, is an improvement over the Harris corner detector: it detects corners even in rotated and scaled images. There are three steps in SIFT feature detection. First, a scale space is built using difference-of-Gaussian filters; since different image scales require different window sizes, the window size is chosen based on the image scale and the local extrema of the image are found. Second, key point localization refines these extrema by thresholding. Third, each key point is made invariant to orientation by constructing an orientation map for it.

For feature extraction, the 16 x 16 neighborhood of a key point is considered and divided into 4 x 4 sub-blocks, giving a total of 16 regions. With 8 histogram bins per region, this gives 128 dimensions per key point. Over the resulting 128 x n matrix, with n the number of key points, rotational and illumination invariance techniques are applied.

2.   Speeded up Robust features – SURF

     SURF is a faster version of SIFT; however, our experiments show the SURF feature extractor to be less efficient than SIFT. Compared to SIFT, SURF approximates the Laplacian of Gaussian (LoG) with box filters. Using box filters with integral images, the convolution can easily be computed in parallel for all key points. To achieve scale invariance SURF uses the Hessian matrix, and it uses Gaussian-weighted Haar wavelet responses in the horizontal and vertical directions to achieve rotational invariance.

     For feature extraction, SURF uses Haar wavelet responses. Around each detected key point, a 20 x 20 neighborhood is considered and segregated into 4 x 4 sub-regions. Applying the vector V = (∑dx, ∑dy, ∑|dx|, ∑|dy|) to each sub-region yields a 64-dimensional feature vector per key point.

3.   Features from accelerated segment test - FAST

     The FAST feature detection algorithm, proposed by Rosten et al. [9], emerged for real-time, high-speed applications. In FAST, key points are selected by thresholding: for a pixel 'p', the 16 pixels around it are considered in circular fashion, and their intensities are compared with that of 'p'. The pixel 'p' is considered a key point only if a contiguous arc of these circle pixels is all brighter or all darker than 'p' by the threshold. A faster preliminary corner test examines only four of the surrounding pixels, at least three of which must pass the same comparison.

     There are certain limitations to FAST feature detection: the high-speed preliminary test can reject valid corner candidates, and the key points it finds tend to lie too close to each other. The latter limitation can be resolved with non-maximal suppression, where the distance between key points is kept above a constant threshold. With these refinements, FAST excels over Harris corner detection in detecting corners quickly.
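A short, hedged sketch of FAST with and without non-maximal suppression (OpenCV assumed as above; the threshold value is illustrative, not the one used in our experiments):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# Threshold controls how much brighter/darker the circle pixels must be than 'p';
# non-maximal suppression removes key points crowded next to each other.
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
kp_suppressed = fast.detect(img, None)

fast.setNonmaxSuppression(False)
kp_all = fast.detect(img, None)

print(len(kp_all), "key points without suppression,", len(kp_suppressed), "with suppression")
```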

4.   Maximally Stable Extremal Regions - MSER

     MSER was introduced by Matas et al. [10] and considers correspondences between images. MSER is known to work better for blurred images and achieves higher classification results for images with different illumination levels. Since MSER considers correspondences, it outputs key regions instead of key points, and thresholding ensures that stable regions are selected over unstable ones. Multi-scale detection of key regions with a multi-scale pyramid helps achieve scale invariance. Overall, MSER is one of the strongest feature detectors, as it is invariant to blur, illumination and scale.

5.   Oriented FAST and rotated BRIEF – ORB

     ORB was first introduced by Rublee et al. [11] as an alternative to SIFT and SURF. ORB is essentially a fusion of FAST and BRIEF. In the first step it runs the FAST algorithm together with the Harris corner measure: FAST finds the key points and the Harris measure ranks them. Rotational invariance is then achieved by computing the intensity centroid of each key point's patch; the direction from the patch center to this centroid gives the key point's orientation. Therefore, ORB implements FAST and adds orientation information.

6.   Good features to track – GFFT

     GFFT (Good Features to Track) was introduced by Shi and Tomasi to detect corners. A gradient matrix is computed for each pixel using integral values of the image, so the computation cost per pixel is constant. Local maxima of the corner score select the optimal features. However, motion effects introduce errors (the aperture problem), and even though thresholding and non-maximal suppression are used, GFFT performs poorly for images with motion effects.
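A possible GFFT sketch using OpenCV's Shi-Tomasi implementation (all parameter values below are illustrative only):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# Shi-Tomasi corners: keep at most 500 corners whose score is at least 1% of the
# strongest corner, separated by at least 10 pixels.
corners = cv2.goodFeaturesToTrack(img, maxCorners=500, qualityLevel=0.01, minDistance=10)

# The same detector wrapped as a Feature2D, so it can be fused with a SIFT/SURF extractor.
gftt = cv2.GFTTDetector_create(maxCorners=500, qualityLevel=0.01, minDistance=10)
keypoints = gftt.detect(img, None)
print(len(keypoints), "GFFT key points")
```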

7.   DENSE features

     DENSE feature detection provides good coverage of the entire image. Key points are placed at fixed steps, and the pixels whose contrast varies the most are considered. Stepwise sampling helps extract all the important information of an image, and even more information can be extracted by overlapping the patches. However, all of this increases dimensional complexity.
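OpenCV 3.x and later no longer ship a built-in Dense detector, so a minimal sketch is to place key points on a regular grid by hand and describe them with SIFT (step and size values are illustrative):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input
step = 10   # sampling step in pixels
size = 10   # key point diameter handed to the extractor

# Dense sampling: place a key point at every grid position, regardless of image content.
keypoints = [cv2.KeyPoint(float(x), float(y), size)
             for y in range(0, img.shape[0], step)
             for x in range(0, img.shape[1], step)]

sift = cv2.xfeatures2d.SIFT_create()            # contrib build assumed, as above
keypoints, descriptors = sift.compute(img, keypoints)
print(descriptors.shape)                        # grows quickly -> high dimensional complexity
```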

8.   Binary Robust Invariant Scalable Key points – BRISK

     BRISK, unlike ORB, has a hand-crafted sampling pattern, as shown in Fig. 4. A Gaussian filter is applied to the different sampling regions separately. A distance threshold then separates the sampling pairs into short-distance, long-distance and unused pairs, as shown in Fig. 5; short-distance pairs are used for the intensity comparisons of the descriptor, whereas long-distance pairs are used to find orientation. Like ORB, BRISK performs better under viewpoint change.

9.   STAR feature detection

     Center Surround Extrema (CenSurE), the basis of the STAR detector, was introduced by Agrawal et al. STAR uses two rotated squares to approximate the bi-level center-surround filter and considers 7 scales of the image. Unlike other feature detection techniques, the sampling size is fixed: for each of the 7 scales, the full spatial resolution of the image is used. Detection is followed by edge suppression using the gradient matrix and line suppression.

table1 table2

B.   Feature Matching

1.   Brute Force Matcher – BF Matcher

     The brute-force matcher takes the descriptor of one feature in an image and iterates over the features of the other image until it finds the matching key point. Here we use Hamming distance. The brute-force matcher increases time complexity, as it iterates over all of the other image's key points.
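A minimal brute-force matching sketch with Hamming distance (ORB's binary descriptors are used here because Hamming distance applies to binary descriptors; SIFT/SURF would use NORM_L2 instead; image paths are placeholders):

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical pair of images
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance for binary descriptors; crossCheck keeps only mutual best matches.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "matches; best distance:", matches[0].distance)
```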

2.   FLANN –Fast Library for Approximate Nearest Neighbors

     Unlike the BF matcher, FLANN does not iterate over all features of the image; using search trees, it performs an approximate search. For high-dimensional features and large datasets, FLANN provides built-in fast nearest-neighbor search of key points. Comparative studies between the BF matcher and FLANN show the BF matcher to be more accurate, but the FLANN-based matcher is faster, even for high-dimensional key points and large datasets. Since the Bag of Visual Words in our project clusters over a million features, we stick to the FLANN matcher.
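A corresponding FLANN sketch for float descriptors such as SIFT/SURF (the KD-tree index and search parameters below are common defaults, not values tuned for this project):

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical pair of images
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.xfeatures2d.SIFT_create()                     # contrib build assumed
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# FLANN with KD-trees suits float descriptors like SIFT/SURF.
index_params = dict(algorithm=1, trees=5)   # 1 = FLANN_INDEX_KDTREE
search_params = dict(checks=50)
flann = cv2.FlannBasedMatcher(index_params, search_params)

matches = flann.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # Lowe's ratio test
print(len(good), "good matches")
```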

C.   Bag of Words

     Bag of Words is a well-known method for classifying images and text. It originated from text mining, where the frequency of dictionary words is counted. In contrast to supervised classification techniques, Bag of Words learns the frequency of occurrence and the underlying pattern on its own and generates a codebook. This codebook is simply the histogram representation of each part, as shown in Fig. 5, and is analogous to the words in a dictionary: each word is a cluster of similar histogram patterns. The sketch below shows how such a codebook can be built; afterwards we discuss the internal mathematical implementation of Bag of Words.
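A hedged sketch of codebook construction with OpenCV's BOWKMeansTrainer and BOWImgDescriptorExtractor (image paths and the dictionary size are placeholders; this illustrates the idea rather than our exact pipeline):

```python
import cv2

train_images = ["class_a/img1.jpg", "class_a/img2.jpg", "class_b/img1.jpg"]  # hypothetical
dictionary_size = 100                           # number of visual words

# Step 1: cluster SIFT descriptors from the training images into visual words with k-means.
sift = cv2.xfeatures2d.SIFT_create()
bow_trainer = cv2.BOWKMeansTrainer(dictionary_size)
for path in train_images:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    kp, des = sift.detectAndCompute(img, None)
    if des is not None:
        bow_trainer.add(des)
vocabulary = bow_trainer.cluster()              # dictionary_size x 128 cluster centres

# Step 2: encode any image as a normalized histogram over the visual words.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
bow_extractor = cv2.BOWImgDescriptorExtractor(sift, flann)
bow_extractor.setVocabulary(vocabulary)

img = cv2.imread(train_images[0], cv2.IMREAD_GRAYSCALE)
hist = bow_extractor.compute(img, sift.detect(img, None))   # 1 x dictionary_size histogram
print(hist.shape)
```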

     Bag of Words is also used in object recognition, which is analogous to scene classification. Internally, it is evaluated using a confusion matrix. The Bayes model developed in Natural Language Processing (NLP) can be used for image classification as well.

     Let ‘c’ be the image category, ‘V’ the size of the codebook and ‘w’ a V-dimensional vector. An image is then represented as

     W = [w1, w2, ..., wn]

where W is the representation of the image using all the patches ‘w’ in the image.

     A Naïve Bayes classifier is used for categorization. The Bag of Words classifier has to learn the categorization of each image part, for example nose and mouth for a face, or wheel and brake for a bike. This object-level categorization is made possible by the Naïve Bayes classifier.
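Under the Naïve Bayes independence assumption (a standard formulation, stated here for completeness), the predicted category c* for an image represented by visual words w1, ..., wn is

     c* = argmax over c of P(c) · P(w1|c) · P(w2|c) · ... · P(wn|c)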

     Higher-level extensions of Bag of Words include VLAD encoding and spatial pyramid pooling. Spatial pyramid pooling uses feature extraction at various resolutions. Using spatial pyramid pooling combined with a Convolutional Neural Network (CNN), M. Koskela achieved an ImageNet scene classification accuracy of around 91%; training took about 13 days per set.

D.   Classifier

     Different machine learning classifiers are governed by different mathematical principles. Let us first concentrate on the unsupervised classifier K-means. K-means partitions n observations into k clusters. The clusters can have different shapes, although two points within a cluster are at a comparable distance. Initial points, called code vectors, are chosen randomly; nearby points are then assigned to the cluster whose code vector they are closest to in Euclidean distance.

     K-means is a heuristic algorithm with a worst-case running time of 2^Ω(n), so fast convergence cannot be guaranteed. Due to this complexity, it takes far longer on large datasets and is not the best fit for them.

     Now consider supervised machine learning. There are two steps: first, training on the data, and second, predicting the class category. The most used supervised classifiers are KNN and SVM. SVM performs better in accuracy, while KNN is faster. For a given integer K, KNN classifies based only on the K nearest neighbors, whereas SVM considers the whole dataset and tries to separate it into as many classes as there are scene categories. SVM separates data using hyperplanes, but not all sets are linearly separable; hence SVM offers other functions for higher-dimensional spaces, called kernel functions, such as the quadratic function and the Gaussian Radial Basis Function (RBF). Recent evaluations of various SVM kernel functions found that SVM with the RBF kernel classifies best. However, the generalization error of the RBF kernel is said to increase with infinite dimensions, and when a dataset is small compared to the feature vector dimension, the situation resembles the infinite-dimensional case. Therefore, in our experiment we use the linear kernel function of SVM.
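For concreteness, a minimal linear-kernel SVM sketch using scikit-learn (our own pipeline may differ; the arrays below are random placeholders standing in for BoW histograms and class labels):

```python
import numpy as np
from sklearn.svm import SVC

# X: one BoW histogram per image (n_images x dictionary_size), y: integer class labels.
X = np.random.rand(240, 100).astype(np.float32)   # hypothetical: 240 images, 100 visual words
y = np.repeat(np.arange(4), 60)                    # hypothetical: 4 classes of 60 images

clf = SVC(kernel="linear", C=1.0)                  # linear kernel, as discussed above
clf.fit(X, y)
print(clf.predict(X[:5]))
```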

E.   Cross Validation

     Cross validation ensures that no data are used for training and testing at the same time. In this project, threefold cross validation is used to find the overall accuracy of a set. The data are divided into three parts: two thirds are used for training and the rest for testing. This is repeated three times, covering all of the data for both testing and training. After the accuracy of each fold is obtained, the overall accuracy is found by averaging.
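A matching threefold cross-validation sketch with scikit-learn, again on placeholder data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = np.random.rand(240, 100).astype(np.float32)   # hypothetical BoW histograms
y = np.repeat(np.arange(4), 60)                    # hypothetical labels

# Threefold split: each fold trains on 2/3 of the data and tests on the remaining 1/3;
# the overall accuracy is the mean of the three fold accuracies.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=3)
print(scores, scores.mean())
```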

EXPERIMENT

     Evaluation of the data included the creation of 5 different datasets with varied levels of complexity. Dataset-1 is the easiest classification problem, while Dataset-5 is the most complex.

A.   Dataset

     Testing of Bag of Words included the creation of a personal dataset, built by capturing frames of an ImageNet video sequence at different times. It consists of two classes, bird and car, with 30 images each, as shown in the figure. Throughout the experiments we refer to this as Dataset-1.

data-1

     With Dataset-2, from Mostajabi's workshop [12], we introduced a higher level of complexity. In total it includes four classes of images: airplane, bike, car and cheetah. Each class has around 60 images, so the size of the dataset is 240. Variations range from the object being placed against varied background scenes to many objects in one scene. As seen in the figure, we trained and tested on many-to-one data of cheetahs.

data-2

     Dataset-3 includes elements such as multiple objects, blur, far and close scenes, and different image dimensions. It was taken from the UIUC sports dataset [13], which contains a total of 8 sports: rowing, badminton, polo, bocce, snowboarding, croquet, sailing and rock climbing. We selected 6 of the 8 sports and arranged them to compute accuracy. Arranging the data is necessary in unsupervised learning to verify the classification results when no reference labels are provided by the database provider. Our set therefore consists of 6 classes of 120 images each.

data-3

     Dataset-4, by MIT [14], has one of the highest levels of complexity. We arranged the data into 10 classes, including: airport inside, bar, bowling, casino, elevator, inside subway, locker room, restaurant kitchen and warehouse. Images of the same class sometimes have very little similarity; the top row of the figure shows restaurant kitchens. For a computer, matching images with almost no similarity is possible only with neural networks. The arranged data has a total of 10 classes of 90 images each.

data-4

     Dataset-5 used the Scenes-15 collection [15] of over seven million images. We used an NVIDIA server with GPUs to run our algorithm; it took hours to run KNN, and SVM would have taken days. The results show BoW to be inefficient for data this large, so results are not reported for this dataset. Using a CNN, however, a classification accuracy of 70% was achieved.

B.   Evaluation

     As the first step in our experiment, we evaluate feature detectors and extractors. Feature detectors such as SIFT, BRISK, ORB, SURF, FAST, STAR, MSER, GFFT and DENSE are fused with the SIFT and SURF feature extractors, and the execution time of the different fusions is compared. The SURF and SIFT feature extractors produce almost the same results. As shown in the figure, taking normalized time along the positive y axis, SIFT, GFFT and MSER are efficient in time. However, this alone is not sufficient to evaluate the time complexity of the Bag of Words application.

compare1

Time taken for Feature detection.
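A hedged sketch of how such a detector/extractor timing comparison can be run (OpenCV contrib assumed; the image path is a placeholder and timings will of course differ from ours):

```python
import time
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical test image

detectors = {
    "SIFT": cv2.xfeatures2d.SIFT_create(),
    "SURF": cv2.xfeatures2d.SURF_create(),
    "FAST": cv2.FastFeatureDetector_create(),
    "MSER": cv2.MSER_create(),
    "GFFT": cv2.GFTTDetector_create(),
}
extractor = cv2.xfeatures2d.SIFT_create()   # swap in SURF_create() for the SURF extractor

for name, det in detectors.items():
    start = time.perf_counter()
    kp = det.detect(img, None)                # detector under test
    kp, des = extractor.compute(img, kp)      # fused with the chosen extractor
    elapsed = time.perf_counter() - start
    print(f"{name}+SIFT: {len(kp)} key points, {elapsed:.3f}s")
```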

Results

     Dataset-1 has images with only slight variation; therefore, using Bag of Words we can achieve an accuracy of 100%. The number of features clustered by K-means has to be considered in order to evaluate space dimensionality. The figure shows that DENSE has the highest number of clustered features: in our experiments, DENSE has a dimension of around 600,000 and FAST around 200,000, while the rest lie in the range of 1,000-5,000 (shown normalized). A low-end computing system fails to evaluate DENSE, so we discard DENSE in our further experiments.

![compare4](https://cloud.githubusercontent.com/assets/11435669/20867663/fd9fea90-ba17-11e6-8a70-bf95e8c7caa6.png)

Comparing dimensions of Feature detectors

     We once again fuse various feature detectors with the SIFT and SURF feature extractors, now for Dataset-2. In this analysis, the SIFT feature extractor performed better than the SURF feature extractor. However, because of space complexity, a few detectors such as GFFT cannot be paired with SIFT; on systems constrained by space, SURF works well even if it is less efficient, while otherwise the SIFT feature extractor produces better results. This comparison can be seen in the figure (top), where the accuracies obtained with SIFT and SURF are normalized. The figure (bottom) compares normalized error rates and dimensions for different feature detectors paired with the SIFT feature extractor; the lower the dimension and error rate, the better the performance. The FAST feature detector therefore does not seem to perform better and is ignored in our next experiment.

compare2 compare3

Comparison between SIFT and SURF (top); dimension vs. error rates of feature detectors with the SIFT feature extractor (bottom).

     For Dataset-3 and Dataset-4, results are specific to the given system; they are given in Table 3 (top) and Table 3 (bottom) respectively. For Dataset-3 we see that even though GFFT is faster, it is less efficient because of the motion effects in sports activities. It can also be seen that SURF-SIFT has high time complexity because the data is large. For Dataset-3, SURF-SURF is the most efficient combination.

     Comparing the results of Dataset-3 with Dataset-4, it is evident that, owing to the smaller amount of data in Dataset-4, SURF-SIFT is the most time efficient there. Also, because Dataset-4 is mostly composed of still objects with few motion effects, GFFT is the fastest with optimal efficiency.

table3

Conclusion

     From our observations, the choice of feature extractors and feature detectors is specific to the system. Previous studies showed SURF to perform better than SIFT; however, we believe this is true only for feature detection. From our observations, the fusion of the SURF feature detector and the SIFT feature extractor is the most efficient, although it also has higher time complexity. Therefore, with parallelization, SURF-SURF performs optimally.

     Additionally, for images with motion effects, GFFT performs worse than otherwise. To our knowledge, GFFT is the fastest performing feature detector, and it produces optimal results for data without motion effects.

References

[1]     A. Vailaya, M. Figueiredo, A. Jain, and H.J. Zhang. Image classification for content-based indexing. IEEE Trans. on Image Processing, 10(1):117–130, 2001.

[2]     J. Chao, A. Al-Nuaimi, G. Schroth, and E. Steinbach. Performance comparison of various feature detector-descriptor combinations for content-based image retrieval with JPEG-encoded query images.

[3]     S. Lazebnik, C. Schmid, J. Ponce. 2006. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. IEEE Computer Vision and Pattern Recognition, pp. 2169-2178.

[4]     F. Monay, P. Quelhas, J.-M. Odobez, and D. Gatica-Perez. Integrating co-occurrence and spatial contexts on patchbased scene segmentation. In CVPR, Beyond Patches Workshop, New York, NY, June 17–22, 2006.

[5]     M. R. Boutell, J. Luo, and C. M. Brown. Factor graphs for region-based whole-scene classification. In CVPR, Semantic Learning Workshop, New York, NY, June 17–22, 2006.

[6]     J. C. van Gemert, J. Geusebroek, C. J. Veenman, C. G. M. Snoek, and A. W. M. Smeulders. Robust scene categorization by learning image statistics in context. In CVPR, Semantic Learning Workshop, New York, NY, June 17–22, 2006.

[7]     A. Patel and D. R. Kasat. Performance analysis of various feature detector and descriptor for real-time video based face tracking.

[8]     Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool "SURF: Speeded Up Robust Features", Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346–359, 2008.

[9]     E. Rosten and T. Drummond, “Machine learning for high speed corner detection,” in 9th European Conference on Computer Vision, vol. 1, 2006, pp. 430–443.

[10]     J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. of British Machine Vision Conference, pages 384–396, 2002.

[11]     Ethan Rublee, Vincent Rabaud, Kurt Konolige, Gary Bradski "ORB: an efficient alternative to SIFT or SURF", Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.

[12]     M. Mostajabi. Workshop tutorial, 2011. http://ttic.uchicago.edu/~mostajabi/Tutorial.html

[13]     L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.

[14]     A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.

[15]     S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.