Bag-of-words-for-scene-classification
Compare various feature extraction techniques and classify using BOW
Evaluating Bag of Features in Scene Classification
The demand for automation in today's world has led to unsupervised learning. Bag of Words, one such unsupervised technique, dates back only about a decade, yet research in this area has been extensive. This project presents a detailed study of the implications of feature detection and extraction techniques for the Bag of Words application. Previous studies have not attempted a detailed analysis of dimensional, time, and space complexity. We also present to readers the different scenarios in which a particular method is beneficial over the others.
Introduction
Scene classification is one of the fastest-advancing areas in the field of image processing. Its uses in weather forecasting, browsing for a particular class of images, recognition of objects within a scene, and medical diagnosis drive the demand for efficient scene classification algorithms. Where scene classification deals with images, text mining deals with textual words; with the same underlying model, the bag-of-words idea from text mining carries over naturally to images.
Researchers have come up with various feature detection and extraction algorithms. In this paper, we implement some prominent local feature extraction methods. Feature extraction is the retrieval of important information that describes a system, often better than the system itself. We believe each feature extraction technique suits best the kind of system it was designed for; this report therefore attempts to provide this correlation between system and extraction technique. We then experiment with our features, providing dimensional and time analysis. Further, we show a real-world application of feature extraction in scene classification.
Bag of Words, a.k.a. Bag of Visual Words when applied to images, is an unsupervised classification method: it clusters histograms of image parts. The term dictionary in Bag of Words is analogous to a linguistic dictionary, in that it contains histogram representations of image parts, and the term bag of words symbolizes clustering these histograms into classes; the different classes thereby represent the classification of the images. This project also presents the added dimensional complexity of scene classification using Bag of Words, which most researchers have not dealt with. Our motivation for this project budded from
Related Work
Scene classification is a rigorously researched area. Vailaya et al. [1] are regarded as among the forerunners in using global features to classify images. To mention a few prominent research papers in the field of bag of words:
The first paper on scene classification using BOW, by Lazebnik et al. [3], segregated the image into grids; the spatial pyramid model was then built by taking histograms of these grids. A second paper on bag of words, by Monay et al. [4], used a probabilistic approach: they built upon the probability of classification using BOW, finding the semantic difference between natural and man-made scenes.
Comparisons of local feature extraction and detection are found in abundance. To mention one, the paper by Chao et al. [2] compared various feature detectors and descriptors for image retrieval.
As we can see, the comparison of detectors and descriptors for BOW still has a long way to go. We also observe that scene classification using BOW produces better results with local feature extraction than with global. Therefore, in this report, we explore various local feature extraction techniques and aim to provide the best design specific to the system.
Methods
The first step in scene classification is to detect and extract features. We evaluate various feature detection and extraction algorithms on several databases: the UIUC sports database, the MIT indoor database, and the datasets described in the Experiment section.
A. Feature Detection and Extraction
Feature detection is corner or blob detection in an image. The detected points are called key regions of interest, as they can be located easily even under image transformations. A feature extractor, in turn, computes a descriptor for each detected key point.
Figure: key points from different feature extractors. Right, from top: SIFT, SURF, BRIEF, FAST. Left, from top: STAR, MSER, GFFT, DENSE.
1. Scale-invariant feature transform - SIFT
In general, feature detection proceeds in three steps: first, find sampling points, called key regions or special regions; second, map the key points in order to achieve rotational invariance; and last, refine the key points, discarding unimportant ones.
The SIFT feature detector, introduced by David G. Lowe, is an improvement over the Harris corner detector: it detects corners even in rotated and scaled images. There are three steps in SIFT feature detection. First, a scale space is built using difference-of-Gaussian filters; since different image scales call for different window sizes, the window size is chosen based on the image scale, and the local extrema of the image are found. Second, key point localization refines the extrema by thresholding. Third, each key point is made invariant to orientation by constructing an orientation map for it.
For feature extraction, the 16 x 16 pixels neighboring a key point are considered and divided into 4 x 4 sub-blocks, giving a total of 16 regions. Computing an 8-bin orientation histogram for each region gives a total of 16 x 8 = 128 dimensions per key point. Over the resulting 128 x n matrix, with n the number of key points, rotational and illumination invariance techniques are applied.
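To make this concrete, here is a minimal sketch of SIFT detection and extraction with OpenCV's Python bindings (the image path is illustrative, and cv2.SIFT_create requires OpenCV 4.4 or newer):

```python
import cv2

# Load an image in grayscale (path is illustrative).
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT detector and extractor in one object.
sift = cv2.SIFT_create()

# keypoints: list of cv2.KeyPoint; descriptors: float32 array of shape (n, 128),
# one 128-dimensional vector per key point, matching the count above.
keypoints, descriptors = sift.detectAndCompute(img, None)
```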
2. Speeded up Robust features – SURF
SURF is a faster version of SIFT; however, our experiments show that the SURF feature extractor is less efficient than SIFT's. Compared to SIFT, SURF adds a Laplacian-of-Gaussian (LoG) approximation and box filtering. Using box filters with integral images, the convolutions can be computed cheaply, in parallel, for all key points. To achieve scale invariance, SURF uses the Hessian matrix; for rotational invariance, it uses Gaussian-weighted wavelet responses around each key point.
For feature extraction, SURF uses Haar wavelet responses. The 20 x 20 pixels around each detected key point are considered and segregated into 4 x 4 sub-regions. Computing the vector V = (∑dx, ∑dy, ∑|dx|, ∑|dy|) for each sub-region yields a 16 x 4 = 64-dimensional feature vector per key point.
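SURF lives in the opencv-contrib xfeatures2d module and may be unavailable in builds without the non-free algorithms; a minimal sketch, assuming such a build:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# SURF from the contrib module; hessianThreshold controls detector sensitivity.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

# descriptors: float32 array of shape (n, 64), per the 64 dimensions above.
keypoints, descriptors = surf.detectAndCompute(img, None)
```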
3. Features from accelerated segment test - FAST
The FAST feature detection algorithm emerged for real-time, high-speed applications; it was proposed by Rosten et al. [9]. In FAST, key points are selected by thresholding. For example, for a pixel p, the 16 pixels lying on a circle around it are considered, and their intensities are compared with that of p; then p is considered a key point only if it is the maximum or minimum of the set. A similar, quicker corner test is performed first, considering only 3 surrounding pixels.
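A minimal OpenCV sketch of the FAST detector (the threshold value is an assumed choice):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# threshold: intensity difference used in the segment test;
# nonmaxSuppression thins out clustered responses (see limitations below).
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)

# FAST is a detector only, so it returns key points but no descriptors.
keypoints = fast.detect(img, None)
```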
There are certain limitations to FAST feature detection: its high-speed test can discard genuine key points, and the key points it derives tend to lie too close to each other. The latter limitation can be resolved by using non-maximal suppression, which keeps the distance between key points above a constant threshold value. Even so, FAST feature detection excels over Harris corner detection in detecting corners.
4. Maximally Stable Extremal Regions - MSER
MSER was introduced by Matas et al. [10], who considered correspondences between images. MSER is known to work better for blurred images, and it achieves higher classification results for images with different illumination levels. As MSER considers correspondence points, it outputs key regions instead of key points. Thresholding ensures that stable regions are selected over unstable ones, and multi-scale detection of key regions through a multi-scale pyramid helps achieve scale invariance. Overall, MSER is one of the strongest feature detectors, as it is blur, illumination and scale invariant.
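A minimal OpenCV sketch; note that MSER can return its stable regions directly, or key points through the generic detector interface:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

mser = cv2.MSER_create()

# Key regions as pixel lists plus their bounding boxes.
regions, bboxes = mser.detectRegions(img)

# Or key points, for feeding a descriptor such as SIFT or SURF.
keypoints = mser.detect(img, None)
```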
5. Oriented FAST and rotated BRIEF – ORB
ORB was first introduced by Rublee et al. [11] as an alternative to SIFT and SURF. ORB is essentially a fusion of modified FAST and BRIEF. The steps involved in ORB feature detection are as follows: first, key points are found with the FAST algorithm and ranked with the Harris corner measure. Rotational invariance is then achieved by computing the intensity centroid of each key point's patch: the direction of the vector from the key point to this centroid is taken as the key point's orientation. Therefore, ORB implements FAST and additionally provides orientation information.
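A minimal OpenCV sketch of ORB (the feature budget is an assumed choice):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# FAST key points ranked by the Harris measure, plus rotated BRIEF descriptors.
orb = cv2.ORB_create(nfeatures=500)

# descriptors: uint8 array of shape (n, 32), i.e. 256-bit binary strings.
keypoints, descriptors = orb.detectAndCompute(img, None)
```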
6. Good features to track – GFFT
GFFT was introduced by Shi and Tomasi to detect corners. A gradient matrix is computed for each pixel using integral values of the image, so the per-pixel cost is constant. Local maxima of the resulting corner measure select the optimal features. However, with motion effects, there is a resulting error due to the aperture problem. Even though thresholding and
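In OpenCV, GFFT is exposed as cv2.goodFeaturesToTrack; a minimal sketch (all parameter values are assumed choices), including the conversion to key points so that a descriptor can be computed on the corners:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Keep up to 500 corners whose quality is at least 1% of the best corner,
# spaced at least 10 pixels apart.
corners = cv2.goodFeaturesToTrack(img, maxCorners=500,
                                  qualityLevel=0.01, minDistance=10)

# Wrap the (x, y) locations as cv2.KeyPoint so an extractor such as SIFT
# can describe them; 10 is an assumed patch size.
keypoints = [cv2.KeyPoint(float(x), float(y), 10)
             for x, y in corners.reshape(-1, 2)]
```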
7. DENSE features
The DENSE feature detector provides good coverage of the entire image. Key points are sampled in steps, and the pixels whose contrasts vary the most are considered. Step-wise sampling helps extract all of the important information in an image, and more information can be extracted by letting the patches overlap. However, all of this increases dimensional complexity.
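OpenCV 3 and later no longer ship a built-in dense detector, so the grid is typically built by hand; a sketch, with the step size as an assumed parameter:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
step = 8  # grid spacing in pixels; a smaller step (or a larger key point
          # size, for overlapping patches) extracts more at higher cost

# Place one key point at every grid position.
keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
             for y in range(0, img.shape[0], step)
             for x in range(0, img.shape[1], step)]

# Describe the fixed locations, e.g. with SIFT.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.compute(img, keypoints)
```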
8. Binary Robust Invariant Scalable Key points – BRISK
BRISK, unlike ORB, has a hand-crafted sampling pattern, as shown in Fig. 4. A Gaussian filter is applied to the different sampling regions separately. A distance threshold then divides the sampling-point pairs into short-distance, long-distance and unused pairs, as shown in Fig. 5: short-distance pairs are used to compute the intensity comparisons of the descriptor, whereas long-distance pairs are used to find the orientation. Like ORB, BRISK performs better under viewpoint change.
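A minimal OpenCV sketch of BRISK (the detection threshold is an assumed choice):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# thresh: FAST-style detection threshold used by BRISK's detector stage.
brisk = cv2.BRISK_create(thresh=30)

# descriptors: uint8 array of shape (n, 64), i.e. 512-bit binary strings.
keypoints, descriptors = brisk.detectAndCompute(img, None)
```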
9. STAR feature detection
STAR, based on the Center Surround Extrema (CenSurE) detector, was introduced by Agrawal et al. STAR uses 2 rotated squares to approximate a circular center-surround filter when extracting key points.
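Like SURF, the STAR detector lives in the opencv-contrib xfeatures2d module; a minimal detection sketch, assuming a contrib build:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# STAR (CenSurE) is a detector only; descriptors are computed separately,
# e.g. with SIFT as in the fusion example later in this report.
star = cv2.xfeatures2d.StarDetector_create()
keypoints = star.detect(img, None)
```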
B. Feature Matching
1. Brute Force Matcher – BF Matcher
The brute-force matcher takes the descriptor of one feature of an image and iterates over the features of the other image until it finds the matching key point; here, Hamming distance is used for binary descriptors. The brute-force matcher increases time complexity, as it iterates over all of an image's key points.
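A minimal sketch of brute-force matching in OpenCV; descriptors_a and descriptors_b stand for binary descriptor arrays (e.g. ORB output) from two images:

```python
import cv2

# Hamming distance suits binary descriptors such as ORB or BRISK;
# crossCheck keeps only matches that agree in both directions.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

matches = bf.match(descriptors_a, descriptors_b)
matches = sorted(matches, key=lambda m: m.distance)  # best matches first
```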
2. FLANN –Fast Library for Approximate Nearest Neighbors
Unlike the BF matcher, FLANN doesn't iterate over all the features of the image; using search trees, it performs an approximate search. For high-dimensional features and large datasets, FLANN comes with a built-in fast nearest-neighbor search over the key points. Comparative studies between the BF matcher and FLANN reveal the BF matcher to be more accurate; however, the FLANN-based matcher is faster, even for high-dimensional key points and large datasets. Since the Bag of Visual Words in our project clusters over a million features, we stick to the FLANN matcher.
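A minimal FLANN sketch for float descriptors such as SIFT or SURF; descriptors_a and descriptors_b again stand for two images' descriptor arrays, and the ratio-test constant is an assumed choice:

```python
import cv2

# KD-tree index (algorithm 1 = FLANN_INDEX_KDTREE) for float descriptors;
# 'checks' trades search accuracy for speed.
index_params = dict(algorithm=1, trees=5)
search_params = dict(checks=50)
flann = cv2.FlannBasedMatcher(index_params, search_params)

# Two nearest neighbors per query descriptor, filtered with Lowe's ratio test.
matches = flann.knnMatch(descriptors_a, descriptors_b, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
```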
C. Bag of Words
Bag of Words is a well-known method for the classification of images and text. It originated in text mining, from the frequency of words in a dictionary. Unlike supervised classification techniques, Bag of Words learns frequencies of occurrence, learns the patterns on its own, and generates a codebook. This codebook is nothing but the histogram representation of each part, as shown in Fig. 5, and is analogous to the words in a dictionary: each word is a cluster of similar histogram patterns. Let us discuss the internal mathematical implementation of Bag of Words.
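A minimal sketch of codebook generation and image encoding, here with scikit-learn's k-means; the codebook size and variable names are illustrative:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# all_descriptors: float32 array stacking the local descriptors of all
# training images, e.g. SIFT output of shape (total_key_points, 128).
V = 200  # codebook size, i.e. number of visual words (assumed choice)
kmeans = MiniBatchKMeans(n_clusters=V, random_state=0).fit(all_descriptors)

def encode(descriptors):
    """Represent one image as a normalized histogram over the V words."""
    words = kmeans.predict(descriptors)           # nearest word per patch
    hist = np.bincount(words, minlength=V).astype(float)
    return hist / hist.sum()
```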
Bag of Words is also used in object recognition, analogously to scene classification; internally, results are evaluated using a confusion matrix. The Naïve Bayes model developed in Natural Language Processing carries over to visual words as follows.
Let c be the image category, V the size of the codebook, and w_n the codebook word assigned to the n-th patch. The image is then represented by the collection of all its patch words:

W = [w_1, w_2, ..., w_N]
The Naïve Bayes classifier is used for categorization. The Bag of Words classifier has to learn the characteristic parts of each category, for example nose and mouth for a face, or wheel and brake for a bike. Assuming the patches are independent given the category, the predicted category is c* = argmax_c p(c) ∏_n p(w_n | c). This part-level categorization is made possible by the Naïve Bayes classifier.
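A sketch of this classifier over raw visual-word counts, using scikit-learn's multinomial Naïve Bayes; the training arrays are illustrative names:

```python
from sklearn.naive_bayes import MultinomialNB

# X_train: (n_images, V) array of visual-word counts per image;
# y_train: the scene category of each image.
nb = MultinomialNB()          # estimates p(c) and p(w | c) with smoothing
nb.fit(X_train, y_train)

# Prediction implements argmax_c p(c) * prod_n p(w_n | c), in log space.
predicted = nb.predict(X_test)
```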
Higher-level extensions of Bag of Words are VLAD encoding and spatial pyramid pooling. Spatial pyramid pooling extracts features at various resolutions. Using spatial pyramid pooling, M. Koskela achieved the highest
D. Classifier
Different machine learning classifiers are governed by different mathematical principles. Let us first consider the unsupervised side: in the Bag of Words pipeline, the unsupervised step is the clustering that builds the codebook, as described above.
Now let us consider supervised machine learning techniques. There are two steps: first the training of the data, and second the prediction of the class category. The most used supervised machine learning classifiers are KNN and SVM. In terms of accuracy, SVM performs better; in terms of speed, KNN does. Given an integer K, KNN classifies a sample based only on its K nearest neighbors, whereas SVM considers the whole dataset and tries to separate it into as many classes as there are scene categories. SVM separates the data using hyperplanes, but not all sets are linearly separable; hence SVM offers other functions for higher-dimensional spaces, called kernel functions, such as the quadratic function and the Gaussian radial basis function (RBF). Recent evaluations of various SVM kernel functions found that SVM with the RBF kernel classifies best. However, because the RBF kernel corresponds to an infinite-dimensional feature space, its generalization error is said to increase; a dataset that is small compared to the feature vector dimension reproduces a case similar to this infinite dimension. Therefore, in our experiment we use the linear kernel function of SVM.
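A sketch of the linear-kernel SVM over the BOW histograms, using scikit-learn (array names as in the sketches above):

```python
from sklearn.svm import SVC

# Linear kernel, per the discussion above; X_train holds one BOW histogram
# per image, y_train the scene labels.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)

accuracy = svm.score(X_test, y_test)  # mean accuracy on held-out images
```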
E. Cross Validation
Cross validation ensures that no data point is used for training and testing at the same time. In this project, threefold cross validation is used to find the overall accuracy of the set: the data are divided into three parts, 2/3 of which are used for training and the rest for testing. This is repeated three times, covering all of the data for both testing and training. After deriving the accuracy of each split, the overall accuracy is found by averaging.
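The same threefold scheme in scikit-learn, assuming the classifier and the full feature matrix X and labels y from above:

```python
from sklearn.model_selection import cross_val_score

# cv=3: each third of the data is held out once for testing while the
# other two thirds train the model; the three accuracies are averaged.
scores = cross_val_score(svm, X, y, cv=3)
overall_accuracy = scores.mean()
```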
Experiment
Our evaluation included the creation of 5 different datasets with varied levels of complexity.
A. Dataset
Testing of Bag of Words included the creation of a personal dataset. This data was created using a video sequence of
For
Dataset 5 included
B. Evaluation
As the first step of our experiment, we evaluate feature detectors and extractors. The feature detectors SIFT, BRISK, ORB, SURF, FAST, STAR, MSER, GFFT and DENSE are fused with the SIFT and SURF feature extractors, and the execution times of the different fusions are compared. The SURF and SIFT feature extractors produce almost the same results. Plotting normalized time along the positive y axis, as shown in the figure below, shows SIFT, GFFT and MSER to be time efficient. However, this alone is not sufficient to evaluate the time complexity of the Bag of Words application.
Figure: time taken for feature detection.
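Such a detector/extractor fusion amounts to running one algorithm's detect step and another's compute step; a minimal OpenCV sketch pairing FAST detection with SIFT extraction:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Fusion: FAST supplies the key points, SIFT describes them.
fast = cv2.FastFeatureDetector_create()
sift = cv2.SIFT_create()

keypoints = fast.detect(img, None)
keypoints, descriptors = sift.compute(img, keypoints)  # (n, 128) float32
```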
Results
Figure: comparing dimensions of feature detectors.
We once again fuse the various feature detectors with the SIFT and SURF feature extractors, for our
Figure: comparison between SIFT and SURF (top); dimension vs. error rates of feature detectors with the SIFT feature extractor (bottom).
For
Now, comparing results of
Conclusion
According to our observations, the choice of feature extractor and feature detector is specific to the system. Previous studies showed SURF to perform better than SIFT; however, we believe that this is true only in the case of feature detection. From our observations, the fusion of the SURF feature detector and the SIFT feature extractor is the most efficient, although it also has the higher time complexity. Therefore, with parallelization SURF
Additionally, for images with motion effects, GFFT performs worse than otherwise. To our knowledge, GFFT is the fastest feature detector, and it produces optimal results for data without motion effects.
References
[1] A. Vailaya, M. Figueiredo, A. Jain, and H. J. Zhang. Image classification for content-based indexing. IEEE Transactions on Image Processing, 10(1):117-130, 2001.
[2] Chao et al. Performance comparison of various feature detectors and descriptors for image retrieval.
[3] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2169-2178, 2006.
[4] F. Monay, P. Quelhas,
[5] M. R. Boutell, J. Luo, and C. M. Brown. Factor graphs for
[6] J. C. van Gemert, J. Geusebroek, C. J. Veenman, C. G. M. Snoek, and A. W. M. Smeulders. Robust scene categorization by learning image statistics in context. In CVPR Semantic Learning Workshop, New York, NY, June 2006.
[7] Performance Analysis of Various Feature Detector and Descriptor for
[8] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding (CVIU), 110(3):346-359, 2008.
[9] E. Rosten and T. Drummond. Machine learning for high speed corner detection. In 9th European Conference on Computer Vision (ECCV), vol. 1, pp. 430-443, 2006.
[10] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. of British Machine Vision Conference, pp. 384-393, 2002.
[11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision (ICCV), 2011.
[12] M. Mostajabi. Tutorial: http://ttic.uchicago.edu/~mostajabi/Tutorial.html, 2011.
[13]
[14] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.