Investigate 'bootstrap aggregation' implementation
We need to investigate the requirements to streamline the following bootstrap aggregation methods:
import random
import numpy as np
import sklearn.ensemble


def bagging(D, Y, X, be=None, k=10):
    # D is the dataset of features
    # Y is the classification of each point in D
    # X is the points to be classified
    # be is the base estimator (which classifier type to ensemble?);
    # the default is decision trees
    # k is the number of base estimators to use
    features = 1.0
    # features is how many features to use for each estimator.
    # 1.0 means use them all.
    samples = 1.0
    # samples is how many samples to draw for bagging purposes
    # (see the explanation of bagging below)
    # set up the ensemble classifier
    bagc = sklearn.ensemble.BaggingClassifier(base_estimator=be,
                                              n_estimators=k,
                                              max_samples=samples,
                                              max_features=features,
                                              verbose=0)
    # train the classifier
    bagc.fit(D, Y)
    # return an array of classifications
    return bagc.predict(X)


# the dataset
D = [[0, 1],
     [0, 2],
     [3, 4],
     [5, 6],
     [6, 3],
     [4, -2],
     [-1, 4],
     [2, 3],
     [9, -1],
     [-2, -4],
     [-1, -5],
     [10, 0],
     [-10, 0]]
# assign each of the above points to a random class
Y = [random.choice(['A', 'B']) for _ in D]
# put it into the right format for sklearn
D = np.array(D)
Y = np.array(Y)

# generate a single random point to classify
X = np.random.randint(low=-10, high=10, size=(1, len(D[0])))
# print out the point and its classification
XC = bagging(D, Y, X, None, 20)
print("Test 1:")
for xc in range(len(XC)):
    print(X[xc], XC[xc])

# generate 10 random points to classify
X2 = np.random.randint(low=-10, high=10, size=(10, len(D[0])))
# print out the 10 points and their classifications
XC2 = bagging(D, Y, X2, None, 30)
print("Test 2:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
What is a decision tree?
A decision tree typically uses several decisions to classify a point.
For example:
if x > 0 and y = Red, then class A
else if x = 0 and y = Red, then class B
else if y = Blue, then class C
else, class D
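As a rough sketch, that rule could be written as nested conditionals (the feature names x and y are just placeholders from the example):

def classify(x, y):
    # a hand-written "decision tree" matching the rule above
    if x > 0 and y == 'Red':
        return 'A'
    elif x == 0 and y == 'Red':
        return 'B'
    elif y == 'Blue':
        return 'C'
    else:
        return 'D'

print(classify(3, 'Red'))   # prints A
print(classify(0, 'Blue'))  # prints C

A real decision tree learns splits like these from the training data instead of having them written by hand.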
======================================
What is an ensemble?
An ensemble classifier usually uses multiple decision trees to classify a point, but it can also combine other classifiers.
The ensemble typically uses either a vote or an average to produce a prediction. In a vote, the class that the majority of classifiers predict is chosen. In an average, the probabilities of each class as predicted by each classifier are averaged and the one with the highest probability is chosen.
So if the first three classifiers predict that the point is class A, but the fourth and fifth classifiers predict class B, then a voting ensemble will predict class A.
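As a rough sketch of the voting step (not tied to sklearn's internals), a majority vote over the individual predictions could look like this:

from collections import Counter

def majority_vote(predictions):
    # predictions holds one class label per base classifier,
    # e.g. ['A', 'A', 'A', 'B', 'B']
    counts = Counter(predictions)
    # return the label chosen by the most classifiers
    return counts.most_common(1)[0][0]

print(majority_vote(['A', 'A', 'A', 'B', 'B']))  # prints A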
=======================================
What is bagging (or "bootstrap aggregating")?
Bagging is a way of creating the classifiers for the ensemble.
The bagging method randomly chooses M points from a training dataset (with replacement, so duplicates are possible). Then, it fits a classifier based on this data.
It does this process N times, resulting in N classifiers. These classifiers are then used in a typical ensemble classifier with either voting or averages to predict an outcome.
In the code, M is represented by "samples". N is represented by "k".
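Purely to illustrate the resampling step, here is a minimal sketch (the function name and structure are mine, not sklearn's internals):

import random

def bootstrap_samples(D, Y, M, N):
    # draw N resampled training sets of M points each,
    # sampling index positions from (D, Y) with replacement
    sets = []
    for _ in range(N):
        idx = [random.randrange(len(D)) for _ in range(M)]
        sets.append(([D[i] for i in idx], [Y[i] for i in idx]))
    return sets

Each of the N resampled (D, Y) pairs would then be used to fit one base estimator; sklearn's BaggingClassifier does this bookkeeping internally.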
This is very good. Assume that I don't understand the meaning of the variables for this bagging classifier and only know basic statistics (dependent vs. independent variables). What does each point in the dataset D represent? For example, the current database is unstructured, but we will assume it becomes structured. Specifically, every dataset ingested into the application should, in theory, be usable for any algorithm, with the caveat that its properties will really define how the dataset(s) can be used. Perhaps also describe X and Y more specifically.
Could you also research the implementation of the bagging regressor? When we create the Python wrappers to these functionalities, it should be pretty trivial: just a matter of querying things out of MongoDB and ingesting them into the format you have demonstrated.
Each point in the dataset is typically a series of measurements or attributes about the thing being measured.
For example, if we take a group of people and for each person we measure their height in cm and their weight in lbs, one such point (x,y) might look like this:
(500cm, 120lbs)
Attributes can also include set based values such as eye color. Even though they're non-numeric ('Brown', 'Green', 'Blue'), decision trees will easily handle them (where other algorithms will be less forgiving).
So D, in the case of height and weight, could look like this: np.array([[500, 120], [400, 123], [520, 150]]). It represents three people's data: one person is 500cm and 120lbs, another is 400cm and 123lbs, and the last is 520cm and 150lbs. D is intended to be a training set, which means the classification of every point in it is already known. So we should also include Y, which holds those classifications.
For example, if we were attempting to predict the ethnicity of each of these people based on their measurements, Y might look like:
np.array(['American', 'Korean', 'American'])
which indicates that the person measuring 500cm and 120lbs is American, and so on. The number of items in Y should equal the number of points in D.
The classifier is trained on this data so that when you attempt to predict a point X, say (510cm, 130lbs), it can give its guess of American or Korean based on the data it was provided.
Typically, it is desirable to predict more than one point X, so sklearn asks for an array of points to predict.
This is how basically all classifiers in sklearn work. D, Y, and X are common to just about every classifier, with some exceptions.
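To make this concrete, here is a minimal sketch of the height/weight example with a single decision tree (the numbers are the toy values from above, not real data):

import numpy as np
import sklearn.tree

# training data: (height in cm, weight in lbs) for three people
D = np.array([[500, 120], [400, 123], [520, 150]])
# the known classification of each point in D
Y = np.array(['American', 'Korean', 'American'])
# the point(s) we want to classify
X = np.array([[510, 130]])

clf = sklearn.tree.DecisionTreeClassifier()
clf.fit(D, Y)          # train on the labeled data
print(clf.predict(X))  # prints one predicted label, e.g. ['American']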
===============================
As for bagging specifically: it attempts to improve the stability of certain algorithms by using multiple "different" training datasets, all generated from the same dataset, to train several classifiers, which then vote on the class prediction of new points.
As an example, let's use the height and weight data again, extended with a few more people:
[500,120], [400,123], [520,150], [450,160], [600,180], [580,220]
Let's assume samples=4 and k=3.
Then the bagging classifier creates 3 decision trees from 3 different training sets. The three training sets, D1, D2, and D3, might look like this:
D1 = [[500,120], [500,120], [450,160], [600,180]]
D2 = [[580,220], [600,180], [580,220], [500,120]]
D3 = [[450,160], [580,220], [500,120], [520,150]]
Each sample dataset is randomly chosen from the initial dataset D, with replacement. So duplicates may be chosen - note that [500,120] is repeated in D1. Also, since it's random, some points may never be used - note that [400,123] is in none of the new datasets.
D1, D2, and D3 are then used to create three decision trees, T1, T2, and T3.
Here, k=3 is the number of trees and samples=4 is the number of points in each sample dataset.
These trees are created and stored internally.
Then, when presented with the points in X to predict, the classifier makes a prediction with each T1, T2, and T3.
For example, T1 might predict the sample point describes a Korean, while T2 and T3 might agree it describes an American.
The ensemble recognizes that there are more votes to predict an American, so it returns "American" for this point.
The number of trees/classifiers k should be greater than the total number of classes, and it should also be an odd number.
If there are 3 classes and k=3, then each tree could predict a different class and the ensemble wouldn't be able to come up with an answer. Similarly, if k is even, there could be a 2-2 split between the classifiers and the ensemble wouldn't be able to decide.
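Mirroring this worked example with sklearn directly: max_samples also accepts an integer count of points per resample, rather than a fraction. The ethnicity labels below are invented purely so the sketch runs; they are not part of the discussion above.

import numpy as np
import sklearn.ensemble

D = np.array([[500, 120], [400, 123], [520, 150],
              [450, 160], [600, 180], [580, 220]])
# hypothetical labels, one per person, just for illustration
Y = np.array(['American', 'Korean', 'American',
              'Korean', 'American', 'American'])

# k=3 trees, each fit on a bootstrap sample of 4 points drawn
# with replacement (bootstrap=True is the default)
bagc = sklearn.ensemble.BaggingClassifier(n_estimators=3, max_samples=4)
bagc.fit(D, Y)
print(bagc.predict(np.array([[510, 130]])))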
import random
import numpy as np
import sklearn.ensemble


def baggingR(D, Y, X, be=None, k=10):
    # D is the dataset of features
    # Y is the real-valued target for each point in D
    # X is the points to be predicted
    # be is the base estimator (which regressor type to ensemble?);
    # the default is decision trees
    # k is the number of base estimators to use
    features = 1.0
    # features is how many features to use for each estimator.
    # 1.0 means use them all.
    samples = 1.0
    # samples is how many samples to draw for bagging purposes
    # (see the explanation of bagging above)
    # set up the ensemble regressor
    bagr = sklearn.ensemble.BaggingRegressor(base_estimator=be,
                                             n_estimators=k,
                                             max_samples=samples,
                                             max_features=features,
                                             verbose=0)
    # train the regressor
    bagr.fit(D, Y)
    # return an array of predictions
    return bagr.predict(X)


# the dataset
D = [[0, 1],
     [0, 2],
     [3, 4],
     [5, 6],
     [6, 3],
     [4, -2],
     [-1, 4],
     [2, 3],
     [9, -1],
     [-2, -4],
     [-1, -5],
     [10, 0],
     [-10, 0]]
# assign each of the above points a random real-valued target
Y = [random.uniform(-1.0, 1.0) for _ in D]
# put it into the right format for sklearn
D = np.array(D)
Y = np.array(Y)

# generate a single random point to predict
X = np.random.randint(low=-10, high=10, size=(1, len(D[0])))
# print out the point and its prediction
XC = baggingR(D, Y, X, None, 20)
print("Test 1:")
for xc in range(len(XC)):
    print(X[xc], XC[xc])

# generate 10 random points to predict
X2 = np.random.randint(low=-10, high=10, size=(10, len(D[0])))
# print out the 10 points and their predictions
XC2 = baggingR(D, Y, X2, None, 30)
print("Test 2:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
The BaggingRegressor is used almost the same way as the BaggingClassifier. The only difference I can see in terms of functionality is that the Regressor doesn't predict from a set of classes; it predicts from a range of real numbers. So instead of having Y be a choice between ['A', 'B'] for each point, it's a real number in [-1.0, 1.0] in this case. The range is arbitrary, but it's essentially a way of assigning a "score" to each point instead of a classification. This might be useful if you were trying to rank people's obesity on a scale of 1-10, rather than trying to guess their ethnicity.
What are some of the other (supported / unsupported) options for the base_estimator, other than the decision tree, for either the classifier or the regressor? Can we do random forest, or some other approach? I wonder if, in the future, we can bootstrap a custom estimator (less important right now).
Well, this is actually very close to a random forest already. A random forest works the same way; the only real difference is in how the trees are generated. Bagging is more versatile because it can be applied to any "base estimator", while random forests use decision trees exclusively. In random forests, the decision tree generation is more random.
Bagging uses randomness to generate k trees, but all the trees consider the same set of features. In a random forest, each split within each tree considers only a random subset of the features.
It does this primarily to reduce the correlation between trees.
There's a RandomForestClassifier which is separate from the Bagging one.
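For comparison, using that separate RandomForestClassifier directly (reusing the toy D, Y, and X2 from the code above) would look roughly like this:

import sklearn.ensemble

# one forest of 30 trees, rather than a bag of forests
rfc = sklearn.ensemble.RandomForestClassifier(n_estimators=30)
rfc.fit(D, Y)
print(rfc.predict(X2))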
The base estimator can basically be any type of classifier. You just give it an object and it'll train the data using the bagging method on that classifier instead of decision trees and apply the vote method. Example:
BaggingClassifier(KNeighborsClassifier())
Not all classifiers work as well as decision trees for bagging, which is why sklearn uses trees as the default.
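Spelled out with its imports (and again reusing the toy D, Y, and X2 from above), that example would look something like this; n_estimators is left at its default of 10:

import sklearn.ensemble
import sklearn.neighbors

# bag 10 k-nearest-neighbors classifiers instead of decision trees
bagc = sklearn.ensemble.BaggingClassifier(
    base_estimator=sklearn.neighbors.KNeighborsClassifier())
bagc.fit(D, Y)
print(bagc.predict(X2))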
Can the API readily support the following classifiers and regressors as its base estimator?
- svm
- svr
- random forest
- knn
- decision tree (their default)
- some other custom estimator
How would we be able to pipe the above estimators using the sample bagging functions you've outlined?
Ah, I took a little more time to think about your statement:
The base estimator can basically be any type of classifier. You just give it an object and it'll train the data using the bagging method on that classifier instead of decision trees and apply the vote method. Example:
BaggingClassifier(KNeighborsClassifier())
So basically we just supply the estimator as an argument to the bagging classifier or regressor. However, when you say any type of classifier, I'm still curious whether this means literally any classifier and regressor. I wonder if we can actually use any of this non-exhaustive list:
- svm
- svr
- random forest
- knn
- decision tree (their default)
- some other custom estimator
How compatible is the sklearn API between bagging and the above classifiers and regressors?
I'll do a test in a bit and see
import sklearn.neighbors
import sklearn.svm

# these tests reuse D, Y, X2 and the bagging() wrapper
# defined in the classifier example above

# using svms instead of decision trees
XC2 = bagging(D, Y, X2, sklearn.svm.SVC(), 30)
print("SVM:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# using k nearest neighbors
XC2 = bagging(D, Y, X2, sklearn.neighbors.KNeighborsClassifier(), 30)
print("KNN:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# using random forests instead of decision trees (note this means
# MULTIPLE random forests, not just one forest)
XC2 = bagging(D, Y, X2, sklearn.ensemble.RandomForestClassifier(), 30)
print("Random Forests:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# using multiple bagging classifiers in a bagging classifier
XC2 = bagging(D, Y, X2, sklearn.ensemble.BaggingClassifier(), 30)
print("Bagging:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
those all work without errors
SVR doesn't fit into BaggingClassifier because it's regression-based, not classifier-based. It'd most likely fit into BaggingRegressor, though.
That's great! My guess is that all of them would work for the BaggingRegressor, except the svm? Basically, the inverse of your previous statement.
import sklearn.neighbors
import sklearn.svm

# these tests reuse D, Y (the real-valued targets), X2, and the
# baggingR() wrapper defined above

# support vector regression
XC2 = baggingR(D, Y, X2, sklearn.svm.SVR(), 30)
print("SVR:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# k nearest neighbors
XC2 = baggingR(D, Y, X2, sklearn.neighbors.KNeighborsRegressor(), 30)
print("KNN:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# random forests
XC2 = baggingR(D, Y, X2, sklearn.ensemble.RandomForestRegressor(), 30)
print("Random Forests:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])

# a bagging regressor inside a bagging regressor
XC2 = baggingR(D, Y, X2, sklearn.ensemble.BaggingRegressor(), 30)
print("Bagging:")
for xc2 in range(len(XC2)):
    print(X2[xc2], XC2[xc2])
you'd have to use the regressor for each since the classes are in a different format
Classifiers are trying to sort points into a finite set of classes (A, B, C), whereas Regressors map them onto a continuous range of real numbers (anywhere from -10.0 to 10.0, in this case).
I'm a little confused by:
you'd have to use the regressor for each since the classes are in a different format
Makes sense that classifiers are finite by nature and regressors live within the real number set. But even though the real numbers within -10 to 10 are already infinite, do you mean the restriction on the regression targets is -INF to +INF? That doesn't make sense to me; is it even worth noting?
Also, I like the generalized outline you have above. I'm thinking maybe we can do one, or maybe two, more concrete examples (one of each). Then, perhaps we can either rearrange all current issues, or simply dedicate a [release-1.0] branch and merge everything related to this research into that branch. The latter makes more sense to me. For example, we can start making wrapper functions for each of the above estimators, then retain the objects in case we want to supply them to the bootstrap aggregation methods. So we'll probably need to make a wrapper for the bootstrap aggregation methods, along with one for each of the above estimators.
Though it's not essential that we do the 1-2 examples, it would be nice. If the BaggingClassifier and BaggingRegressor accept the returned objects of the corresponding estimators you've outlined, that's great news in itself. If that's the case, it seems we are in a good position to create some wrappers; up to you.
Makes sense that classifiers are finite by nature and regressors live within the real number set. But even though the real numbers within -10 to 10 are already infinite, do you mean the restriction on the regression targets is -INF to +INF? That doesn't make sense to me; is it even worth noting?
I just mean that Classifiers use a fixed set of classes, using only the things provided in Y, whereas Regressors use real numbers, which can include values not explicitly stated in Y. So if Y looks like [3.4, 2.5, 2.2], a Classifier could only ever predict those 3 numbers, but a Regressor would be able to predict any real number in that range.
So if you were to attempt to use a Regressor in a BaggingClassifier, the Regressor could potentially return 2.23 as its "vote", and the Classifier wouldn't understand this because it thinks the only possible votes are "3.4", "2.5", and "2.2".
Similarly, if you attempted to use a Classifier in a Regressor, the Classifier is built to handle distinct classes, including strings, so sklearn prevents you from feeding its output into a BaggingRegressor because it wouldn't make sense to do so.
You could theoretically use the Y values [1, 2, 3] for both classifiers and regressors, but it wouldn't really make sense to feed a Classifier into a Regressor from a math perspective, because you'd get results like 1.5 and it'd be meaningless if all the training data was classified into integers.
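Here is a tiny sketch of that difference, using numeric labels so that the same Y can be fed to both (purely illustrative values):

import numpy as np
import sklearn.ensemble

D = np.array([[1.0], [2.0], [3.0]])
Y = np.array([3.4, 2.5, 2.2])
X = np.array([[1.5], [2.5]])

# the classifier treats 3.4, 2.5, and 2.2 as three distinct classes,
# so its predictions can only ever be one of those three values
clf = sklearn.ensemble.BaggingClassifier(n_estimators=5)
clf.fit(D, Y)
print(clf.predict(X))

# the regressor treats them as points on a number line, so its averaged
# predictions can land in between, on values that never appear in Y
reg = sklearn.ensemble.BaggingRegressor(n_estimators=5)
reg.fit(D, Y)
print(reg.predict(X))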
I think it's possible to obtain real-valued predictions outside the range of your dataset, especially for poorly trained models. In the simplest linear case, you'd hope the predictions land on that nice line (at least within some reasonable epsilon), but sometimes the value you obtain doesn't quite fit the line and there is some error. Similarly, multi-dimensional regression models will have similar potential errors. But you probably meant to say that it could be any real number, within the training set's range or otherwise. That's a nice tie-in with Y. It's been some time since I have had the chance to review these concepts.
Ohh I get what you were saying
Yeah, I think it's possible to get numbers outside the range, though it depends on the base estimator: the default decision trees only ever average target values they've seen, so their predictions stay within the training range, whereas a base estimator that extrapolates (a linear model, say) can go well outside it if you ask it to predict some outrageous point in comparison to your training data.
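For what it's worth, here is a quick sketch of how much that depends on the base estimator (toy, noiseless data, purely illustrative):

import numpy as np
import sklearn.ensemble
import sklearn.linear_model

# toy 1-D data: y is exactly 2*x, so the targets range from 0 to 20
D = np.array([[float(i)] for i in range(11)])
Y = np.array([2.0 * i for i in range(11)])
X_far = np.array([[1000.0]])  # an "outrageous" point far outside the data

# tree-based bagging only averages target values it has seen,
# so the prediction stays within the training range (near 20)
trees = sklearn.ensemble.BaggingRegressor(n_estimators=10)
print(trees.fit(D, Y).predict(X_far))

# a linear base estimator extrapolates, so the prediction can land
# far outside the training range (near 2000 here)
lines = sklearn.ensemble.BaggingRegressor(
    base_estimator=sklearn.linear_model.LinearRegression(),
    n_estimators=10)
print(lines.fit(D, Y).predict(X_far))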