HBOS Categorical Implementation
All Submissions Basics:
Closes #21
- [x] Have you followed the guidelines in our Contributing document?
- [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?
- [x] Have you checked all Issues to tie the PR to a specific one?
All Submissions Cores:
- [x] Have you added an explanation of what your changes do and why you'd like us to include them?
- [x] Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
New Model Submissions:
- [x] Have you created a <NewModel>_example.py in ~/examples/?
- [x] Have you linted your code locally prior to submission?
Description
There are several ways to convert the categorical values in a dataset to numerical ones so that HBOS can work with it.
I implemented three of them and left the choice to the user via a new `category` parameter added to the HBOS class.
The methods are:
- One-Hot Encoding.
- Label Encoding.
- Frequency-Ratio Encoding.

Since, as far as I am aware, PyOD does not provide synthesized categorical data (this can be added to the list for future work ;-) ), I tested the implementation on three different real-world categorical datasets, namely Breast Cancer, Car Evaluation, and Tic-Tac-Toe. These can be found in the HBOS_categorical_example.py file.
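To make the three schemes concrete, here is a minimal NumPy sketch of each on a toy column (illustrative only; this is not the code in this PR):

```python
import numpy as np

X = np.array(['red', 'blue', 'red', 'green'])

# Label encoding: map each category to an integer code (sorted order).
cats, labels = np.unique(X, return_inverse=True)
# cats -> ['blue', 'green', 'red'], labels -> [2, 0, 2, 1]

# One-hot encoding: one binary column per category.
one_hot = (labels[:, None] == np.arange(cats.size)).astype(int)

# Frequency-ratio encoding: replace each value by its relative frequency.
_, counts = np.unique(X, return_counts=True)
freq = dict(zip(cats, counts / X.size))
X_freq = np.array([freq[v] for v in X])  # [0.5, 0.25, 0.5, 0.25]
```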
Pull Request Test Coverage Report for Build 915
- 15 of 39 (38.46%) changed or added relevant lines in 1 file are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage decreased (-0.6%) to 94.794%
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
---|---|---|---|
pyod/models/hbos.py | 15 | 39 | 38.46% |
Total: | 15 | 39 |
Totals | |
---|---|
Change from base Build 911: | -0.6% |
Covered Lines: | 3696 |
Relevant Lines: | 3899 |
💛 - Coveralls
Pull Request Test Coverage Report for Build 1022
- 56 of 63 (88.89%) changed or added relevant lines in 2 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage decreased (-0.1%) to 95.642%
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
---|---|---|---|
pyod/models/hbos.py | 31 | 38 | 81.58% |
Total: | 56 | 63 |
Totals | |
---|---|
Change from base Build 999: | -0.1% |
Covered Lines: | 4038 |
Relevant Lines: | 4222 |
💛 - Coveralls
Hi Yue @yzhao062 Please check the PR :) No more tests can be added. BR, Yahya.
Hi @LarsNeR Sorry, but I cannot get your point. Can you be more specific? For example, are you getting a specific error, or are you asking about the logic behind it and how it works? Also, if your local dataset is not confidential, could you please share it so I can investigate further?
Kind regards, Yahya.
@John-Almardeny Sorry, I deleted the first post because I thought I was wrong.
I tried this on my local dataset and noticed that `clf.predict(xi)` is not so straightforward. The categorical data of the data point `xi` has to be converted somehow, but the logic for that is hidden in the classifier and is not used by `.predict()`. Is this correct? At least I don't see it in the code, and I get two different results if I predict on the original data point (with categorical data) versus the manually transformed one (I used frequency encoding).
Edit: No, I am not getting any error, but the results are different. This is how I am calling it:
```python
clf = HBOS(category='frequency')
clf.fit(X_train)

# Doing the encoding on my own
unique_, _counts = np.unique(X_train, return_counts=True)
freq = dict(zip(unique_, _counts / X_train.shape[0]))
func = np.vectorize(lambda x: freq.get(x))
X_train_enc = func(X_train)

clf.predict(X_train_enc[0].reshape(1, -1))  # -> array([0])
clf.predict(X_train[0].reshape(1, -1))      # -> array([1])
```
@LarsNeR
Yes, the logic is hidden inside the `predict()` function. It first checks whether the dataset is categorical or numerical. If it is the latter, it continues as normal without changing anything in the original HBOS implementation. However, if it is categorical, it converts it to numerical following one of the above-mentioned schemes (e.g. One-Hot Encoding, etc.) and then hands it back to HBOS to work on as if it were originally numerical.
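That dispatch could be sketched roughly like this (illustrative only; `make_frequency_encoder` and `maybe_encode` are hypothetical names, not PyOD's actual internals):

```python
import numpy as np

def make_frequency_encoder(X_train):
    """Build a per-value frequency map from the training data
    (same scheme as the snippet above: counts / number of rows)."""
    vals, counts = np.unique(X_train, return_counts=True)
    freq = dict(zip(vals, counts / X_train.shape[0]))
    return np.vectorize(lambda v: freq.get(v, 0.0))

def maybe_encode(X, encoder):
    X = np.asarray(X)
    if np.issubdtype(X.dtype, np.number):
        return X                # numerical input: original HBOS path, untouched
    return encoder(X)           # categorical input: encode first
```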
Please define what you mean by "manually transformed dataset"; it would also help if you provided a snippet of code. Thanks.
@John-Almardeny I just added the code. And it is the same for the example car dataset:
```python
from urllib.request import urlopen
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",", dtype=str)
X, Y = dataset[:, range(dataset.shape[1] - 1)], \
       [1 if i in ('good', 'vgood') else 0 for i in dataset[:, dataset.shape[1] - 1]]

clf = HBOS(category='frequency')
clf.fit(X)

# Doing exactly what 'frequency' does in the fit method
unique_, _counts = np.unique(X, return_counts=True)
freq = dict(zip(unique_, _counts / X.shape[0]))
func = np.vectorize(lambda x: freq.get(x))
X_enc = func(X)

clf.predict(X[0].reshape(1, -1))      # original data, not encoded -> array([1])
clf.predict(X_enc[0].reshape(1, -1))  # encoded data -> array([0])
```
As far as I understand you, I am calling it the wrong way, because I should only call it with the original data point and the `predict()` function does all the encoding magic, right?
But then I don't understand how it can work for the other two encoders, because OneHotEncoder and LabelEncoder always call `fit_transform` on the new data point. Shouldn't they be fitted once (when calling `fit`) and then only `transform` afterwards? I am not saying that anything is wrong; I am just trying to understand it.
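The fit-once/transform-later pattern being asked about can be sketched with a minimal, hypothetical one-hot encoder (not sklearn's or PyOD's actual class):

```python
import numpy as np

class SimpleOneHot:
    """Minimal fit/transform one-hot encoder, for illustration only."""

    def fit(self, X):
        # Learn the category set once, from the training data only.
        self.categories_ = np.unique(X)
        return self

    def transform(self, X):
        # Reuse the learned categories; an unseen value maps to an
        # all-zero row instead of shifting the column layout.
        return (np.asarray(X).reshape(-1, 1) == self.categories_).astype(int)

enc = SimpleOneHot().fit(np.array(['red', 'blue', 'red']))
enc.transform(['blue'])  # -> [[1, 0]]  (categories sorted: blue, red)
```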
@LarsNeR I see now.
You are right; that is because `X` is transformed internally (i.e. inside `predict()`).
So the first `clf.predict(X[0].reshape(1,-1))` is predicting on the input `['vhigh' 'vhigh' '2' '2' 'small' 'low']` before conversion, thus `decision_function()` converts it to the frequencies `[[2. 2. 2. 2. 1. 1.]]` and considers it a whole dataset. Whereas `clf.predict(X_enc[0].reshape(1,-1))` is predicting on the right frequencies coming from your manual conversion, giving `[[0.5 0.5 0.58333333 0.58333333 0.33333333 0.83333333]]`, which is the same value inside `fit()` for the same observation.
So thanks very much for pointing it out. All that needs to be done is to return `X` from the `fit()` function and all will be good (I will push it in a commit after a while). Then your example will give the same result, as follows:
```python
clf = HBOS(category='frequency')
X_ = clf.fit(X)  # <--- here I am getting X after conversion

# Doing exactly what 'frequency' does in the fit method
unique_, _counts = np.unique(X, return_counts=True)
freq = dict(zip(unique_, _counts / X.shape[0]))
func = np.vectorize(lambda x: freq.get(x))
X_enc = func(X)

print(clf.predict(X_[0].reshape(1, -1)))    # prediction should be on X_ -> array([0])
print(clf.predict(X_enc[0].reshape(1, -1))) # encoded data -> array([0])
```
Side note: please consider updating scikit-learn, since they change their code base frequently: `sudo -H pip install -U scikit-learn`
@John-Almardeny Yes, that describes it really well. But that would not work for completely new data points then, right? So if, after fitting the classifier, a new data point comes in and we want to predict on it.
@LarsNeR No, it will not, because the frequencies are calculated from the original dataset (which might be thousands of observations and tens of categories), and a new unseen observation might contain a new category. You can consider the existing implementation as Version 1; I will add the feature you mentioned to the To-Do list.
Many thanks, Yahya.
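A hedged sketch of how a future version might handle this: persist the frequency map learned in `fit()` and fall back to a default value for categories never seen in training (the class and names below are hypothetical, not part of this PR):

```python
import numpy as np

class FrequencyEncoder:
    """Illustrative frequency encoder with a proper fit/transform split."""

    def fit(self, X):
        # Learn frequencies once, from the training data only
        # (same counts / rows scheme used earlier in this thread).
        vals, counts = np.unique(X, return_counts=True)
        self.freq_ = dict(zip(vals, counts / X.shape[0]))
        return self

    def transform(self, X, unseen=0.0):
        # Reuse the stored map; a category never seen in training
        # falls back to `unseen` instead of raising or refitting.
        return np.vectorize(lambda v: self.freq_.get(v, unseen))(X)

enc = FrequencyEncoder().fit(np.array([['a'], ['a'], ['b']]))
enc.transform(np.array([['c']]))  # unseen 'c' -> [[0.0]]
```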
@John-Almardeny Okay, great. Thank you, now I understand it better too.
Hi @yzhao062 If this is not going anywhere, please let me know so I can close it!
Nah, it should be fine, but it is just too old and I couldn't merge it easily. Would you mind rebasing it onto the current development branch? We could then merge it easily. The same goes for the other PRs as well :) I am also trying to clear the pipeline.