
HBOS Categorical Implementation

Open John-Almardeny opened this issue 5 years ago • 13 comments

All Submissions Basics:

Closes #21

  • [x] Have you followed the guidelines in our Contributing document?
  • [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • [x] Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

  • [x] Have you added an explanation of what your changes do and why you'd like us to include them?
  • [x] Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?

New Model Submissions:

  • [x] Have you created a <NewModel>_example.py in ~/examples/?
  • [x] Have you linted your code locally prior to submission?

Description

There are several ways to convert categorical values to numerical ones in a given dataset, so HBOS can work with it.

I implemented three of them and left the choice to the user through a new parameter, category, that has been added to the HBOS class. The methods are:

  1. One Hot Encoding.
  2. Label Encoding.
  3. Frequency Ratio Encoding.

Since, as far as I am aware, PyOD does not provide synthesized categorical data (this could be added to the list for future work ;-) ), I tested the implementation on three real-world categorical datasets: Breast Cancer, Car Evaluation, and Tic Tac Toe. They can be found in the HBOS_categorical_example.py file.
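
For readers unfamiliar with the three schemes, here is a minimal NumPy-only sketch of what each one produces. It is not the code from this PR: the toy data and the helper names (one_hot, label, frequency_ratio) are made up for illustration, and it counts frequencies per column, whereas the snippets later in this thread count over the whole array.

```python
import numpy as np

# Toy categorical data: 4 samples, 2 features (invented for illustration).
X = np.array([
    ["red", "small"],
    ["blue", "small"],
    ["red", "large"],
    ["red", "small"],
])


def one_hot(column):
    """One binary indicator column per distinct category (sorted order)."""
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float)


def label(column):
    """One integer code per distinct category (sorted order)."""
    _, codes = np.unique(column, return_inverse=True)
    return codes


def frequency_ratio(column):
    """Replace each value by the share of rows in which it occurs."""
    _, codes, counts = np.unique(column, return_inverse=True, return_counts=True)
    return counts[codes] / len(column)


n_features = X.shape[1]
X_one_hot = np.hstack([one_hot(X[:, j]) for j in range(n_features)])
X_label = np.column_stack([label(X[:, j]) for j in range(n_features)])
X_freq = np.column_stack([frequency_ratio(X[:, j]) for j in range(n_features)])
```

One-hot encoding widens the matrix (here from 2 to 4 columns), while label and frequency ratio encoding keep one numeric column per original feature.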

John-Almardeny avatar May 14 '19 14:05 John-Almardeny

Pull Request Test Coverage Report for Build 915

  • 15 of 39 (38.46%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.6%) to 94.794%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pyod/models/hbos.py | 15 | 39 | 38.46%
Total | 15 | 39 |

Totals
Change from base Build 911: -0.6%
Covered Lines: 3696
Relevant Lines: 3899

💛 - Coveralls

coveralls avatar May 14 '19 15:05 coveralls

Pull Request Test Coverage Report for Build 1022

  • 56 of 63 (88.89%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.1%) to 95.642%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pyod/models/hbos.py | 31 | 38 | 81.58%
Total | 56 | 63 |

Totals
Change from base Build 999: -0.1%
Covered Lines: 4038
Relevant Lines: 4222

💛 - Coveralls

coveralls avatar May 14 '19 15:05 coveralls

Hi Yue @yzhao062 Please check PR :) No more tests can be added. BR, Yahya.

John-Almardeny avatar May 14 '19 16:05 John-Almardeny

Hi @LarsNeR Sorry, but I don't get your point. Can you be more specific: are you getting a specific error, or are you asking about the logic behind it and how it works? Also, if your local dataset is not confidential, can you please share it so I can investigate further?

Kind Regards, Yahya.

John-Almardeny avatar Jul 04 '19 08:07 John-Almardeny

@John-Almardeny sorry, I deleted the first post because I thought I was wrong

I tried this on my local dataset. I noticed that calling clf.predict(xi) is not so straightforward. The categorical data of the data point xi has to be converted somehow, but the logic for that is hidden in the classifier and is not used by .predict(). Is this correct? At least I don't see it in the code, and I get two different results when I predict the original data point (with categorical data) and the manually transformed one (I used frequency encoding).

Edit: No I am not getting any error but the results are different. This is how I am calling it:

clf = HBOS(category='frequency')
clf.fit(X_train)

# Doing the encoding on my own
unique_, _counts = np.unique(X_train, return_counts=True)
freq = dict(zip(unique_, _counts / X_train.shape[0]))
func = np.vectorize(lambda x: freq.get(x))
X_train_enc = func(X_train)

clf.predict(X_train_enc[0].reshape(1, -1)) # -> array([0])
clf.predict(X_train[0].reshape(1, -1)) # -> array([1])

LarsNeR avatar Jul 04 '19 08:07 LarsNeR

@LarsNeR Yes, the logic is hidden inside the predict() function. It first checks whether the dataset is categorical or numerical. If it is the latter, it continues as normal without changing anything in the original HBOS implementation. If it is categorical, it converts it to numerical using one of the above-mentioned schemes (e.g. One-Hot Encoding, etc.) and then hands it to HBOS, which works on it as if it had been numerical from the start.

Please define what you mean by "manually transformed dataset"; it would be better if you provided a code snippet. Thanks.

John-Almardeny avatar Jul 04 '19 08:07 John-Almardeny

@John-Almardeny I just added the code. And the behaviour is the same for the example car dataset:

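from urllib.request import urlopen

import numpy as np
from pyod.models.hbos import HBOS  # HBOS with the category parameter from this PR

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"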
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",", dtype=str)

X, Y = dataset[:, range(dataset.shape[1] - 1)], \
       [1 if i in ('good', 'vgood') else 0 for i in dataset[:, dataset.shape[1] - 1]]

clf = HBOS(category='frequency')
clf.fit(X)

# Doing exactly what 'frequency' does in the fit method
unique_, _counts = np.unique(X, return_counts=True)
freq = dict(zip(unique_, _counts / X.shape[0]))
func = np.vectorize(lambda x: freq.get(x))
X_enc = func(X)

clf.predict(X[0].reshape(1,-1)) # original data, not encoded -> array([1])
clf.predict(X_enc[0].reshape(1,-1)) # encoded data -> array([0])

If I understand you correctly, I am calling it the wrong way: I should only call it with the original data point, and the predict() function does all the encoding magic, right? But then I don't understand how it can work for the other two encoders, because OneHotEncoder and LabelEncoder always call fit_transform on the new data point. Shouldn't they be fitted once (when calling fit) and then only transform? I am not saying that anything is wrong, but I am trying to understand it.
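
The question above can be illustrated with a small sketch. It only shows scikit-learn's OneHotEncoder in isolation, on made-up data; it is not the code path of this PR, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up training data and a single new observation.
X_train = np.array([["vhigh", "small"], ["low", "big"], ["med", "small"]])
x_new = np.array([["low", "small"]])

# Fit once on the training data, then only transform new points.
# The categories learned in fit() define the output columns.
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train)
consistent = encoder.transform(x_new).toarray()

# Refitting on the single new row learns only that row's categories,
# so the result has a different width and meaning than the training encoding.
refit = OneHotEncoder().fit_transform(x_new).toarray()

print(consistent.shape, refit.shape)
```

With the encoder fitted once, the new point is expressed in the same five columns as the training data; refitting on the new point alone yields only two columns, which a model fitted on the training encoding cannot interpret.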

LarsNeR avatar Jul 04 '19 09:07 LarsNeR

@LarsNeR I see now. You are right; this happens because X is transformed internally (i.e. inside predict()). The first call, clf.predict(X[0].reshape(1,-1)), predicts on the raw input ['vhigh' 'vhigh' '2' '2' 'small' 'low'], so decision_function() re-encodes that single row as if it were a whole dataset and produces the frequencies [[2. 2. 2. 2. 1. 1.]]. In contrast, clf.predict(X_enc[0].reshape(1,-1)) predicts on the correct frequencies from your manual conversion, [[0.5 0.5 0.58333333 0.58333333 0.33333333 0.83333333]], which are the same values used inside fit() for that observation.
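
The two vectors quoted above can be reproduced without the classifier. The sketch below makes one assumption: it stands in for the downloaded file with the full Cartesian product of the six attribute value sets (1728 rows), which is how the Car Evaluation data is understood to be structured. The helper name encode_by_global_frequency is hypothetical; it mirrors the manual encoding used in this thread, counting over the whole array and dividing by the number of rows.

```python
from itertools import product

import numpy as np


def encode_by_global_frequency(X):
    """Count each value over the whole array and divide by the row count."""
    unique_, counts = np.unique(X, return_counts=True)
    freq = dict(zip(unique_.tolist(), (counts / X.shape[0]).tolist()))
    return np.array([[freq[v] for v in row] for row in X.tolist()])


# Assumed attribute levels; their product stands in for car.data.
levels = [
    ["vhigh", "high", "med", "low"],  # buying
    ["vhigh", "high", "med", "low"],  # maint
    ["2", "3", "4", "5more"],         # doors
    ["2", "4", "more"],               # persons
    ["small", "med", "big"],          # lug_boot
    ["low", "med", "high"],           # safety
]
X = np.array(list(product(*levels)))

first_row = X[0].reshape(1, -1)  # ['vhigh' 'vhigh' '2' '2' 'small' 'low']

# Encoded together with the full dataset, as happens during fit().
in_context = encode_by_global_frequency(X)[0]

# Encoded on its own, as happens when a raw row is re-encoded at predict time.
alone = encode_by_global_frequency(first_row)[0]

print(in_context)
print(alone)
```

Encoded alone, 'vhigh' and '2' each occur twice in a one-row "dataset", giving 2/1 = 2, which is why the raw row lands in a completely different region of the histograms than it did during fitting.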


So, thanks very much for pointing it out. All that needs to be done is to return X from the fit() function (I will push this in a commit shortly); your example will then give the same result for both calls, as follows:

clf = HBOS(category='frequency')
X_ = clf.fit(X) # <--- here I am getting X after conversion

# Doing exactly what 'frequency' does in the fit method
unique_, _counts = np.unique(X, return_counts=True)
freq = dict(zip(unique_, _counts / X.shape[0]))
func = np.vectorize(lambda x: freq.get(x))
X_enc = func(X)

print(clf.predict(X_[0].reshape(1, -1)))  # prediction should be on X_ -> array([0])
print(clf.predict(X_enc[0].reshape(1, -1)))  # encoded data -> array([0])

Side-note: Please consider updating scikit-learn since they are changing their code base frequently: sudo -H pip install -U scikit-learn

John-Almardeny avatar Jul 04 '19 09:07 John-Almardeny

@John-Almardeny yes, that describes it really well. But that would not work for completely new data points then, right? That is, when a new data point arrives after the classifier has been fitted and we want to predict it.

LarsNeR avatar Jul 04 '19 09:07 LarsNeR

@LarsNeR No, it will not, because the frequencies are calculated from the original dataset (which might have thousands of observations and tens of categories), and a new, unseen observation might contain a new category. You can consider the existing implementation as Version 1. I will add the features you mentioned to the To-Do list.
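
One possible shape for that To-Do item is sketched below. This is a hypothetical helper, not part of PyOD or of this PR: FrequencyEncoder, its unseen_value parameter, and the toy data are all assumptions made for illustration. The idea is to store the frequency tables at fit time and look new points up in them, mapping categories that were never seen to a fixed value.

```python
import numpy as np


class FrequencyEncoder:
    """Hypothetical per-column frequency encoder with a fit/transform split."""

    def __init__(self, unseen_value=0.0):
        # Value assigned to categories that did not occur during fit().
        self.unseen_value = unseen_value

    def fit(self, X):
        X = np.asarray(X)
        self.tables_ = []
        for j in range(X.shape[1]):
            values, counts = np.unique(X[:, j], return_counts=True)
            ratios = counts / X.shape[0]
            self.tables_.append(dict(zip(values.tolist(), ratios.tolist())))
        return self

    def transform(self, X):
        X = np.asarray(X)
        return np.array([
            [self.tables_[j].get(v, self.unseen_value) for j, v in enumerate(row)]
            for row in X.tolist()
        ])


# Made-up training data; "green" never occurs in it.
X_train = [["red", "small"], ["blue", "small"], ["red", "large"], ["red", "small"]]
encoder = FrequencyEncoder().fit(X_train)

encoded_train_row = encoder.transform([["red", "small"]])
encoded_new = encoder.transform([["green", "small"]])
print(encoded_train_row, encoded_new)
```

Mapping an unseen category to a frequency of 0.0 treats it as maximally rare, which seems a natural fit for outlier scoring, but that is a design choice rather than something settled in this thread.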

Many thanks. Yahya.

John-Almardeny avatar Jul 04 '19 10:07 John-Almardeny

@John-Almardeny Okay, great. Thank you, now I understand it better too.

LarsNeR avatar Jul 04 '19 10:07 LarsNeR

Hi @yzhao062 If this is not going anywhere, please let me know so I can close it!

John-Almardeny avatar Aug 15 '21 17:08 John-Almardeny

Nah, it should be fine, but it is just too old for me to merge easily. Would you mind rebasing it onto the current development branch? We could then merge it easily. The same goes for the other PRs as well :) I am also trying to clear the pipe

yzhao062 avatar Aug 15 '21 17:08 yzhao062