pyod icon indicating copy to clipboard operation
pyod copied to clipboard

IndexError for COF

Open mbongaerts opened this issue 2 years ago • 6 comments

Dear contributor(s),

I am trying to run the code below and an index error occurs (see below).

The code works when n_neighbors = 49 which tells me that for some reason the maximum number of neighbors depends on the size of the test set that is used. I found this weird behavior, since I expect the input parameters to only depend on the train set. Could you clarify this behavior?

Thanks

from pyod.models.cof import COF

mn = multivariate_normal(mean=[0,0], cov = [[1,0.75],[0.75,1]], seed=1)
X_train =  mn.rvs(  1000 ) 
X_test =  mn.rvs(  50 ) 

clf = COF(contamination=0.05, n_neighbors = 100, method='fast')
clf.fit(X_train)
scores_query_samples = clf.decision_function(X_test) 

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-94-d720bf6b3fce> in <module>
      8 clf.fit(X_train)
      9 
---> 10 scores_query_samples = clf.decision_function(X_test)
     11 
     12 

~../pyod/models/cof.py in decision_function(self, X)
    143         """
    144         if self.method.lower() == "fast":
--> 145             return self._cof_fast(X)
    146         elif self.method.lower() == "memory":
    147             return self._cof_memory(X)

~../pyod/models/cof.py in _cof_fast(self, X)
    199             for j in range(self.n_neighbors_):
    200                 cost_desc.append(
--> 201                     np.min(dist_matrix[sbn_path[j + 1]][sbn_path][:j + 1]))
    202             acd = []
    203             for _h, cost_ in enumerate(cost_desc):

IndexError: index 50 is out of bounds for axis 0 with size 50

mbongaerts avatar Apr 26 '22 12:04 mbongaerts

This behaviour is very correct. When you write clf = COF(contamination=0.05, n_neighbors = 100, method='fast'), you basically ask the algorithm to "consult" exactly 100 data samples to make the decision. This will be its new norm. Now when you pass only 50 data samples to be tested according to this new norm, it's normal that you get IndexError!.

John-Almardeny avatar Apr 26 '22 13:04 John-Almardeny

John-Almardeny, thank you for you response. I will dive more into this algorithm to understand how it works. It only feels counter-intuitive since most algorithms use settings that only depend on the train set; for new points (test set) it is expected that only data points from the train set are "consulted". But I guess for COF this behavior does not hold.

mbongaerts avatar Apr 27 '22 08:04 mbongaerts

I see your point. Unlike most of the other algorithms in PyOD, the current implementation of the COF does not have learning parameters. If you think there can be, then please feel free to submit a PR. Thanks.

John-Almardeny avatar Apr 27 '22 12:04 John-Almardeny

I think I see the 'problem' with this implementation, and the implementation is a bit weird to me.

During fitting the model the self.decision_function(X) is called, where lets say X is now X_train : https://github.com/yzhao062/pyod/blob/bd4905c92269a9d4f14e7ab19b3377a1d78ba9f5/pyod/models/cof.py#L122

Now, I expect the model to be 'fitted' based on data points in X_train only, which is even stated in this line: https://github.com/yzhao062/pyod/blob/bd4905c92269a9d4f14e7ab19b3377a1d78ba9f5/pyod/models/cof.py#L128

However, a closer look at the code reveals that when you just call decision_function(X) after fitting, it will just re-run the whole process on the new input X (test set). In other words, calling fit() has no additional meaning.

I see the same happening for the SOD algorithm: https://github.com/yzhao062/pyod/blob/3336c9dfe9ebaf7df685ac81316a4cb79d6abbd8/pyod/models/sod.py

I am happy to recieve some clarifications on this.

Kind regards

mbongaerts avatar Apr 28 '22 08:04 mbongaerts

Yeah, that's exactly what I'm saying. We're not saving any learning parameters. COF and SOD are implemented correctly as in their corresponding research papers where mostly there is no concept of fit then predict, yet PyOD follows the Scikit-Learn style with fit & predict. The fit requires saving learning parameters that have been learned from the Training-set then to be used when the decision_function is called, and which is not done in the aforementioned algorithms. It was the same in the ROD algorithm (I am the author of its paper), and I refactored it radically in PyOD to make it follow the convention here - you can have a look at ROD by the way to see a full example of how fit & decision_function should be.

Again, if you can see learning parameters in COF or/and SOD, please feel free to submit a PR.

Regards, Yahya.

John-Almardeny avatar Apr 28 '22 13:04 John-Almardeny

Thank you for your response John-Almardeny, and thanks for the clarification. I think it is important for users to understand this behavior (such as me), since the scikit-learn styling in this case 'tricks' potential users in thinking that train and test test are separate, and that there are indeed learning parameters.

I would suggest at least to change some text in the docstrings: https://github.com/yzhao062/pyod/blob/bd4905c92269a9d4f14e7ab19b3377a1d78ba9f5/pyod/models/cof.py#L128

Were I would at least remove the part 'fitted detector', since this is not the case. I hope you would agree with me?

mbongaerts avatar Apr 29 '22 06:04 mbongaerts