IndexError for COF
Dear contributor(s),
I am trying to run the code below and an IndexError occurs (see the traceback below).
The code works when n_neighbors = 49, which suggests that the maximum number of neighbors somehow depends on the size of the test set. This behavior seems odd to me, since I would expect the input parameters to depend only on the train set. Could you clarify this behavior?
Thanks
from scipy.stats import multivariate_normal
from pyod.models.cof import COF

mn = multivariate_normal(mean=[0, 0], cov=[[1, 0.75], [0.75, 1]], seed=1)
X_train = mn.rvs(1000)
X_test = mn.rvs(50)

clf = COF(contamination=0.05, n_neighbors=100, method='fast')
clf.fit(X_train)
scores_query_samples = clf.decision_function(X_test)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-94-d720bf6b3fce> in <module>
8 clf.fit(X_train)
9
---> 10 scores_query_samples = clf.decision_function(X_test)
11
12
~../pyod/models/cof.py in decision_function(self, X)
143 """
144 if self.method.lower() == "fast":
--> 145 return self._cof_fast(X)
146 elif self.method.lower() == "memory":
147 return self._cof_memory(X)
~../pyod/models/cof.py in _cof_fast(self, X)
199 for j in range(self.n_neighbors_):
200 cost_desc.append(
--> 201 np.min(dist_matrix[sbn_path[j + 1]][sbn_path][:j + 1]))
202 acd = []
203 for _h, cost_ in enumerate(cost_desc):
IndexError: index 50 is out of bounds for axis 0 with size 50
This behaviour is actually correct.
When you write clf = COF(contamination=0.05, n_neighbors=100, method='fast'), you basically ask the algorithm to "consult" exactly 100 data samples to make the decision. That becomes its new norm.
Now, when you pass only 50 data samples to be tested against this new norm, it is expected that you get an IndexError.
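For what it's worth, here is a small sketch of that constraint, reusing the same data setup as the snippet at the top of the thread: with method='fast', n_neighbors has to stay below the number of samples passed to decision_function.

from scipy.stats import multivariate_normal
from pyod.models.cof import COF

mn = multivariate_normal(mean=[0, 0], cov=[[1, 0.75], [0.75, 1]], seed=1)
X_train = mn.rvs(1000)
X_test = mn.rvs(50)

# 49 neighbors < 50 test samples: scoring succeeds
clf = COF(contamination=0.05, n_neighbors=49, method='fast')
clf.fit(X_train)
scores = clf.decision_function(X_test)

# 100 neighbors >= 50 test samples: raises the IndexError from the traceback above
clf = COF(contamination=0.05, n_neighbors=100, method='fast')
clf.fit(X_train)
# clf.decision_function(X_test)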
John-Almardeny, thank you for your response. I will dive deeper into this algorithm to understand how it works. It just feels counter-intuitive, since most algorithms use settings that depend only on the train set; for new points (the test set), the expectation is that only data points from the train set are "consulted". But I guess this does not hold for COF.
I see your point. Unlike most of the other algorithms in PyOD, the current implementation of COF does not have learning parameters. If you think there can be, then please feel free to submit a PR.
Thanks.
I think I see the 'problem' with this implementation, and it seems a bit odd to me.
During fitting, the model calls self.decision_function(X), where, let's say, X is now X_train:
https://github.com/yzhao062/pyod/blob/bd4905c92269a9d4f14e7ab19b3377a1d78ba9f5/pyod/models/cof.py#L122
Now, I would expect the model to be 'fitted' based on the data points in X_train only, which is even stated in this line:
https://github.com/yzhao062/pyod/blob/bd4905c92269a9d4f14e7ab19b3377a1d78ba9f5/pyod/models/cof.py#L128
However, a closer look at the code reveals that calling decision_function(X) after fitting simply re-runs the whole process on the new input X (the test set). In other words, calling fit() has no additional meaning.
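If that reading is correct, a quick way to check it (just a sketch, reusing the same kind of data as above): two detectors fitted on completely different training sets should return identical scores for the same test set.

import numpy as np
from scipy.stats import multivariate_normal
from pyod.models.cof import COF

mn = multivariate_normal(mean=[0, 0], cov=[[1, 0.75], [0.75, 1]], seed=1)
X_test = mn.rvs(50)

# Two detectors fitted on very different training sets
clf_a = COF(contamination=0.05, n_neighbors=10, method='fast')
clf_a.fit(mn.rvs(1000))
clf_b = COF(contamination=0.05, n_neighbors=10, method='fast')
clf_b.fit(mn.rvs(200))

# If decision_function re-runs COF on X_test alone, the fitted training data
# is ignored and both detectors should produce the same scores
print(np.allclose(clf_a.decision_function(X_test),
                  clf_b.decision_function(X_test)))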
I see the same happening for the SOD algorithm: https://github.com/yzhao062/pyod/blob/3336c9dfe9ebaf7df685ac81316a4cb79d6abbd8/pyod/models/sod.py
I would be happy to receive some clarification on this.
Kind regards
Yeah, that's exactly what I'm saying. We're not saving any learning parameters. COF and SOD are implemented correctly as in their corresponding research papers, where there is mostly no concept of fit then predict, yet PyOD follows the Scikit-Learn style with fit & predict. fit is supposed to save learning parameters that have been learned from the training set, to be used later when decision_function is called, and this is not done in the aforementioned algorithms.
It was the same in the ROD algorithm (I am the author of its paper), and I refactored it radically in PyOD to make it follow the convention here - you can have a look at ROD, by the way, to see a full example of how fit & decision_function should work.
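To make the convention concrete, here is a toy sketch (illustrative only, not PyOD's actual ROD or COF code) of a detector where fit() really does store state learned from the training set, and decision_function() scores new points against that stored state:

import numpy as np
from sklearn.neighbors import NearestNeighbors

class ToyKNNDetector:
    # Illustrative only: fit() learns from the training set,
    # decision_function() uses what was learned
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X):
        # Learning step: build a neighbor index from the training set and keep it
        self.neighbors_ = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        return self

    def decision_function(self, X):
        # Scoring step: average distance of each test point to its
        # nearest neighbors in the *training* set
        dist, _ = self.neighbors_.kneighbors(X)
        return dist.mean(axis=1)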
Again, if you can see learning parameters in COF and/or SOD, please feel free to submit a PR.
Regards, Yahya.
Thank you for your response, John-Almardeny, and thanks for the clarification. I think it is important for users (such as myself) to understand this behavior, since the scikit-learn styling in this case 'tricks' potential users into thinking that the train and test sets are separate, and that there are indeed learning parameters.
I would suggest at least changing some text in the docstrings: https://github.com/yzhao062/pyod/blob/bd4905c92269a9d4f14e7ab19b3377a1d78ba9f5/pyod/models/cof.py#L128
Here I would at least remove the part 'fitted detector', since this is not the case. I hope you agree with me?