ECOD behavior on new data

wawrzenczyka opened this issue 2 years ago · 4 comments

Hi, after using the ECOD algorithm as part of my research, and similarly to issue #401, I believe the current behavior of ECOD is really unintuitive. Anyone using the PyOD library expects a common interface: the `fit` method to calculate and set the classifier's parameters (especially when fitting actually performs some computation), and the `decision_function` method to return scores for any data, not only the training dataset.

Obviously, there are some benefits to merging the training and test datasets during prediction. The obtained scores are unique in each dimension, which distributes the scores better - without the concatenation, we are limited to the discrete set of score values dictated by the training dataset size.

I believe, however, that the current behavior is out of place in a general-purpose OD library. I prepared a simple example which demonstrates the underlying problem. Let's assume we train ECOD on a Gaussian cloud of 1000 points centered at (10, 0) with std=1. Now we want to evaluate ECOD on a 250x250 2D grid centered at (0, 0) and look at the (normalized between 0 and 1) outlier score at each point. We would expect the scores to be low around (10, 0) and to increase outward... Wrong.

(figure: ECOD_test_250)
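For concreteness, here is a minimal sketch of this setup (the grid extent and the normalization details are illustrative choices, not necessarily those used for the plots):

```python
import numpy as np
from pyod.models.ecod import ECOD

rng = np.random.default_rng(0)

# 1000 training points: Gaussian cloud centered at (10, 0), std = 1
X_train = rng.normal(loc=(10.0, 0.0), scale=1.0, size=(1000, 2))

# 250 x 250 evaluation grid centered at (0, 0); extent chosen for illustration
xs = np.linspace(-15, 15, 250)
ys = np.linspace(-15, 15, 250)
xx, yy = np.meshgrid(xs, ys)
X_grid = np.column_stack([xx.ravel(), yy.ravel()])

clf = ECOD()
clf.fit(X_train)

# Scores for the whole grid, min-max normalized to [0, 1] for plotting
scores = clf.decision_function(X_grid)
scores = (scores - scores.min()) / (scores.max() - scores.min())
scores = scores.reshape(xx.shape)  # e.g. plt.imshow(scores, extent=(-15, 15, -15, 15))
```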

The points in the test space, when concatenated with the training dataset, skew the learned ECDF significantly. With a high number of test samples, the training dataset stops mattering - the estimated ECDF becomes centered around (0, 0). In this case we can partially correct the behavior by decreasing the grid resolution:

(figure: ECOD_test_25)

But overall, such behavior is highly undesirable for an outlier detection algorithm - it is unintuitive, depends on the size of the test dataset, and causes the decision score of each sample to depend not only on the training dataset and the sample itself, but also on the other samples in the test set. I believe a modification of the current implementation is necessary. This can be done without changing the core philosophy of the algorithm: use the training dataset only during fitting to estimate the per-dimension ECDFs, and only evaluate those previously fitted ECDFs when calculating the decision function. In my modified implementation I applied those changes, which leads to an accurate (to the extent allowed by the low number of training samples) estimate of the original distribution in the example above, as well as a clear separation between the training and evaluation processes:

(figure: ECOD_test_250_corrected)
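To make the suggested change concrete, here is a rough, self-contained sketch of the idea (my own simplification, not the actual modified implementation linked below): the ECDFs are estimated from the training data in `fit` and only evaluated in `decision_function`, so test samples never influence each other.

```python
import numpy as np
from scipy.stats import skew

class ECODTrainOnlySketch:
    """Sketch of the proposed behavior: per-dimension ECDFs are estimated
    from the training data only in fit(), and merely evaluated on new
    samples in decision_function()."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Keep the sorted training values per dimension; the ECDFs are
        # evaluated later via binary search against these.
        self.sorted_ = np.sort(X, axis=0)
        self.n_ = X.shape[0]
        self.skew_ = skew(X, axis=0)
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        p_left = np.empty((n, d))
        p_right = np.empty((n, d))
        for j in range(d):
            col = self.sorted_[:, j]
            # Left tail: fraction of training values <= x_j;
            # right tail: fraction of training values >= x_j.
            p_left[:, j] = np.searchsorted(col, X[:, j], side="right") / self.n_
            p_right[:, j] = (self.n_ - np.searchsorted(col, X[:, j], side="left")) / self.n_
        eps = 1e-10
        o_left = -np.log(np.clip(p_left, eps, 1.0))
        o_right = -np.log(np.clip(p_right, eps, 1.0))
        # Pick the tail per the sign of the training skewness (negative skew -> left tail).
        o_auto = np.where(self.skew_ < 0, o_left, o_right)
        # Aggregate: sum the tail scores over dimensions first, then take the max.
        return np.maximum.reduce([o_left.sum(axis=1),
                                  o_right.sum(axis=1),
                                  o_auto.sum(axis=1)])
```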

In issue #401, however, it is clearly stated that the current behavior is expected. I could prepare a pull request containing my modifications (incorporating the ECOD part of issue #408 along the way), but I'm not sure whether changing the current behavior is desirable in this repository. I would appreciate any feedback from the repository owner on this.

Thanks

wawrzenczyka avatar Jul 19 '22 13:07 wawrzenczyka

So, I recently implemented the R-graph method for PyOD. R-graph also does not have trainable parameters, so each time you want to evaluate a test set you need to 're-fit' the method on the train set + test set. My way of dealing with this problem was to concatenate only a few samples of the test set with the train set at a time. See the implementation here: https://github.com/yzhao062/pyod/blob/f6029d57a2ebce88b0af03b4d32e6a77492dd5e3/pyod/models/rgraph.py#L484-L523

The user can decide how many samples should be included in each concatenation (i.e. blocksize_test_data). My opinion would be to do the same for all other methods that do not have trainable parameters.
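The pattern looks roughly like this (an illustrative sketch, not the rgraph.py code itself; `fit_and_score` stands for whatever refit-and-score routine a parameter-free method uses):

```python
import numpy as np

def score_in_blocks(fit_and_score, X_train, X_test, blocksize_test_data=64):
    """Score the test set block by block: only a small block of test samples
    is concatenated with the training set at a time, so test samples barely
    influence each other's scores.

    fit_and_score is a hypothetical callable that refits the parameter-free
    method on the given data and returns one score per row.
    """
    scores = []
    for start in range(0, len(X_test), blocksize_test_data):
        block = X_test[start:start + blocksize_test_data]
        combined = np.concatenate([X_train, block], axis=0)
        all_scores = fit_and_score(combined)
        scores.append(all_scores[len(X_train):])  # keep only the block's scores
    return np.concatenate(scores)
```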

Happy to hear other opinions.

mbongaerts avatar Sep 22 '22 09:09 mbongaerts

Thanks for this analysis, very helpful. I also agree that ideally test samples should be scored independently of each other during prediction. Whether to use train+test or train plus a single test sample can be argued, though most sklearn and PyOD methods follow the latter. If you do not mind, we could offer the different prediction approach as a hyperparameter of ECOD. It would be great if you could add a customized predict function, and we could call it based on the user's setting. From a philosophical perspective, if the test samples are already so different from the train set, then we should probably not use the train set at all. But I agree both arguments make sense. Cheers~
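Something along these lines is what I have in mind (the parameter name is purely hypothetical, not an existing ECOD argument):

```python
from pyod.models.ecod import ECOD

# Hypothetical switch, name illustrative only.
clf = ECOD(prediction_mode="train_only")    # evaluate only the ECDFs fitted on the train set
clf = ECOD(prediction_mode="concatenate")   # current behavior: refit on train + test
```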

yzhao062 avatar Sep 23 '22 12:09 yzhao062

@wawrzenczyka I'm facing the same issue. By any chance could you share your modification? Thanks heaps.

simon19891101 avatar Nov 29 '23 02:11 simon19891101

Hey, at some point I lost interest in preparing a full PR here, and apparently never committed the partial changes I had started to my fork. You can, however, look at the "quick and dirty" modified implementation of ECOD that I personally use here: https://github.com/wawrzenczyka/FOR-CTL/blob/master/ecod_v2.py. Other than the issue described here, I've incorporated an additional fix to the algorithm: in the code, the element-wise maximum was taken before summing up the tail probabilities (O_left, O_right and O_auto), which doesn't match the description in steps 6 and 7 of Algorithm 1 in the paper. I did not modify some parts of the code at all, e.g. the explain_outlier function, as I did not use it. There were also some bug fixes in this repo which I did not incorporate, e.g. this. If you want to clean up those changes and incorporate them into this repo, feel free to do so - I would be happy with the credit. If there is interest, I might take a second swing at it myself some day.
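To illustrate the aggregation difference with toy numbers (my own example, not code from either repo):

```python
import numpy as np

# Per-dimension tail scores for a single 2D sample (made-up values).
o_left  = np.array([0.5, 3.0])
o_right = np.array([2.0, 0.1])
o_auto  = np.array([0.5, 0.1])

# What the code was doing: element-wise max across the three tails, then sum.
score_elementwise_max = np.maximum.reduce([o_left, o_right, o_auto]).sum()  # 2.0 + 3.0 = 5.0

# Algorithm 1, steps 6-7: sum each tail score over dimensions first, then take the max.
score_paper = max(o_left.sum(), o_right.sum(), o_auto.sum())  # max(3.5, 2.1, 0.6) = 3.5
```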

wawrzenczyka avatar Nov 29 '23 06:11 wawrzenczyka