
Data set choice: pay attention to using unbalanced data

kno10 opened this issue 2 years ago • 3 comments

Data sets with 50% anomalies are not anomaly detection!

More data sets do not mean more meaningful results, because "garbage in, garbage out". One of the big problems with current anomaly detection research is that we do not use good data sets to evaluate results. Hence everything works sometimes, by chance, and little systematic benefit is observable, because the labels in these data sets are not proper anomaly labels. I am by now convinced that you cannot draw meaningful conclusions from most of the commonly used data sets because of unsuitable labeling.

kno10 avatar Jun 25 '22 13:06 kno10

Dear Prof. Schubert, thanks for the note. Big fan of your work, e.g., ELKI and the DAMI evaluation.

We fully agree with the quality issues in current outlier research, although among the 55 datasets, parkinson is the only one that is not below 50% outliers. We debated whether to include it and ultimately decided to follow tradition and keep it.

It should not affect the results too much for a few reasons:

  1. one dataset will not play a major role in the analysis
  2. (again) the conclusions are mainly based on the statistical analysis
  3. we also add 7 new datasets (all with 5% anomalies) to enrich the testbed

We are happy to remove parkinson and redo the analysis in the next revision. Thanks again for the heads-up. Please also share any additional thoughts on the paper (https://www.andrew.cmu.edu/user/yuezhao2/papers/22-preprint-adbench.pdf). Appreciate it!
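As a minimal, hypothetical sketch of the kind of screening being discussed here (not ADBench code; it assumes each dataset is stored as an `.npz` file with a binary label array `y` where 1 marks an anomaly, and the 5% threshold is just an example), one could flag datasets whose anomaly ratio exceeds a chosen threshold before running any benchmark:

```python
# Hypothetical screening sketch: flag datasets whose anomaly ratio exceeds a
# chosen threshold. Assumes each dataset is an .npz file with a binary label
# array "y" where 1 = anomaly (file layout is illustrative, not ADBench's).
import glob

import numpy as np

MAX_ANOMALY_RATIO = 0.05  # illustrative threshold, not an ADBench default

for path in sorted(glob.glob("datasets/*.npz")):
    y = np.load(path)["y"]
    ratio = float(np.mean(y == 1))
    if ratio > MAX_ANOMALY_RATIO:
        print(f"{path}: {ratio:.1%} anomalies -- consider excluding or subsampling")
```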

yzhao062 avatar Jun 25 '22 13:06 yzhao062

Over 50% would be even weirder, but in fact anything beyond 5% is unrealistic! If you used anomaly detection in a business setting and had even 5% anomalies in your measurements, your system would be unusable. In a realistic scenario you will likely have <1% or <0.1% anomalies (and far fewer when considering time series as individual points: at a measurement rate of 100 Hz, 1% anomalies would mean one anomaly per second), if you consider that an anomaly may mean the machine failed or the resulting product is defective. To be useful in practical applications, the methods must be evaluated in a scenario where anomalies are very rare.

And class labels are usually unsuitable for outlier detection. Classes themselves may have outliers, and a rare class need not consist of outliers: it can be a highly concentrated blob, or it can be undetectable because the class is distributed everywhere. Classification data must only be used with care for anomaly detection, and with close human inspection of the results. Outlier detection researchers must stop treating data as a black box and caring only about the evaluation score; instead they must verify that the proposed methods can solve a real problem, not just overfit some benchmark data with parameter tuning.

I only had a brief look, and in my opinion the paper essentially shows that we can *not* learn much from this benchmark. Almost all methods are tied within the critical difference; i.e., outlier benchmarks need to be improved in quality, not quantity. For example, there is the huge repository of "outlier" data sets at Monash University: https://researchdata.edu.au/datasets-outlier-detection/1370700 with 12338 datasets (missing from your table 1; e.g., "On normalization and algorithm selection for unsupervised outlier detection", Sevvandi Kandanaarachchi, Mario A. Munoz, Rob J. Hyndman, Kate Smith-Miles). In my opinion it suffers from the same problem: the majority of data sets are "distractors", and the entire signal gets lost by adding all those automatically collected data sets, so in the end we do not learn that much about the algorithms, but rather that our data set collection is unsuitable.

What we need is an "ImageNet moment" for anomaly detection: someone who labels a lot of data for actual rare anomalies, instead of abusing data from other domains, downsampling it, and then just looking at the resulting score while ignoring the real task for this data set.
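To make the "rare anomalies" point concrete, here is a minimal sketch (illustrative only, not from ADBench or any specific paper) of downsampling the anomaly class of a labeled dataset to a realistic contamination rate before evaluation; `X`, `y`, and `downsample_anomalies` are hypothetical names, and the 1% default is just an example:

```python
# Illustrative sketch: keep all normal points and only enough anomalies so that
# the contamination of the resulting dataset is roughly target_ratio.
import numpy as np


def downsample_anomalies(X, y, target_ratio=0.01, seed=0):
    """Subsample the anomaly class (y == 1) to reach target_ratio contamination."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == 0)
    anomaly_idx = np.flatnonzero(y == 1)
    # Solve k / (len(normal_idx) + k) ~= target_ratio for the anomaly count k.
    n_keep = int(round(target_ratio * len(normal_idx) / (1.0 - target_ratio)))
    n_keep = min(n_keep, len(anomaly_idx))
    kept = rng.choice(anomaly_idx, size=n_keep, replace=False)
    idx = np.concatenate([normal_idx, kept])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Note that such downsampling only controls the contamination rate; it does not address the deeper concern above that class labels may not correspond to actual anomalies.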

kno10 avatar Jun 25 '22 17:06 kno10

That is a great point! That is exactly why we picked 5% as the threshold for the new datasets (although they are still adapted).

Regarding the absence of more realistic datasets, this is a long-standing problem. It may be even worse for other data types like graphs, where researchers often inject synthetic anomalies to detect. It would be nice if we could collaboratively design some large-scale benchmark datasets. However, it is not an easy task. In my personal experience with multiple Canadian banks, both the data and the labels are hard to acquire, let alone release for research.

Let me also add some additional thoughts. First, outliers may be a collection of interesting samples, so the exact percentage is hard to define. 5% sounds nice, but it can vary, and maybe 10% also happens in some real-world applications. Second, the number of benchmark datasets still helps, since many of these adapted datasets have natural semantic meaning. For instance, binary classification of a rare disease often makes sense in the outlier detection setting. With a large number of datasets, we can run statistical analysis, though I agree more consideration should be put into grouping different datasets. Another reason for considering all these datasets is that, since many outlier papers are based on a subset of them, our paper reveals that none of these unsupervised methods is actually statistically better. Of course, other interesting points on different perspectives of OD are also discussed in the paper. Happy to see more thoughts on them :)
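As a rough illustration of the kind of statistical analysis mentioned above (a sketch only; the scores below are made up and this is not the paper's exact procedure), a Friedman test over a datasets-by-methods score matrix checks whether any detector is statistically better:

```python
# Illustrative sketch: Friedman test on hypothetical AUC-ROC scores,
# rows = datasets, columns = methods (e.g., three unsupervised detectors).
import numpy as np
from scipy.stats import friedmanchisquare

scores = np.array([
    [0.81, 0.79, 0.80],
    [0.66, 0.70, 0.68],
    [0.90, 0.88, 0.91],
    [0.75, 0.74, 0.73],
])

stat, p_value = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p_value:.3f}")
# A large p-value means we cannot reject that all methods perform equally well;
# if p were small, a post-hoc test (e.g., Nemenyi) would locate the differences.
```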

Again, I want to say thanks for your work, since a considerable number of datasets in ADBench are based on your DAMI repository and the meta-analysis paper by OSU (https://ir.library.oregonstate.edu/concern/datasets/47429f155?locale=en). Honestly, I think nearly all outlier detection benchmarks are based on these datasets for now, and their value is still high.

I do wish we could collaborate on building more "ImageNet"-like benchmark datasets for outlier detection, whether based on industry support or on more careful review and selection of existing datasets. I feel many of the existing ones are still suited for outlier detection, given that the definition of outliers is naturally vague.

All in all, thanks for the interesting discussion, and I hope we can collaborate on something in this direction. Large-scale analysis and benchmarking indeed lay the foundation for fair comparison and model selection. Some of our recent papers work toward this (https://openreview.net/forum?id=KCd-3Pz8VjM) :)

yzhao062 avatar Jun 25 '22 17:06 yzhao062