scikit-learn-extra
[WIP] add Voronoi Isolation Forest implementation
Thanks in advance to all reviewers who will spend their time reading this pull request.
What?
I have implemented Voronoi Isolation Forest. It is an Anomaly Detection algorithm based on the isolation approach, similar to the already present sklearn.ensemble.IsolationForest. In short, an ensemble of trees is constructed, each one representing a nested Voronoi tessellation. The average depth is used to compute the anomaly scores of the samples. A description of the algorithm can be found in the following article, recently accepted at the 25th International Conference on Pattern Recognition (ICPR2020).
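To make the idea concrete, here is a minimal pure-Python sketch of isolation by nested Voronoi cells. It is not the code in this pull request: the function names (`build_tree`, `path_depth`, `average_depth`), the dictionary-based tree, the branching factor of 2, and the depth cap are all assumptions made for illustration. Subsampling and score normalization are left out. The distance function is a parameter, so any metric can be plugged in.

```python
import math
import random


def build_tree(points, rng, dist=math.dist, branching=2, depth=0, max_depth=12):
    """Recursively split `points` into Voronoi cells around random seeds."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # leaf: nothing left to split
    seeds = rng.sample(points, min(branching, len(points)))
    cells = [[] for _ in seeds]
    for p in points:
        # A point belongs to the cell of its nearest seed under `dist`.
        nearest = min(range(len(seeds)), key=lambda j: dist(p, seeds[j]))
        cells[nearest].append(p)
    children = [
        build_tree(cell, rng, dist, branching, depth + 1, max_depth)
        for cell in cells
    ]
    return {"seeds": seeds, "children": children}


def path_depth(x, node, dist=math.dist, depth=0):
    """Depth at which `x` reaches a leaf when routed to the nearest seed."""
    if "seeds" not in node:
        return depth
    seeds = node["seeds"]
    nearest = min(range(len(seeds)), key=lambda j: dist(x, seeds[j]))
    return path_depth(x, node["children"][nearest], dist, depth + 1)


def average_depth(train, queries, n_trees=100, dist=math.dist, seed=0):
    """Average leaf depth of each query over an ensemble of random trees."""
    rng = random.Random(seed)
    trees = [build_tree(train, rng, dist) for _ in range(n_trees)]
    return [sum(path_depth(q, t, dist) for t in trees) / n_trees for q in queries]


# Toy data: a dense 2-D Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
train = cluster + [(8.0, 8.0)]

# A shallower average depth means the point is isolated earlier,
# i.e. it looks more anomalous.
inlier_depth, outlier_depth = average_depth(train, [(0.0, 0.0), (8.0, 8.0)])
```

Passing a different `dist`, for example `lambda a, b: sum(abs(u - v) for u, v in zip(a, b))` for the Manhattan distance, changes the shape of the cells without touching the rest of the sketch.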
Why?
Voronoi Isolation Forest has a greater breadth of applicability compared to Isolation Forest, as it solves two of its main problems:
- Voronoi Isolation Forest is not constrained to axis-parallel splits, whereas Isolation Forest is.
- Voronoi Isolation Forest is able to work with any metric (also learned metrics/kernels), while Isolation Forest is intrinsically meant to be used in Euclidean spaces.
Additional comments
I believe that this algorithm can benefit many people interested in Data Mining, both for work and for passion. I am available for any code correction and to produce any documentation (including demos) to allow users to grasp the usefulness and applicability of the algorithm.
Thanks again for your attention.
Thank you for this contribution @ineveLoppiliF !
A few comments:
- ci/circleci: lint is currently failing because the black auto-formatter needs to be applied to the code.
> and to produce any documentation (including demos) to allow users to grasp the usefulness and applicability of the algorithm.
Yes, we would indeed need a section in the user manual describing this algorithm, and a motivating example (e.g. examples/plot_voronoi_iforest.py) comparing results with IsolationForest, to be able to merge this PR.
We would also need unit tests (under sklearn_extra/ensemble/test_voronoi_iforest.py); see for instance sklearn/ensemble/tests/test_iforest.py. Please also add the estimator to https://github.com/scikit-learn-contrib/scikit-learn-extra/blob/0a2615cb20de822940edc2184ac80929fc90f93f/sklearn_extra/tests/test_common.py#L22 so that it is included in the estimators checked by the common tests.
How does the runtime performance compare with that of IsolationForest in scikit-learn?
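One way to answer this would be a small timing harness along the following lines. This is only a sketch: the helper names `time_fit_and_score` and `compare_with_isolation_forest` are hypothetical, the data is synthetic Gaussian noise, and it assumes the new estimator exposes `fit` and `score_samples` like IsolationForest does. The class name of the estimator in this PR is not assumed; an instance is passed in as `candidate`.

```python
import time


def time_fit_and_score(estimator, X, repeats=3):
    """Return the best wall-clock time in seconds for fit + score_samples."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        estimator.fit(X)
        estimator.score_samples(X)
        best = min(best, time.perf_counter() - start)
    return best


def compare_with_isolation_forest(candidate, n_samples=10_000, n_features=10):
    """Time `candidate` against scikit-learn's IsolationForest on random data."""
    # Imported lazily so the timing helper above works without these packages.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    X = np.random.default_rng(0).normal(size=(n_samples, n_features))
    return {
        "IsolationForest": time_fit_and_score(IsolationForest(random_state=0), X),
        type(candidate).__name__: time_fit_and_score(candidate, X),
    }
```

Calling `compare_with_isolation_forest(estimator)` with an instance of the new estimator returns a dictionary of timings; repeating it for a few values of `n_samples` would show how each method scales.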