
[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification

Open EvgeniDubov opened this issue 5 years ago • 32 comments

Reference Issue

[sklearn] Feature Request: Hellinger split criterion for classification trees #9947

What does this implement/fix? Explain your changes.

Hellinger distance as a tree split criterion: a Cython implementation compatible with sklearn tree-based classification models.
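
A minimal usage sketch of how such a criterion is meant to be consumed; the import path and constructor details below are assumptions for illustration, not the final API of this PR:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical import path; the criterion is a compiled Cython extension, so
# the exact module location may differ in the merged package.
from imblearn.tree_split import HellingerDistanceCriterion

# A toy imbalanced binary problem.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# sklearn's ClassificationCriterion subclasses are constructed with the number
# of outputs and the per-output class counts (here: 1 output, 2 classes).
hdc = HellingerDistanceCriterion(1, np.array([2], dtype=np.intp))

# sklearn's tree-based models accept a Criterion instance in place of the
# usual "gini"/"entropy" strings, which makes the criterion a drop-in option.
clf = RandomForestClassifier(criterion=hdc, n_estimators=100, random_state=0)
clf.fit(X, y)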

Any other comments?

This is my first submission, so apologies in advance for the many things I may have missed. Looking forward to your feedback.

EvgeniDubov avatar Jul 11 '18 19:07 EvgeniDubov

Hello @EvgeniDubov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2019-12-23 09:29:09 UTC

pep8speaks avatar Jul 11 '18 19:07 pep8speaks

Codecov Report

Merging #437 into master will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #437   +/-   ##
=======================================
  Coverage   98.83%   98.83%           
=======================================
  Files          86       86           
  Lines        5317     5317           
=======================================
  Hits         5255     5255           
  Misses         62       62

Continue to review full report at Codecov.


codecov[bot] avatar Jul 11 '18 20:07 codecov[bot]

Cool. I am looking forward to this contribution. Could you make a quick todo list in the first summary?

glemaitre avatar Jul 11 '18 21:07 glemaitre

From what I see:

  • [ ] We have not had any Cython up to now, so you will need to set up the project for it. You can refer to https://github.com/jakevdp/cython_template
  • [ ] You have to solve the issues raised by PEP 8.
  • [ ] I think that we need to improve the example with some narrative documentation.
  • [ ] We need a section in the User Guide so that people can find out about this feature and in which cases to use it. It could probably be embedded in the same section as the BalancedBaggingClassifier.
  • [ ] You can add a what's new entry as well.

glemaitre avatar Jul 11 '18 21:07 glemaitre

I've pushed the code for Cython build support, and all the automatic checks failed. LGTM failed because my implementation assumes the existence of sklearn/tree/_criterion.pxd and it is missing. AppVeyor and Travis failed because the Cython package is not installed. I assume they will fail on the missing .pxd file too once the Cython dependency is resolved.

Please let me know whether there is a way for me to configure these tools, or whether they are administered by the maintainers only.

EvgeniDubov avatar Jul 16 '18 12:07 EvgeniDubov

@EvgeniDubov I am getting some time to look at this PR. I will probably make a force push to use a different build system for the Cython code, but this is not the most important thing.

I was looking at the literature and the original paper. I did not find a clear statement on how to compute the distance in a multiclass problem, which is actually supported by the trees in scikit-learn.

@EvgeniDubov @DrEhrfurchtgebietend do you have a reference for multiclass?

glemaitre avatar Aug 30 '18 13:08 glemaitre

I have not done much multi-class classification. I do not even know how it is implemented with the traditional split criterion. Is it possible to set this up to only work for binary classification? Can we release without solving this?

Gitman-code avatar Aug 30 '18 18:08 Gitman-code

@glemaitre indeed, sklearn's 'gini' and 'entropy' support multiclass, but Hellinger requires some modification to support it. Here is a quote from the abstract of a paper on this subject:

In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT), and develop a new multi-class splitting criterion. From our experiments we show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.

I can contribute it as a separate Cython implementation, preferably in a separate PR.
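
For reference, the binary criterion under discussion (Cieslak & Chawla, 2008) only compares the two class-conditional distributions induced by a split, which is why the multi-class case needs the decomposition described in that paper. Below is a pure-Python sketch of the binary split value; the PR itself computes this on weighted class counts inside sklearn's Cython Criterion API:

import numpy as np

def hellinger_split_value(pos_left, neg_left, pos_total, neg_total):
    # Hellinger distance between the class-conditional distributions of a
    # binary split (Cieslak & Chawla, 2008). The class priors never enter the
    # formula, which is what makes the criterion skew-insensitive.
    tpr = pos_left / pos_total  # fraction of positives sent to the left child
    fpr = neg_left / neg_total  # fraction of negatives sent to the left child
    return np.sqrt((np.sqrt(tpr) - np.sqrt(fpr)) ** 2
                   + (np.sqrt(1.0 - tpr) - np.sqrt(1.0 - fpr)) ** 2)

# The candidate split with the largest distance is preferred.
print(hellinger_split_value(pos_left=90, neg_left=100, pos_total=100, neg_total=10000))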

EvgeniDubov avatar Sep 03 '18 11:09 EvgeniDubov

Oh nice, I did not look at this paper, only the 2009 and 2011 ones. It seems that I missed it.

I can contribute it as a separate Cython implementation, preferably in a separate PR.

I think that we should have the multi-class criterion directly. The reason is that we don't have a mechanism for raising an error if the criterion is used for a multiclass problem. However, it seems quite feasible to implement Algorithm 1 in the paper that you attached.

Regarding the Cython file, could you take all the Cython setup from https://github.com/glemaitre/imbalanced-learn/commit/27fffea721fe1edafc50d1cfc65d1b857292b6c1 and just paste your criterion file and tests at the right location (plus documentation)? I prefer to stay closer to that Cython setup (basically the major change is about the package naming).
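
For readers following along, here is a generic sketch of what wiring a Cython criterion into the build involves; module names and paths are placeholders, and the actual setup lives in the commit linked above:

# setup.py (excerpt) -- minimal sketch, not the setup from the linked commit.
import numpy as np
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension(
        "imblearn.tree_split._hellinger_criterion",       # placeholder name
        ["imblearn/tree_split/_hellinger_criterion.pyx"],  # placeholder path
        include_dirs=[np.get_include()],  # the criterion uses the NumPy C API
    )
]

setup(
    name="imbalanced-learn",
    ext_modules=cythonize(extensions, language_level=3),
)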

glemaitre avatar Sep 03 '18 12:09 glemaitre

I am curious whether we need to do something specific about how feature importance will be calculated once this change is done. There are two questions here. First, does the standard method of summing the improvement in the criterion really generalize to all criteria? I think the answer is yes, but even so it might not be the definition we actually want. In an imbalanced case we would in theory have imbalanced features (i.e. nearly all the same value) which, if important, would be used high in the tree but not frequently. This would result in a low weight under the current definition. Would the average gain per use, instead of the total gain across all uses, be a better definition? To limit discussion here I put this into a SO post.

Gitman-code avatar Sep 10 '18 22:09 Gitman-code

There are three points to consider:

  • The feature importance at a node is normalized by the weighted number of samples at that node.
  • The information gain at the top of the tree will carry more weight than at the bottom of the tree, so a feature used at the top might come out as more important than a feature used several times at the bottom, even if the information gain is actually lower.
  • The feature importance computed in the tree is a biased estimator (http://explained.ai/rf-importance/index.html), and there is probably nothing to do about that apart from doing a permutation test (see the sketch below).
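
To make the permutation point concrete, here is a sketch using scikit-learn's permutation_importance (added in scikit-learn 0.22, later than most of this discussion); the scoring choice is only an illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Average precision keeps the importances sensitive to the minority class,
# which plain accuracy would largely ignore on imbalanced data.
result = permutation_importance(clf, X_test, y_test,
                                scoring="average_precision",
                                n_repeats=10, random_state=0)
print(result.importances_mean)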

glemaitre avatar Sep 11 '18 08:09 glemaitre

Thanks for the feedback @glemaitre .

So you agree that different split criteria could be used to calculate feature importance in general? Intuitively, if a criterion is used to build the tree, it makes sense to use it to define importance.

The weighting of the importance by the number of samples at the node was sort of what got me thinking down this path. Hellinger distance is designed to be less sensitive to the number of samples, but I think that only comes into play when finding the split.

Permutation feature importance is a great method. I see that there are discussions about moving it to sklearn.

The purpose of thinking about feature importance in this way is to make sure one does not eliminate features that are unimportant in general but crucial in a few rare outlier cases. When doing feature selection it is easy to justify dropping such features when looking at aggregate metrics like RMSE, since changes to only a few predictions alter it by a tiny amount. Permutation feature importance would not be sensitive to this either, or at least only as sensitive as the evaluation metric is to such outliers. Do you know of any standard metric for identifying features of this type? Sorry, this has gotten a little off topic.

Gitman-code avatar Sep 11 '18 18:09 Gitman-code

Hello,

I have also done some investigation into using the Hellinger distance as a decision tree impurity criterion. I have my own Cython implementation incorporated into a fork of sklearn, along with published experimental results using it as a baseline (a direct comparison of my implementation with existing results on data sets has shown it working exactly as in Chawla, 2008).

My Cython implementation in _criterion.pyx from last March can be found here if it can be of any help: Fork

Please reach out if you have any questions or need help on incorporating the criterion.

Best, Andriy

AndriyMulyar avatar Oct 02 '18 19:10 AndriyMulyar

@EvgeniDubov I see that you made some changes. Could you avoid introducing new lines in the file that are not related to the PR (even if it is for PEP 8)?

glemaitre avatar Oct 02 '18 20:10 glemaitre

@AndriyMulyar The current implementation is also fine. The only issue is that, as in your fork, the criterion does not work for multi-class. However, the paper pointed to by @EvgeniDubov discusses how to generalize the criterion to the multi-class classification case.
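
Such a guard could be as simple as the following; this is a hypothetical helper shown only to illustrate the idea, not code from either fork:

import numpy as np

def check_binary_target(y, criterion_name="hellinger"):
    # Reject multi-class targets before handing a binary-only criterion to
    # the tree builder.
    n_classes = np.unique(y).shape[0]
    if n_classes > 2:
        raise ValueError(
            "The '%s' criterion currently supports only binary classification;"
            " got %d classes." % (criterion_name, n_classes)
        )
    return y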

glemaitre avatar Oct 02 '18 20:10 glemaitre

@glemaitre You are right - I have read the paper. I inserted a check into tree.py to resolve this. If I could bring up one consideration regarding including the multi-class implementation as opposed to simply the binary implementation:

In the section titled 'Analysis of the Splitting Criteria', immediately following the algorithmic description, the authors discuss the rather dismal relative performance of the Hellinger distance as an impurity metric in the multi-class setting. The conclusion claims that 'While MC-HDDT does not statistically significantly outperform C4.4, it is a reasonable alternative to C4.4 for all classification problems', but this is a rather optimistic statement.

Something to consider.

Best, Andriy

AndriyMulyar avatar Oct 02 '18 20:10 AndriyMulyar

@glemaitre sure, I was merging the Cython build code from your branch and got carried away. Please let me know what additional changes I should make to help get this PR ready for merge.

EvgeniDubov avatar Oct 06 '18 17:10 EvgeniDubov

@EvgeniDubov The CI is not happy, if you can check that.

glemaitre avatar Oct 07 '18 18:10 glemaitre

Something to consider.

Until we also propose something that works for multi-class, we will be fine.

glemaitre avatar Oct 07 '18 18:10 glemaitre

@glemaitre I am trying to make the CI happy but having difficulties with the ClassificationCriterion exported from sklearn. I saw you've merged the export in sklearn, so I posted a comment for you in the relevant sklearn PR conversation, #10325.

EvgeniDubov avatar Oct 09 '18 06:10 EvgeniDubov

Hmm, it seems fishy. When reviewing the PR there I did test on my development environment, which included this file. I think that including the file in the manifest is not enough. Checking scipy, it seems that they add it in setup.py. I opened an issue to confirm my intuition:

https://github.com/scikit-learn/scikit-learn/issues/12363

glemaitre avatar Oct 12 '18 08:10 glemaitre

@glemaitre I fixed most of the CI issues. I am left with probably one last issue, in AppVeyor:

ImportError: No module named _check_build

Can you please help resolve it?

EvgeniDubov avatar Dec 30 '18 11:12 EvgeniDubov

@glemaitre I've synced with the latest master and AppVeyor failed with the same error in multiple locations. Is master broken? I don't see how my code could cause these errors. Could you please advise on how to resolve them?

________________ ERROR collecting imblearn/datasets/_zenodo.py ________________
..\imblearn\datasets\_zenodo.py:60: in <module>
    from sklearn.utils.fixes import makedirs
E   ImportError: cannot import name 'makedirs' from 'sklearn.utils.fixes' (C:\Miniconda36-x64\envs\testenv\lib\site-packages\sklearn\utils\fixes.py)

_____________ ERROR collecting imblearn/__check_build/__init__.py _____________
..\imblearn\__check_build\__init__.py:56: in <module>
    from ._check_build import check_build
E   ModuleNotFoundError: No module named 'imblearn.__check_build._check_build'

EvgeniDubov avatar May 27 '19 06:05 EvgeniDubov

@glemaitre @chkoar I've synced with master and got the errors below in the AppVeyor run. Can you please advise?

____________ ERROR collecting imblearn/__check_build/__init__.py _____________
..\imblearn\__check_build\__init__.py:56: in <module>
    from ._check_build import check_build
E   ModuleNotFoundError: No module named 'imblearn.__check_build._check_build'

_______________ ERROR collecting imblearn/tests/test_common.py ________________
..\imblearn\__check_build\__init__.py:56: in <module>
    from ._check_build import check_build
E   ModuleNotFoundError: No module named 'imblearn.__check_build._check_build'

EvgeniDubov avatar Jun 30 '19 07:06 EvgeniDubov

@glemaitre @chkoar I've synced with master and got lint, Travis, and AppVeyor issues, none of which are caused by my contribution. Can you please take a look?

EvgeniDubov avatar Dec 26 '19 06:12 EvgeniDubov

@glemaitre @chkoar My data science team is using the Hellinger distance split criterion from @EvgeniDubov's private repo. We would appreciate it being part of scikit-learn-contrib. We're willing to help move this PR forward in any way possible.

giladwa1 avatar Feb 17 '20 07:02 giladwa1

@giladwa1 I am not familiar with the Hellinger distance yet, but if people are willing to help get this merged I am OK with it even if it works only for the binary case.

chkoar avatar Feb 17 '20 07:02 chkoar

Having spoken a bit more with the scikit-learn devs, I think that it could be integrated into scikit-learn directly. It would only be for the binary case, and we should have good tests and a nice example showing its benefit.

The issue in imbalanced-learn is that we would be required to code in Cython, which then puts a lot of burden on the wheel generation, which personally I would like to avoid if possible. This is somewhat of a hidden cost.

glemaitre avatar Feb 17 '20 09:02 glemaitre

The issue in imbalanced-learn is that we would be required to code in Cython, which then puts a lot of burden on the wheel generation, which personally I would like to avoid if possible.

That is very true.

chkoar avatar Feb 17 '20 11:02 chkoar

@glemaitre @chkoar Thanks for the quick reply; I will continue the discussion in the scikit-learn PR: https://github.com/scikit-learn/scikit-learn/pull/16478

giladwa1 avatar Feb 20 '20 06:02 giladwa1