feature_engine
feature_engine copied to clipboard
[New Transformer] ChiMergeDiscretisor
Closes #450.
Notes from #450:
Existing implementations:
https://github.com/lisette-espin/pychimerge https://github.com/night18/ChiMerge https://github.com/raiyan1102006/ChiMerge https://gist.github.com/alanzchen/17d0c4a45d59b79052b1cd07f531689e?short_path=f2e54c6
Reference to the original article can be found in the first link
Hi @Morgan-Sell
Thanks for kicking this off.
I reckon this one is not ready for review, right?
It would be great to have some tests with the expected result for the transformer.
Thank you!
hola @solegalli,
Correct, it's not ready for review. I'm still working on it. And, yes, I'll create tests.
@solegalli,
Should we avoid using dataframes and instead use dictionaries and numpy arrays? I suspect iterating through dataframes increases computational costs.
I'm going to keep the question, but I think the answer is "yes". Dictionaries and numpy arrays simplify the merging of frequency distributions ;)
hola @solegalli,
I think I need an extra set of eyes. I'm struggling to identify what is causing the error for test_chi_merge(). I believe the root cause is in _calc_chi_square(); however, I cannot identify where.
In test_chi_merge(), the expected results are [4.1, 2.4, 8.6, 2.9, 1.7, 1.8, 2.2, 4.8, 4.1, 3.2, 1.5, 3.6]
. These are the results stated in the Chi-Merge paper.
The transformer returns the following results: [5.9, 5.2, 8.6, 2.9, 1.8, 1.8, 3.1, 4.8, 4.1, 3.2, 1.8, 13.0]
.
The above values are the results from the chi-square tests of the consecutive distributions. 5 of the 12 expected results are incorrect.
Indices of the values that don't reconcile: 0, 1, 6, 10, and 11.
Do you see the bug?
hola @solegalli,
Did you have a chance to look at this bug? I'm stumped.
hi @solegalli,
Are you still getting around to reviewing this discretizer? I think it's super cool!
I know you're quite busy. I'm trying to organize myself.
Still pending. I send you an email? would that work?