
Importance is slightly biased towards last variables.

Open piotrszul opened this issue 6 years ago • 1 comments

The procedure for selecting split variables in the case of an equal reduction in impurity is slightly biased towards variables with larger indexes. In the previous non-reproducible approach, this was caused by the increased probability of selecting later variables. In the current one, it is probably caused by insufficient randomness when XOR is used as the hashing function. The solution is to use a better hashing function to generate a surrogate order, and to vary it not only per batch and partition but also for every split. Murmur3 hashing seems to be a good candidate. Here is the code snippet:

Murmur_Snippet.txt
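
For readers without access to the attachment, the idea can be illustrated with a minimal Python sketch. This is not the content of Murmur_Snippet.txt and not the VariantSpark implementation. The function names (`fmix32`, `xor_key`, `murmur_key`, `tie_break_counts`), the way seed, split id and variable index are combined, and the rule "the tied variable with the largest surrogate key wins" are all assumptions made for illustration. Only the 32-bit Murmur3 finalizer constants are taken from the published algorithm.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """32-bit Murmur3 finalizer (avalanche step); a bijection on 32-bit ints."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def xor_key(var_index: int, seed: int) -> int:
    """Hypothetical XOR-based surrogate key: only flips bits of the index."""
    return (var_index ^ seed) & MASK32


def murmur_key(var_index: int, seed: int, split_id: int) -> int:
    """Hypothetical Murmur3-style surrogate key, varied for every split."""
    h = fmix32(seed)
    h = fmix32(h ^ (split_id & MASK32))
    return fmix32(h ^ (var_index & MASK32))


def tie_break_counts(tied, n_splits, key_for_split) -> Counter:
    """Count how often each tied variable wins (largest key) over many splits."""
    return Counter(
        max(tied, key=lambda v: key_for_split(v, s)) for s in range(n_splits)
    )


tied_variables = [0, 1, 2]  # variables with equal impurity reduction

# XOR keys: the split number is used directly as the seed (assumption).
xor_counts = tie_break_counts(tied_variables, 1024, xor_key)

# Murmur3-style keys: fixed seed, key re-derived for every split.
murmur_counts = tie_break_counts(
    tied_variables, 3000, lambda v, s: murmur_key(v, seed=13, split_id=s)
)

print("XOR     :", dict(xor_counts))
print("Murmur3 :", dict(murmur_counts))
```

Under these assumptions, the XOR keys let variable 2 win half of the ties and variables 0 and 1 a quarter each, because XOR with a seed only flips bits and keeps the bit structure of the indexes. The Murmur3-style keys should spread the wins roughly evenly across the three variables.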

piotrszul • Apr 08 '19 05:04

Here is some interesting info on the randomness of various hashing algorithms: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

piotrszul • Apr 18 '19 02:04

It seems to be implemented: https://github.com/aehrc/VariantSpark/commit/19509549fd18e581e3dddea56d52e7f420117157

rocreguant • Feb 15 '24 03:02