VariantSpark
Importance is slightly biased towards last variables.
The procedure for selecting split variables when several give an equal reduction in impurity is slightly biased towards variables with larger indexes. In the previous non-reproducible approach it was caused by the increased probability of selecting later variables. In the current one it is probably caused by insufficient randomness from using XOR as the hashing function. The solution is to use a better hashing function to generate a surrogate order and to vary it not only per batch and partition but also for every split. Murmur3 hashing seems to be a good candidate.
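To illustrate the idea (this is not the VariantSpark code, just a minimal sketch in Python), the tie-break could look roughly like this. It uses only the 32-bit finalizer step of MurmurHash3 rather than the full hash, and the names `surrogate_key`, `pick_tied` and `split_seed` are hypothetical. The assumption is that each split gets its own seed, so the surrogate order among tied variables changes from split to split.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """32-bit finalizer (avalanche step) of MurmurHash3."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def surrogate_key(var_index: int, split_seed: int) -> int:
    """Hypothetical surrogate order key: variable index mixed with a per-split seed."""
    return fmix32(var_index ^ fmix32(split_seed))


def pick_tied(tied_indexes, split_seed: int) -> int:
    """Among variables with equal impurity reduction, pick the smallest surrogate key."""
    return min(tied_indexes, key=lambda i: surrogate_key(i, split_seed))


# Tally which of 8 tied variables wins over many splits (one seed per split).
tied = list(range(8))
counts = Counter(pick_tied(tied, seed) for seed in range(4000))
print(sorted(counts.items()))
```

If the mixing is good enough, the tally should be roughly flat across the variable indexes rather than leaning towards the larger ones. Running the same tally with the existing XOR-based ordering would be one way to confirm where the bias comes from.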
Here is some interesting information on the randomness of various hashing algorithms: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
It seems to be implemented: https://github.com/aehrc/VariantSpark/commit/19509549fd18e581e3dddea56d52e7f420117157