
[bug] fix MT2203 RNG non-uniformity and random bin indices in decision forest training

Open icfaust opened this issue 7 months ago • 8 comments

Description

Each MT2203 RNG engine is independently uniform when sampled on its own. However, when samples from two or more engines are combined, the initial aggregated random numbers are not uniform. Because the decision forest algorithm requires independent randomness between trees (each tree has its own RNG engine), an imperceptible performance cost is introduced: each engine burns (discards) initial RNG values until the engine collection is empirically uniform.
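A minimal sketch of what "empirically uniform" could mean for the aggregated stream. Python's `random` is MT19937, not the MT2203 family, and the seeding scheme below is illustrative only; the point is the mechanism of taking one post-burn-in draw per engine and histogramming the aggregate.

```python
import random

def aggregated_first_draws(seeds, skip=0):
    # Take one draw from each independently seeded engine,
    # optionally discarding `skip` draws per engine first
    # (the burn-in described in this PR).
    out = []
    for s in seeds:
        eng = random.Random(s)
        for _ in range(skip):
            eng.random()
        out.append(eng.random())
    return out

def bin_counts(samples, n_bins=10):
    # Histogram of samples in [0, 1); a roughly flat histogram
    # suggests the aggregated stream is empirically uniform.
    counts = [0] * n_bins
    for x in samples:
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return counts

counts = bin_counts(aggregated_first_draws(range(1000), skip=400))
```

A chi-square test over `counts` would turn this eyeball check into a pass/fail criterion.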

The second issue is with the binary search used to find a split for ExtraTrees (regressor and classifier). The search failed to find the largest bin left edge in its previous orientation, so it has been reoriented to always guarantee a valid split. This change resolves an ambiguity in applying a binning approach to the Extra Trees algorithm definition. All uses of the .min parameter are removed, and it is dropped entirely from IndexedFeatures and the initial binning scripts.
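The reoriented search can be sketched in Python with the standard `bisect` module (the actual change lives in oneDAL's C++ `genRandomBinIdx`; `find_bin_index` and `edges` here are hypothetical names). The invariant is: return the index of the largest bin left edge less than or equal to the drawn value, clamping low values to bin 0 so a valid split always exists.

```python
from bisect import bisect_right

def find_bin_index(left_edges, value):
    # left_edges must be sorted ascending. bisect_right gives the
    # insertion point after any equal edges, so subtracting 1 yields
    # the index of the largest left edge <= value; values below the
    # first edge clamp to bin 0, guaranteeing a valid split.
    i = bisect_right(left_edges, value) - 1
    return max(i, 0)

# Illustrative bin left edges
edges = [0.0, 1.5, 3.0, 7.0]
```

For example, a value exactly on an edge maps to that edge's bin, and a value past the last edge maps to the last bin.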

This will fix the following deselected tests from sklearnex:

  • tests/test_multioutput.py::test_classifier_chain_tuple_order
  • ensemble/tests/test_forest.py::test_distribution

However, this changes the determinism of the trees used in the sklearnex tests, which means some tests that previously passed by chance could now fail.

This non-uniformity negatively impacts both the random forest (in the bootstrapping process) and extra trees (in the initially chosen splits).

Changes proposed in this pull request:

  • Check for a family engine (only MT2203)
  • Burn a magic number of samples (400) for every engine
  • Remove .min() from IndexedFeatures
  • Change the binary search in genRandomBinIdx for classification and regression
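The first two bullets could be gated as follows. This is a hedged Python sketch, not the oneDAL C++ code: `needs_burn_in`, `prepare_engines`, and the use of Python's `random.Random` in place of the MT2203 family are all illustrative assumptions; only the burn count of 400 comes from this PR.

```python
import random

BURN_COUNT = 400  # magic number of draws to discard per engine (from this PR)

def needs_burn_in(engine_family: str) -> bool:
    # Only the MT2203 family showed aggregated non-uniformity,
    # so the burn-in is applied for it alone.
    return engine_family.lower() == "mt2203"

def prepare_engines(engine_family, seeds):
    # One engine per tree; burn BURN_COUNT draws per engine
    # only when the family requires it.
    engines = [random.Random(s) for s in seeds]
    if needs_burn_in(engine_family):
        for eng in engines:
            for _ in range(BURN_COUNT):
                eng.random()  # discard a draw
    return engines
```

Gating on the family keeps the (already imperceptible) cost away from engines that do not need it.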

icfaust • Nov 21 '23 12:11