xgboost
Improve DMatrix creation performance in Python
The xgboost Python package serializes numpy arrays as JSON. This can take a considerable amount of time in production workloads. This patch optimizes the specific case where the numpy array is already in C-contiguous 32-bit floating-point format, so it can be loaded directly without going through the JSON layer. This can improve performance by up to 35% in some cases, as shown by the microbenchmark added in xgboost/tests/python/microbench_numpy.py:
Rows  | Cols | Threads | Contiguous | Non-contiguous | Ratio
------+------+---------+------------+----------------+------
15000 | 100  | 0       | 0.01686    | 0.01988        | 84.8%
15000 | 100  | 1       | 0.02897    | 0.04424        | 65.5%
15000 | 100  | 2       | 0.02579    | 0.0392         | 65.8%
15000 | 100  | 10      | 0.01581    | 0.02058        | 76.8%
------+------+---------+------------+----------------+------
2     | 2000 | 0       | 0.001055   | 0.001205       | 87.6%
2     | 2000 | 1       | 0.0004465  | 0.0005689      | 78.5%
2     | 2000 | 2       | 0.0004609  | 0.000615       | 74.9%
2     | 2000 | 10      | 0.0005087  | 0.0005623      | 90.5%
------+------+---------+------------+----------------+------
The pull request contains updated tests as well.
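
For callers who want to hit the optimized path, the sketch below shows one way to check whether an array is already C-contiguous float32, and to convert it if not. The helper names `is_fast_path` and `to_fast_path` are hypothetical and are not part of xgboost or of this patch. The check only mirrors the condition described above and may not match the exact test the patch performs.

```python
import numpy as np


def is_fast_path(arr):
    """Hypothetical helper: True if arr is a C-contiguous float32 ndarray."""
    return (
        isinstance(arr, np.ndarray)
        and arr.dtype == np.float32
        and arr.flags["C_CONTIGUOUS"]
    )


def to_fast_path(arr):
    """Hypothetical helper: return arr as a C-contiguous float32 array."""
    return np.ascontiguousarray(arr, dtype=np.float32)


# Same shape as the larger benchmark case above (15000 rows, 100 columns).
contiguous = np.zeros((15000, 100), dtype=np.float32)

# A transposed view shares the same memory but is Fortran-ordered,
# so it is not C-contiguous.
non_contiguous = contiguous.T

# float64 data is contiguous but has the wrong dtype for the direct path.
wrong_dtype = np.zeros((2, 2000), dtype=np.float64)

fixed_layout = to_fast_path(non_contiguous)
fixed_dtype = to_fast_path(wrong_dtype)
```

The converted array can then be passed to `xgboost.DMatrix` as usual. Whether the one-off conversion cost is worth paying depends on the workload, so it is worth measuring before adopting it.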