
Improve DMatrix creation performance in python

Open arieleiz opened this issue 1 year ago • 6 comments

The xgboost Python package serializes numpy arrays as JSON. This can take a considerable amount of time in production workloads. This patch optimizes the specific case where the numpy array is already in C-contiguous 32-bit floating-point format, so it can be loaded directly without the JSON layer. This can reduce DMatrix creation time by up to about 35% in some cases, as can be seen in the microbenchmark added in xgboost/tests/python/microbench_numpy.py (the Ratio column is the contiguous time divided by the non-contiguous time):

Rows     | Cols     | Threads      | Contiguous      | Non-contiguous  | Ratio
---------+----------+--------------+-----------------+-----------------+--------------
   15000 |      100 |            0 |         0.01686 |         0.01988 |        84.8%
   15000 |      100 |            1 |         0.02897 |         0.04424 |        65.5%
   15000 |      100 |            2 |         0.02579 |          0.0392 |        65.8%
   15000 |      100 |           10 |         0.01581 |         0.02058 |        76.8%
---------+----------+--------------+-----------------+-----------------+--------------
       2 |     2000 |            0 |        0.001055 |        0.001205 |        87.6%
       2 |     2000 |            1 |       0.0004465 |       0.0005689 |        78.5%
       2 |     2000 |            2 |       0.0004609 |        0.000615 |        74.9%
       2 |     2000 |           10 |       0.0005087 |       0.0005623 |        90.5%
---------+----------+--------------+-----------------+-----------------+--------------

The pull request contains updated tests as well.
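
To illustrate the case the patch targets, here is a minimal sketch of how a caller could check whether an array is already C-contiguous float32 and convert it if not. The helper names `is_fast_path_candidate` and `to_fast_path_layout` are hypothetical and are not part of xgboost or of this pull request; only standard numpy calls are used.

```python
import numpy as np


def is_fast_path_candidate(arr):
    """Hypothetical check: True if arr is a C-contiguous float32 ndarray,
    which is the layout the patch is described as optimizing."""
    return (
        isinstance(arr, np.ndarray)
        and arr.dtype == np.float32
        and arr.flags.c_contiguous
    )


def to_fast_path_layout(arr):
    """Hypothetical helper: return arr as C-contiguous float32.
    np.ascontiguousarray copies only when the layout or dtype differs."""
    return np.ascontiguousarray(arr, dtype=np.float32)


rng = np.random.default_rng(0)

# Freshly allocated float32 data is C-contiguous by default.
base = rng.random((15000, 100), dtype=np.float32)

# A transposed view shares memory with base but is not C-contiguous.
transposed = base.T

print(is_fast_path_candidate(base))        # already in the target layout
print(is_fast_path_candidate(transposed))  # needs conversion first

converted = to_fast_path_layout(transposed)
print(is_fast_path_candidate(converted))
```

The converted array can then be passed to `xgboost.DMatrix` as usual. Whether the one-time copy made by `np.ascontiguousarray` pays off depends on the workload, so it is worth measuring before adopting it.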

arieleiz, Jun 10 '24 20:06