xgboost
xgboost copied to clipboard
Optimise histogram kernels
Performed loop unrolling and change compressed iterator to use byte aligned sizes, increasing global memory read throughput.
max_depth=8
dataset | master | hist |
---|---|---|
airline | 89.51209751 | 83.09268917 |
bosch | 12.62905315 | 13.52083097 |
covtype | 17.99281998 | 15.88525812 |
epsilon | 44.71274849 | 39.46799638 |
fraud | 1.29335506 | 1.161479132 |
higgs | 17.27792022 | 15.09929334 |
year | 6.953637654 | 4.075826511 |
There was a discussion about the block size/kernel size being too large and many threads are wasted in the histogram kernel on latest architecture. Did you get a chance to look into that?
Thanks for the reminder. Maybe I should test on Ampere to check that I haven't reintroduced that issue. I think the number of blocks launched should be even smaller in this PR, but I should check.
Here is the A100 benchmark. Everything looks good.
dataset | master | hist |
---|---|---|
airline | 65.77564727 | 60.79124835 |
bosch | 13.05801762 | 13.36868745 |
covtype | 20.95157623 | 14.26051986 |
epsilon | 47.79153186 | 48.37207412 |
fraud | 1.514388888 | 1.128341728 |
higgs | 14.98636844 | 10.8116073 |
year | 4.462064292 | 4.655418076 |
Please convert it to non-draft so that we can run tests on Jenkins.
Unfortunately using aligned byte sizes in the compressed iterator increased the memory usage of the large sizes test by 1gb and I think it barely no longer fits on the T4 we use in CI.
The memory used by DeviceQuantileDMatrix in the test went from ~12GB to ~13GB which I think is acceptable, its just slightly annoying the test can't run on these machines.
Seems odd though, I think the memory usage bottleneck is on sketching instead of ellpack.
I reverted the changes to compressed iterator. In the test for large sizes the bit packed version to is able to use 10 bits per symbol where the aligned version uses 16. The page size is 2484Mb vs 4294Mb.
Speed seems better actually in some cases with bit packing compression.
Benchmarking results:
dataset | Without compression | With compression |
---|---|---|
airline | 60.79124835 | 59.46090026 |
bosch | 13.36868745 | 13.11017834 |
covtype | 14.26051986 | 14.26651251 |
epsilon | 48.37207412 | 37.74572848 |
fraud | 1.128341728 | 1.119378205 |
higgs | 10.8116073 | 10.70780578 |
year | 4.655418076 | 4.122201498 |