
Optimise histogram kernels

Open · RAMitchell opened this issue · 6 comments

Performed loop unrolling and changed the compressed iterator to use byte-aligned sizes, increasing global memory read throughput.
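
A minimal sketch of the two ideas, assuming a shared-memory histogram over `uint16_t` byte-aligned symbols (the kernel name, types, and launch parameters are illustrative, not XGBoost's actual `gpu_hist` code):

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void SharedMemHistogram(const uint16_t* __restrict__ bins,
                                   std::size_t n, float* __restrict__ hist,
                                   int num_bins) {
  extern __shared__ float smem[];
  // Zero the per-block shared-memory histogram.
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) smem[i] = 0.0f;
  __syncthreads();

  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  // Partial unroll keeps more independent global loads in flight per thread.
#pragma unroll 4
  for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    atomicAdd(&smem[bins[i]], 1.0f);  // byte-aligned load: no shift/mask to decode a symbol
  }
  __syncthreads();

  // Flush the block-local histogram into the global one.
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    atomicAdd(&hist[i], smem[i]);
  }
}

int main() {
  const std::size_t n = 1 << 20;
  const int num_bins = 256;
  std::vector<uint16_t> h_bins(n);
  for (std::size_t i = 0; i < n; ++i) h_bins[i] = static_cast<uint16_t>(i % num_bins);

  uint16_t* d_bins = nullptr;
  float* d_hist = nullptr;
  cudaMalloc(&d_bins, n * sizeof(uint16_t));
  cudaMalloc(&d_hist, num_bins * sizeof(float));
  cudaMemcpy(d_bins, h_bins.data(), n * sizeof(uint16_t), cudaMemcpyHostToDevice);
  cudaMemset(d_hist, 0, num_bins * sizeof(float));

  const int block = 256, grid = 128;
  SharedMemHistogram<<<grid, block, num_bins * sizeof(float)>>>(d_bins, n, d_hist, num_bins);
  cudaDeviceSynchronize();

  std::vector<float> h_hist(num_bins);
  cudaMemcpy(h_hist.data(), d_hist, num_bins * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("bin 0 count: %.0f (expected %zu)\n", h_hist[0], n / num_bins);

  cudaFree(d_bins);
  cudaFree(d_hist);
  return 0;
}
```

The unroll hint simply exposes more memory-level parallelism per thread, and the byte-aligned read avoids the per-symbol shift/mask work a bit-packed compressed iterator has to do.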

max_depth=8

| dataset | master | hist |
|---------|--------|------|
| airline | 89.51209751 | 83.09268917 |
| bosch | 12.62905315 | 13.52083097 |
| covtype | 17.99281998 | 15.88525812 |
| epsilon | 44.71274849 | 39.46799638 |
| fraud | 1.29335506 | 1.161479132 |
| higgs | 17.27792022 | 15.09929334 |
| year | 6.953637654 | 4.075826511 |

RAMitchell · Jul 26 '22 15:07

There was a discussion about the block size/kernel launch size being too large, so that many threads are wasted in the histogram kernel on the latest architectures. Did you get a chance to look into that?

trivialfis · Jul 26 '22 17:07

Thanks for the reminder. Maybe I should test on Ampere to check that I haven't reintroduced that issue. I think the number of blocks launched should be even smaller in this PR, but I should check.
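
For context, a rough illustration (hypothetical code, not this PR's implementation) of keeping the launch small on devices with many SMs: size the grid from occupancy rather than from the number of rows, and let a grid-stride loop cover the data.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void HistStub(const unsigned* __restrict__ in, unsigned* __restrict__ hist,
                         std::size_t n) {
  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  // Grid-stride loop: a small, device-sized grid still covers every row.
  for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    atomicAdd(&hist[in[i]], 1u);
  }
}

int main() {
  int device = 0, sm_count = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  const int block_size = 256;  // assumed thread-block size
  int blocks_per_sm = 0;
  // Resident blocks of this kernel per SM at this block size (no dynamic shared memory).
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, HistStub, block_size, 0);

  // Launch only enough blocks to fill the device, independent of the data size.
  const int grid_size = sm_count * blocks_per_sm;
  std::printf("SMs=%d, blocks/SM=%d -> grid=%d blocks\n", sm_count, blocks_per_sm, grid_size);
  return 0;
}
```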

RAMitchell · Jul 27 '22 09:07

Here is the A100 benchmark. Everything looks good.

| dataset | master | hist |
|---------|--------|------|
| airline | 65.77564727 | 60.79124835 |
| bosch | 13.05801762 | 13.36868745 |
| covtype | 20.95157623 | 14.26051986 |
| epsilon | 47.79153186 | 48.37207412 |
| fraud | 1.514388888 | 1.128341728 |
| higgs | 14.98636844 | 10.8116073 |
| year | 4.462064292 | 4.655418076 |

RAMitchell · Jul 29 '22 09:07

Please convert it to non-draft so that we can run tests on Jenkins.

trivialfis · Aug 01 '22 07:08

Unfortunately, using byte-aligned sizes in the compressed iterator increased the memory usage of the large-sizes test by 1GB, and I think it now only just fails to fit on the T4 we use in CI.

The memory used by DeviceQuantileDMatrix in the test went from ~12GB to ~13GB, which I think is acceptable; it's just slightly annoying that the test can't run on these machines.

RAMitchell · Aug 03 '22 11:08

Seems odd though; I think the memory usage bottleneck is in sketching rather than in the ELLPACK page.

trivialfis · Aug 03 '22 14:08

I reverted the changes to the compressed iterator. In the test for large sizes, the bit-packed version is able to use 10 bits per symbol, whereas the aligned version uses 16. The page size is 2484MB vs 4294MB.

Speed actually seems better in some cases with bit-packing compression.
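
The page-size gap follows roughly from the bits-per-symbol arithmetic. Here is a back-of-the-envelope sketch using a hypothetical helper and illustrative counts (not XGBoost's actual buffer-size computation, and not the exact figures from the large-sizes test):

```cuda
#include <cstddef>
#include <cstdio>

// Smallest number of bits that can represent `num_symbols` distinct values.
static int SymbolBits(std::size_t num_symbols) {
  int bits = 1;
  while ((std::size_t{1} << bits) < num_symbols) ++bits;
  return bits;
}

int main() {
  // Illustrative figures only: ~600 distinct symbols (bins plus a missing marker)
  // and 2e9 stored elements (rows x features) in the ELLPACK page.
  const std::size_t num_symbols = 600;
  const std::size_t num_elements = 2000000000ULL;

  const int packed_bits = SymbolBits(num_symbols);       // 10 bits covers up to 1024 symbols
  const int aligned_bits = ((packed_bits + 7) / 8) * 8;  // round up to the next byte: 16 bits

  const double packed_mib = packed_bits / 8.0 * num_elements / (1 << 20);
  const double aligned_mib = aligned_bits / 8.0 * num_elements / (1 << 20);
  std::printf("bit packed  : %2d bits/symbol -> ~%.0f MiB\n", packed_bits, packed_mib);
  std::printf("byte aligned: %2d bits/symbol -> ~%.0f MiB\n", aligned_bits, aligned_mib);
  return 0;
}
```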

Benchmarking results:

| dataset | Without compression | With compression |
|---------|---------------------|------------------|
| airline | 60.79124835 | 59.46090026 |
| bosch | 13.36868745 | 13.11017834 |
| covtype | 14.26051986 | 14.26651251 |
| epsilon | 48.37207412 | 37.74572848 |
| fraud | 1.128341728 | 1.119378205 |
| higgs | 10.8116073 | 10.70780578 |
| year | 4.655418076 | 4.122201498 |

RAMitchell · Aug 11 '22 11:08