
Optimise histogram kernels

Open · RAMitchell opened this issue · 6 comments

Performed loop unrolling and changed the compressed iterator to use byte-aligned sizes, increasing global memory read throughput.
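
A minimal sketch of the two ideas, assuming a shared-memory histogram over `uint16_t` byte-aligned symbols (the kernel name, types, and launch parameters are illustrative, not XGBoost's actual `gpu_hist` code):

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void SharedMemHistogram(const uint16_t* __restrict__ bins,
                                   std::size_t n, float* __restrict__ hist,
                                   int num_bins) {
  extern __shared__ float smem[];
  // Zero the per-block shared-memory histogram.
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) smem[i] = 0.0f;
  __syncthreads();

  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  // Partial unroll keeps more independent global loads in flight per thread.
#pragma unroll 4
  for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    atomicAdd(&smem[bins[i]], 1.0f);  // byte-aligned load: no shift/mask to decode a symbol
  }
  __syncthreads();

  // Flush the block-local histogram into the global one.
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    atomicAdd(&hist[i], smem[i]);
  }
}

int main() {
  const std::size_t n = 1 << 20;
  const int num_bins = 256;
  std::vector<uint16_t> h_bins(n);
  for (std::size_t i = 0; i < n; ++i) h_bins[i] = static_cast<uint16_t>(i % num_bins);

  uint16_t* d_bins = nullptr;
  float* d_hist = nullptr;
  cudaMalloc(&d_bins, n * sizeof(uint16_t));
  cudaMalloc(&d_hist, num_bins * sizeof(float));
  cudaMemcpy(d_bins, h_bins.data(), n * sizeof(uint16_t), cudaMemcpyHostToDevice);
  cudaMemset(d_hist, 0, num_bins * sizeof(float));

  const int block = 256, grid = 128;
  SharedMemHistogram<<<grid, block, num_bins * sizeof(float)>>>(d_bins, n, d_hist, num_bins);
  cudaDeviceSynchronize();

  std::vector<float> h_hist(num_bins);
  cudaMemcpy(h_hist.data(), d_hist, num_bins * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("bin 0 count: %.0f (expected %zu)\n", h_hist[0], n / num_bins);

  cudaFree(d_bins);
  cudaFree(d_hist);
  return 0;
}
```

The unroll hint simply exposes more memory-level parallelism per thread, and the byte-aligned read avoids the per-symbol shift/mask work a bit-packed compressed iterator has to do.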

max_depth=8

| dataset | master | hist |
|---------|--------|------|
| airline | 89.51209751 | 83.09268917 |
| bosch | 12.62905315 | 13.52083097 |
| covtype | 17.99281998 | 15.88525812 |
| epsilon | 44.71274849 | 39.46799638 |
| fraud | 1.29335506 | 1.161479132 |
| higgs | 17.27792022 | 15.09929334 |
| year | 6.953637654 | 4.075826511 |

RAMitchell · Jul 26 '22 15:07

There was a discussion about the block size/kernel launch size being too large, so that many threads are wasted in the histogram kernel on the latest architectures. Did you get a chance to look into that?

trivialfis · Jul 26 '22 17:07

Thanks for the reminder. Maybe I should test on Ampere to check that I haven't reintroduced that issue. I think the number of blocks launched should be even smaller in this PR, but I should check.
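
For context, a rough illustration (hypothetical code, not this PR's implementation) of keeping the launch small on devices with many SMs: size the grid from occupancy rather than from the number of rows, and let a grid-stride loop cover the data.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void HistStub(const unsigned* __restrict__ in, unsigned* __restrict__ hist,
                         std::size_t n) {
  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  // Grid-stride loop: a small, device-sized grid still covers every row.
  for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    atomicAdd(&hist[in[i]], 1u);
  }
}

int main() {
  int device = 0, sm_count = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  const int block_size = 256;  // assumed thread-block size
  int blocks_per_sm = 0;
  // Resident blocks of this kernel per SM at this block size (no dynamic shared memory).
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, HistStub, block_size, 0);

  // Launch only enough blocks to fill the device, independent of the data size.
  const int grid_size = sm_count * blocks_per_sm;
  std::printf("SMs=%d, blocks/SM=%d -> grid=%d blocks\n", sm_count, blocks_per_sm, grid_size);
  return 0;
}
```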

RAMitchell · Jul 27 '22 09:07

Here is the A100 benchmark. Everything looks good.

| dataset | master | hist |
|---------|--------|------|
| airline | 65.77564727 | 60.79124835 |
| bosch | 13.05801762 | 13.36868745 |
| covtype | 20.95157623 | 14.26051986 |
| epsilon | 47.79153186 | 48.37207412 |
| fraud | 1.514388888 | 1.128341728 |
| higgs | 14.98636844 | 10.8116073 |
| year | 4.462064292 | 4.655418076 |

RAMitchell · Jul 29 '22 09:07

Please convert it to non-draft so that we can run tests on Jenkins.

trivialfis · Aug 01 '22 07:08

Unfortunately, using byte-aligned sizes in the compressed iterator increased the memory usage of the large-sizes test by 1GB, and I think it now only just fails to fit on the T4 we use in CI.

The memory used by DeviceQuantileDMatrix in the test went from ~12GB to ~13GB, which I think is acceptable; it's just slightly annoying that the test can't run on these machines.

RAMitchell · Aug 03 '22 11:08

Seems odd though; I think the memory usage bottleneck is in sketching rather than in the ELLPACK page.

trivialfis · Aug 03 '22 14:08

I reverted the changes to the compressed iterator. In the test for large sizes, the bit-packed version is able to use 10 bits per symbol, whereas the aligned version uses 16. The page size is 2484MB vs 4294MB.

Speed actually seems better in some cases with bit-packing compression.
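
The page-size gap follows roughly from the bits-per-symbol arithmetic. Here is a back-of-the-envelope sketch using a hypothetical helper and illustrative counts (not XGBoost's actual buffer-size computation, and not the exact figures from the large-sizes test):

```cuda
#include <cstddef>
#include <cstdio>

// Smallest number of bits that can represent `num_symbols` distinct values.
static int SymbolBits(std::size_t num_symbols) {
  int bits = 1;
  while ((std::size_t{1} << bits) < num_symbols) ++bits;
  return bits;
}

int main() {
  // Illustrative figures only: ~600 distinct symbols (bins plus a missing marker)
  // and 2e9 stored elements (rows x features) in the ELLPACK page.
  const std::size_t num_symbols = 600;
  const std::size_t num_elements = 2000000000ULL;

  const int packed_bits = SymbolBits(num_symbols);       // 10 bits covers up to 1024 symbols
  const int aligned_bits = ((packed_bits + 7) / 8) * 8;  // round up to the next byte: 16 bits

  const double packed_mib = packed_bits / 8.0 * num_elements / (1 << 20);
  const double aligned_mib = aligned_bits / 8.0 * num_elements / (1 << 20);
  std::printf("bit packed  : %2d bits/symbol -> ~%.0f MiB\n", packed_bits, packed_mib);
  std::printf("byte aligned: %2d bits/symbol -> ~%.0f MiB\n", aligned_bits, aligned_mib);
  return 0;
}
```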

Benchmarking results:

| dataset | Without compression | With compression |
|---------|---------------------|------------------|
| airline | 60.79124835 | 59.46090026 |
| bosch | 13.36868745 | 13.11017834 |
| covtype | 14.26051986 | 14.26651251 |
| epsilon | 48.37207412 | 37.74572848 |
| fraud | 1.128341728 | 1.119378205 |
| higgs | 10.8116073 | 10.70780578 |
| year | 4.655418076 | 4.122201498 |

RAMitchell · Aug 11 '22 11:08