xgboost [R] Use inplace predict

This PR switches the predict method to use inplace prediction when possible (dense matrix / CSR matrix as inputs).

It seems that there are quite a lot of limitations around when it is or isn't possible to use these functions which are not documented in the readthedocs page, so I am not sure if the conditions that I am adding here to check for whether inplace predict is supported are correct.

I'm also not sure what's the most efficient way of dealing with all the JSON manipulation here.

Along the way, since it is refactoring some of the earlier code introduced for processing the "missing" parameter, it also modifies the function that creates DMatrix objects from CSR/CSC. It doesn't make a practical difference there though, since currently only classes inheriting from "DsparseMatrix" (float64 values) are supported, but might come handy in the future if other sparse matrix classes (like "LsparseMatrix") become supported in the future.

Nov 30 '23 21:11 david-cortes

ref https://github.com/dmlc/xgboost/issues/9810 fixes https://github.com/dmlc/xgboost/issues/7616

Nov 30 '23 21:11 david-cortes

I see there's a failing MSVC job, but it doesn't seem to contain any test failure, only warnings. No idea what went wrong there.

Dec 01 '23 17:12 david-cortes

Always fun to have R failing on Windows when one thinks everything is ready. ;-) I restarted the CI and it seems to be failing again. I can help debug it on Monday next week.

Dec 01 '23 23:12 trivialfis

Let's hope the R update in the msvc job helps here.

Dec 06 '23 18:12 david-cortes

Let me try to reproduce it.

Dec 11 '23 01:12 trivialfis

I can reproduce the error on my machine, and somehow magically got it working by cout << res_code << endl in the new prediction function. The log from MSVC looks suspicious. Other than the error, the prediction function needs to consider base margin as well, which requires the use of proxy DMatrix. I will look into the QDM PR for the proxy implementation in R.

 C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\include\stdarg.h(29,9): warning C4068: unknown pragma 'Rf_warning' [D:\a\xgboost\xgboost\build\R-package\xgboost-r.vcxproj]
C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\ucrt\corecrt_malloc.h(13,9): warning C4068: unknown pragma 'Rf_warning' [D:\a\xgboost\xgboost\build\R-package\xgboost-r.vcxproj]
C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\ucrt\corecrt_malloc.h(14,9): warning C4068: unknown pragma 'Rf_warning' [D:\a\xgboost\xgboost\build\R-package\xgboost-r.vcxproj]

Dec 11 '23 07:12 trivialfis

Stupid question: wouldn't setting a proxy dmatrix inside the prediction function be slower and/or consume the same amount of memory as creating a dmatrix to use for predict instead? Or am I misunderstanding the mechanism?

Dec 11 '23 18:12 david-cortes

BTW, while trying to add base margin here, I ran into an issue which manifested itself with both inplace predictions and dmatrix predictions: https://github.com/dmlc/xgboost/issues/9869

Dec 11 '23 20:12 david-cortes

wouldn't setting a proxy dmatrix inside the prediction function be slower and/or consume the same amount of memory

No, the proxy DMatrix is just a reference to the original data, so, don't free the data without first freeing the proxy dmatrix ...

Dec 12 '23 03:12 trivialfis

With hindsight, the inplace prediction can be completely implemented with proxy DMatrix without those special parameters for csr/dense, but at this point, there's no real benefit to making such breaking changes.

Dec 12 '23 03:12 trivialfis

Added base_margin as a parameter to predict, plus a function to use the in-place prediction on data frames.

Dec 12 '23 19:12 david-cortes

@trivialfis After adding support for categorical features for data frames, I just realized that passing them to predict will now require re-identifying which of those earlier factor columns were categorical, given that they need to be manually converted to base-0 indexing.

What would be the best way to retrieve that information (which columns are categorical) from the booster?

Dec 12 '23 21:12 david-cortes

@david-cortes In Python, the booster has the same feature_type parameter, search XGBoosterGetStrFeatureInfo in core.py.

Dec 13 '23 00:12 trivialfis

@trivialfis Thanks for the tip. Would be very helpful if you could investigate the failing MSVC job here so that we can merge this PR soon, as I'm finding some of the code here will be needed for the QuantileDMatrix and other PRs too.

I unfortunately don't know how best to debug an MSVC-compiled module that needs to be imported in R, as it can't use typical tools like sanitizers with a regular R build.

Dec 17 '23 22:12 david-cortes

I can try, but I admit I don't use Windows very often.

Dec 19 '23 02:12 trivialfis

Is this PR independent of #9902?

I use Windows as a daily driver, so should be able to help debug the error in this PR.

Dec 20 '23 06:12 hcho3

Is this PR independent of https://github.com/dmlc/xgboost/pull/9902?

yes.

Dec 20 '23 06:12 trivialfis

Tried to use the sanitizer bundled in vs, but ran into a compiler internal error ...

Jan 02 '24 07:01 trivialfis

Tried to use the sanitizer bundled in vs, but ran into a compiler internal error ...

Does that compiler error still happen if you comment out the R-specific code introduced here?

Jan 02 '24 18:01 david-cortes

The compiler error with sanitizer was about a different file in XGB (refresher), I will try again tomorrow.

Jan 02 '24 18:01 trivialfis

I have made no progress on the Windows error, we will have to delay this PR for a while.

Jan 15 '24 09:01 trivialfis

Discussion with @hcho3 , we are likely to move away from MSVC for the R package, along with which, we will drop support for building R package with GPU on Windows using MSVC. We haven't ruled out the possibility of using rtools and CMake to build GPU support, in addition, recommending WSL2 is a potential option as well.

Jan 16 '24 08:01 trivialfis

Discussion with @hcho3 , we are likely to move away from MSVC for the R package, along with which, we will drop support for building R package with GPU on Windows using MSVC. We haven't ruled out the possibility of using rtools and CMake to build GPU support, in addition, recommending WSL2 is a potential option as well.

Is that just for the R package, or also for the python package for windows? If just for R, have there been MSVC-specific issues like this before? Curious about what makes it problematic.

Jan 16 '24 17:01 david-cortes

It's just the R package,mostly because I couldn't figure out the compiler warning and the error in this PR. So far, it seems MSVC is not handling the C headers from R very well.

Jan 16 '24 18:01 trivialfis

Could you please help rebase the PR?

Jan 20 '24 16:01 trivialfis

Updated, but I'm now not 100% sure if everything is correct.

I suddenly started getting an error in this test and the one right after it after merging the latest master: https://github.com/dmlc/xgboost/blob/c5d0608057f3c0b335fd31594c65a3a3ca4afca3/R-package/tests/testthat/test_basic.R#L142

Before, it was using ntreelimit, which I guess meant that the code was converting to DMatrix, while now it's actually using inplace predict.

It looks like predictions from in-place match those from DMatrix and from xgb.train up to only about ~1e-7 in this particular case (dart booster), while elsewhere they seem to match closer.

BTW, would be ideal if you could merge the PR about data iterators before this one: https://github.com/dmlc/xgboost/pull/9913

Jan 20 '24 19:01 david-cortes

Updated, but I'm now not 100% sure if everything is correct.

Please let me know how things are going. ;-) Is there anything I can help?

Feb 18 '24 20:02 trivialfis

Updated, but I'm now not 100% sure if everything is correct.

Please let me know how things are going. ;-) Is there anything I can help?

PR is ready for review. Would be ideal if you could look into this issue with loss of floating point precision for dart. Since the difference between inplace and dmatrix is below $10^{-6}$ I guess it might not be a correctness problem, but it's odd that it gets a larger difference with dart than with gbtree.

Feb 19 '24 18:02 david-cortes

Thank you for the work on inplace prediction! The R package is looking much better now thanks to your great work.

Feb 23 '24 16:02 trivialfis

xgboost xgboost copied to clipboard

[R] Use inplace predict

xgboost
xgboost copied to clipboard