Roadmap for new R interface
ref https://github.com/dmlc/xgboost/issues/9734 ref https://github.com/dmlc/xgboost/issues/9475
This issue is intended as a roadmap tracker for progress in bringing xgboost's R interface up to date and discussions around these tasks and coordination.
From the previous tasks, here I've made a list of potential tasks to take on, but I might be missing some things, and I've put the biggest task (new xgboost() function) under a single bullet point while in practice it'll likely involve multiple rounds of PRs. Please feel free to add more tasks to this list.
I've taken the liberty of classifying these issues in terms of whether they'd be blockers for releasing a new xgboost version or not, albeit some people might disagree with my assessments.
- [x] (Blocker) Enable categorical features for current `DMatrix` constructors (`matrix`, `dgCMatrix`, `dgRMatrix`).
- [x] (Blocker) Add support for creating DMatrices from R `data.frame` objects, automatically setting `factor` variables to be of categorical type in the DMatrix. (#9828)
  - Note: these objects are a list of arrays which aren't necessarily in a single memory chunk, and which can have types `int` (`int32_t`), `double` (`float64`), and potentially `int64_t` from package `bit64`.
  - I guess this and the first point could be done in the same PR since they might be touching similar code sections.
- [ ] (Blocker) Fix plotting and trees-to-table with categorical splits.
- [x] Add `XGDMatrixNumNonMissing`.
- [x] Add `XGDMatrixGetDataAsCSR`.
- [x] (Blocker) Enable multi-output input labels and predictions.
- [ ] (Low priority) Add a mechanism to create a `DMatrix` object from `arrow` objects (from package "arrow"). Like for data frames, should automatically recognize categorical columns from the categorical arrow type.
  - Note: the idea here is to exploit functions that work directly on arrow format, without converting to base R arrays (which do not support all the arrow types) along the way.
- [x] Add an interface to create `QuantileDMatrix` objects from R, accepting the same kinds of inputs as `DMatrix` (`data.frame`, `matrix`, `dgCMatrix`, `dgRMatrix`, `arrow` if implemented, maybe `float::float32`), and also auto-recognizing categorical features for objects that have them (data frames and arrow tables).
- [x] (Low priority) Add methods to get additional info from `DMatrix` objects that are currently missing from the R package, such as `get_quantile_cut` (guess this is just a call to `XGDMatrixGetQuantileCut`?).
- [x] (Blocker) Move more `DMatrix` parameters that reference data towards `xgb.DMatrix()` function arguments, such as `qid`, `group`, `label_lower_bound`, `label_upper_bound`, etc.
  - Potentially a good reference could be the DMatrix python class.
- [x] Switch the current `DMatrix` creation function for R matrices towards the C function that uses `array_interface`.
- [x] Switch the `predict` method for the current booster to use "inplace predict" or other more efficient `DMatrix` creators when appropriate.
- [x] (Blocker) Remove all the public interface (functions, docs, tests, examples) around the `Booster.handle` class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
- [x] (Blocker) After the task above is done, switch the handle serialization mechanism to ALTREP and remove `xgb.Booster.complete`, which wouldn't be needed anymore.
  - [x] This increases the R requirement to >= 4.3, so it requires modifying the CI jobs to update them all to this version of R and drop the older ones.
- ~(Low priority) Implement serialization for `DMatrix` handles through the same ALTREP system as above.~ This idea was discarded (thread)
- [x] (Blocker) Remove the current `xgboost()` function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
- [ ] (Blocker) After support for `data.frame` and categorical features is added, then create a new `xgboost()` function from scratch that wouldn't share any code base with the current function named like that, ideally working as a higher-level wrapper over `DMatrix` + `xgb.train` but implementing the kind of idiomatic R interface (x/y only, no formula) described in the earlier thread, either with a separate function for the parameters or everything being passed in the main function.
  - It should return objects of a different class than `xgb.train` (perhaps the class could be named "xgboost").
  - This class should have its own `predict` method, again with a different interface than the booster's predict, as described in the first message here.
  - If this class needs to keep additional attributes, perhaps they could be kept as part of the JSON that gets serialized, otherwise should have a note about serialization and transferability with other interfaces.
  - This is probably the largest PR in terms of code (especially tests!!), so might need to be split into different batches. For example, support for custom objectives could be left out from the first PR.
- [ ] (Blocker) After the new `xgboost()` x/y interface gets implemented, then modify other functions to accept these objects - e.g.:
  - Plotting function.
  - Feature importance function.
  - Serialization functions that are aimed at transferring models between interfaces.
  - All of these should keep in mind small details like base-1 indexing for tree numbers and similar.
- [ ] (Blocker) Create examples and vignettes for the new `xgboost()` function.
- [ ] (Low priority) Perhaps create a higher-level cv function for the new `xgboost()` interface.
- [x] Support creation of external memory objects with `DataIter`.
- [x] (Blocker) Enable quantile regression with multiple quantiles.
- [ ] Switch the R package build system to CMake instead of autotools.
- [ ] (Low priority) Distributed training, perhaps integration with RSpark.
- [ ] Documentation and unified tests for 1-based indexing.
- [x] (Blocker) Fix misrendered documentation: https://github.com/dmlc/xgboost/issues/10329
- [ ] (Blocker) Update introductory vignette to reflect current XGBoost capabilities https://github.com/dmlc/xgboost/issues/10746
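
For illustration, here is a minimal base-R sketch of the `data.frame` note in the list above: a data frame is a list of separately allocated column vectors with differing storage types, and `factor` columns are integer codes plus a `levels` attribute. The column names and the 0-based recoding are assumptions made for this example only, not necessarily how the package encodes categories internally.

```r
# A data.frame is a list of equal-length column vectors, each with its own type.
df <- data.frame(
  age    = c(31L, 45L, NA),                    # integer column (32-bit)
  income = c(52000.5, NA, 61000),              # double column (64-bit float)
  city   = factor(c("Oslo", "Lima", "Oslo"))   # categorical column
)

stopifnot(is.list(df))

# Storage type of every column; factors are stored as integers.
col_types <- vapply(df, typeof, character(1))
print(col_types)

# Columns that a constructor could flag as categorical.
is_cat <- vapply(df, is.factor, logical(1))
print(is_cat)

# Factor codes are 1-based in R. Shifting them to 0-based is an assumed
# encoding used here only to illustrate the conversion step.
codes <- as.integer(df$city) - 1L
print(codes)
```

Because each column lives in its own vector, a constructor has to walk the columns one at a time rather than assume one contiguous matrix, which is why missing values also need to be handled per column.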
@trivialfis From the previous thread, you mentioned you might be able to work on categorical feature support - would you be able to take on the first two tasks here?
@dfsnow You mentioned that you were willing to help in the earlier topic - would you be interested in taking on some of the issues here, particularly around DMatrix topics?
@jameslamb Would you be interested in taking on some task such as removing the handle class from the public interface?
@mayer79 Are you familiar with C++ and R's C interface? Would you be able to help with some of these topics?
@david-cortes: fantastic road map, thank you so much. Unfortunately, you have spotted my biggest weakness! For the C part, we might ask the data.table team. For the C++ part, Dirk Eddelbuettel?
Let me handle the primitive support for data frame first. Categorical data can follow.
This is probably going to help with other interfaces as well. We need to have missing data for each column.
With the amount of custom C++ code in the R package, I think we need to set up CI tests with sanitizer for R (hopefully not Valgrind, which is slow).
Another task which doesn't require modifying any C/C++ functions (only .R files): currently, `xgb.cv` will error out with objective `survival:aft`. This is due to the function checking that the DMatrix object has a `label` property, but this objective works instead with `label_lower_bound` and `label_upper_bound`.
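
A rough sketch of the kind of check that would avoid this error, using only base R. The helper `has_usable_labels` and the plain list standing in for the DMatrix fields are hypothetical and not part of the xgboost API; the real fix would query the DMatrix itself.

```r
# Hypothetical helper: `info` is a plain named list standing in for the
# fields stored in a DMatrix. It is not an xgboost function.
has_usable_labels <- function(info, objective) {
  if (identical(objective, "survival:aft")) {
    # AFT uses interval bounds instead of a single label vector.
    # `[[` is used because `$` does partial name matching on lists.
    return(!is.null(info[["label_lower_bound"]]) &&
             !is.null(info[["label_upper_bound"]]))
  }
  !is.null(info[["label"]])
}

aft_info <- list(label_lower_bound = c(1, 2), label_upper_bound = c(3, Inf))
reg_info <- list(label = c(0.5, 1.5))

print(has_usable_labels(aft_info, "survival:aft"))      # bounds present
print(has_usable_labels(reg_info, "reg:squarederror"))  # label present
print(has_usable_labels(aft_info, "reg:squarederror"))  # no plain label
```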
@mayer79 would you be interested in contributing a fix?
Good idea. I even remember this issue from somewhere.
> @jameslamb Would you be interested in taking on some task such as removing the handle class from the public interface?
Yes definitely!
But it will be about 1-2 weeks until I'm able to spend any time on it, as I'm focusing right now on trying to get {lightgbm} 4.x out to CRAN (and keeping {lightgbm} from being archived there 😬 ).
I'm also happy to help with reviews on any PRs here if you want, just @ me.
Since the current master branch now supports multi-quantile regression, I guess it's now time to update the example in the docs where it says
> The feature is only supported using the Python package
... and maybe it'd be worth it to add an equivalent R example, if someone would like to take on this task.
@david-cortes Out of curiosity, do you want to become the CRAN maintainer after having the new interface (regardless of whether the two interfaces coexist)? At the moment, I'm maintaining the CRAN package but only doing the chores rather than any actual development; it would be great if a real expert could take over.
Thanks for the offer, but I'll pass on it as I'm not certain that I will have the time for that sort of work in the future or the ability to address CRAN issues on time.
That being said, if you ever need help with some issue in the R interface in the future, or would like me to review some PR, feel free to tag me there if needed.
Understood, thank you for the great progress on the R package!
Added:
Documentation and unified tests for 1-based indexing.
ref: https://github.com/dmlc/xgboost/pull/9935#issuecomment-1892616474
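
As a small illustration of what such tests would pin down, here is a base-R sketch that maps 1-based tree numbers, as an R user would pass them, to the 0-based indices used by the core library. The helper name `to_zero_based` is hypothetical and not part of the package.

```r
# Hypothetical helper: convert user-facing 1-based tree numbers into
# 0-based indices, rejecting values that are not valid 1-based positions.
to_zero_based <- function(tree_idx) {
  stopifnot(is.numeric(tree_idx), !anyNA(tree_idx), all(tree_idx >= 1))
  as.integer(tree_idx) - 1L
}

print(to_zero_based(c(1, 3)))

# Passing 0 is an error under 1-based indexing.
res <- tryCatch(to_zero_based(0), error = function(e) "error")
print(res)
```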
@david-cortes Hi, out of curiosity, how's everything going?
Fine, thank you. I'll be pausing work for a while, will probably resume later in May.
Good to know, thank you for the update!