[RFC] Making R interface more idiomatic
I notice that a version 2.0 of xgboost is planned, which, among other things, is expected to include support for categorical features in the R interface.
Given that this is a major version release and as such is expected to introduce potentially breaking changes, I think this is a good opportunity to make the R interface more in line with base R and core/popular R modeling packages. Many people (including myself) find the R interface of xgboost to be inconvenient and unidiomatic, but changing the interface for xgboost() from its current state would be a rather big breaking change and would probably break lots of user scripts that depend on xgboost().
In short, xgboost() does not work with the most common data types used in R (data.frame) and does not follow R conventions in terms of e.g. function arguments. For people who are familiar with base R and with other R packages, there are many ways in which the R interface of xgboost could be improved for a better end-user experience, such as:
- Offering an x/y interface as well as a formula interface.
- Accepting data frames as inputs and handling categorical/factor variables from data frames.
- Accepting `factor` variables as "y".
- Accepting non-standard evaluation for column names (e.g. passing the weight variable as a column name without quotes).
- Using base-1 numeration for integers as R does instead of base-0.
- Controlling prediction types through a `type` argument.
- Making the naming of function arguments more consistent with base R and core packages - for example, naming the weights as `weights` instead of `weight`, like base R does.
- Changing default arguments by, for example, not dumping the model to a file in disk by default.
Among many others.
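To make the conventions above concrete, here is a minimal sketch of how they look in base R, using only glm() and predict() from the stats package. It does not call xgboost and is not a proposal for the final API; the data and column names (x1, x2, w, y) are made up for illustration.

```r
# Base R only; no xgboost functions are used here.
set.seed(1)
df <- data.frame(
  x1 = rnorm(100),
  x2 = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  w  = sample(1:3, 100, replace = TRUE)   # illustrative integer case weights
)
df$y <- factor(ifelse(df$x1 + rnorm(100) > 0, "yes", "no"))

# Formula interface on a data frame: the factor predictor (x2) and the
# factor response (y) are handled automatically, and the weights are
# passed as an unquoted column name (non-standard evaluation).
fit <- glm(y ~ x1 + x2, data = df, family = binomial(), weights = w)

# The kind of prediction is selected through a `type` argument.
p_link <- predict(fit, newdata = df, type = "link")      # linear predictor
p_prob <- predict(fit, newdata = df, type = "response")  # probabilities

# Factor levels are encoded as 1-based integers.
codes <- as.integer(df$y)
```

Here the argument is called `weights`, the response is a plain factor, and nothing is written to disk unless the user asks for it.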
Would this project accept big breaking PRs for the R interface (particularly for xgboost() and predict.xgb.Booster()) for the 2.0 release that would make it more similar to base R and other R packages?
I don't think any of the currently active maintainers are big R users, so we welcome input. Could we just build a new interface behind a different namespace until it's ready? I don't think there's a need to replace the old interface right away.
> Would this project accept big breaking PRs for the R interface (particularly for xgboost() and predict.xgb.Booster()) for the 2.0 release that would make it more similar to base R and other R packages?
I would like to welcome these changes. The concern about breaking changes can be handled by running reverse dependency checks.
I suggest keeping xgboost() and predict() as they are and giving the new functions different names, e.g. xgboost2() and predict2(). Too much code would break if the main functions changed.
Otherwise, great work @david-cortes.