xgboost [R] Remove `reshape` argument in `predict`

ref https://github.com/dmlc/xgboost/issues/9810

By default, the predict method will return outputs in a vector having row-major order. This is quite unintuitive in R, since matrices and arrays follow column-major order, thus it's rather unlikely that a user would want to use the prediction output as-is if it comes as row-major.

This PR changes the default instead to reshape=TRUE, which in the case of multi-column objectives, will return an R matrix with the dimensions that a user would expect.

What's more, if a user were looking to avoid computational inefficiencies by avoiding the data transpose from row-major to column-major, xgboost also provides a strict_shape parameter which has that effect with a more clearly signaled output.

I am nevertheless not sure that the current mix between reshape and strict_shape as mutually exclusive parameters is ideal - perhaps there could be a parameter strict_shape that would determine whether vectors are reshaped to matrices when they are 1D but have no effect for 2D and higher, and a parameter rows_as_first_dim or similar that would control whether the output shape should have rows as the first or last dimension.

May 25 '24 20:05 david-cortes

Maybe we should consolidate them into a single parameter and pick a default?

May 26 '24 06:05 trivialfis

Thing is, they have rather different usages:

A user might want to ensure that the output from predict will always be the same class, so that predictions from a unidimensional objective would still be a matrix but with only one column, in order to pass that to some further framework that would require this logic.
A user might want to make the predictions as fast as possible, for which outputting the data in inverse order of dimensions would be the solution in order to accommodate R's column-major order.

But I'm not sure what should be the ideal arguments to offer to control this sort of functionalities.

May 26 '24 15:05 david-cortes

Hmm, my personal preference is something that aligns with R and is consistent. Performance is great, but not more important than consistency. I don't know what are the best practices established by other frameworks. I would choose the default prediction to return a consistent shape as stated in https://github.com/dmlc/xgboost/blob/master/doc/prediction.rst , in a format that aligns with R (say, column-major). The type is a bit of a headache, maybe we use a matrix when possible, an array when the dimension is higher than 2?

May 27 '24 02:05 trivialfis

The type is a bit of a headache, maybe we use a matrix when possible, an array when the dimension is higher than 2?

The object class for predictions with more than one entry per row should not be an issue, because all 2D arrays in R are both arrays and matrices:

is.matrix(array(dim=c(2,3)))

[1] TRUE

class(matrix(0, nrow=2, ncol=3))

[1] "matrix" "array"

What I'd say is an issue is to return a 2D object as a 1D vector instead of giving it dimensions, which is the current situation in the master branch.

I don't know what are the best practices established by other frameworks.

For the most part, other frameworks (including base R's stats module) behave the same as when passing reshape=TRUE here. Some packages might return the equivalent of strict_shape=TRUE but they are in the minority. I'm not aware of any package that would allow configuring the output shape like this though.

May 27 '24 16:05 david-cortes

I'm not aware of any package that would allow configuring the output shape like this though.

We can remove it as well. Simplicity seems important.

What I'd say is an issue is to return a 2D object as a 1D vector instead of giving it dimensions

If we can avoid doing that with this PR, we can always return a matrix with dims instead of a numeric vector. Thank you for sharing that matrix and array are not that different in R. I think that's good news for us, we can use it consistently and avoid other types.

May 28 '24 07:05 trivialfis

@trivialfis After a more thorough reading, I'm thinking the following could be a better alternative:

Pass the strict_shape parameter from the user to the C-level prediction function instead of always setting it to TRUE, and use the dimensions as they come from it (in reverse order, since they are row-major).
Add an option to output the result as-is from the step above.
If the result has more than one dimension, transpose (or permute dimensions in reverse order if more than 2D) the output as it comes from the C-level function in order to translate that into an R array (column-major) of the same original dimensions.
Potentially, keep a reshape parameter (FALSE by default) to control whether to split multi-class and multi-target into lists or not. Otherwise output the result from the previous step.

@mayer79 what are your thoughts on this?

May 28 '24 19:05 david-cortes

@david-cortes : I would support this, very good idea.

With predcontrib = TRUE or predinteraction = TRUE, I think it is convenient to allow the user to split output dimensions into a list. Maybe even keep this behavior as the default?

May 29 '24 06:05 mayer79

Redid the PR so as to let the C++ core library handle the shape, allowing the user to pass strict_shape and avoid_transpose. The shapes should now always match with the python interface, as long as one doesn't pass avoid_transpose.

As part of the changes, I realized now most of the custom R prediction code is no longer needed so I removed it, alongside with removing the reshape argument. Also added a warning when the user supplies unused prediction parameters under ..., which catched a few tests that were passing additional arguments from previous code that was also removed.

There's one thing that I wasn't very sure about though: there's a function xgb.shap.data which does some custom reshaping inside it. It seems to be an internal-only function, but it appears it is meant to take on objects that could have come from outside XGBoost. Not sure if it needs additional changes after this PR.

Jun 02 '24 19:06 david-cortes

Looks great! Not sure about xgb.shap.data() as well. I don't think it is heavily used though.

One comment: Why do we use .Call(XGSetArrayDimNamesInplace_R, arr, dim_names) instead of a simple

dimnames(arra) <- dim_names

?

According to Hadley's book http://adv-r.had.co.nz/Functions.html:

"(primitive replacement functions don’t have to make copies)"

In our case:

Instead of

if ((predcontrib || predinteraction) && !is.null(colnames(newdata))) {
  cnames <- c(colnames(newdata), "(Intercept)")
  dim_names <- vector(mode = "list", length = length(dim(arr)))
  dim_names[[1L]] <- cnames
  if (predinteraction) dim_names[[2L]] <- cnames
  .Call(XGSetArrayDimNamesInplace_R, arr, dim_names)
}

we could write

if ((predcontrib || predinteraction) && !is.null(colnames(newdata))) {
  rownames(arr) <- cnames <- c(colnames(newdata), "(Intercept)")
  if (predinteraction) 
    colnames(arr) <- cnames
}

Jun 03 '24 12:06 mayer79

@mayer79 That tutorial is outdated.

Jun 03 '24 16:06 david-cortes

@mayer79 That tutorial is outdated.

Here is the new edition:

https://adv-r.hadley.nz/functions.html

According to that text, "replacement functions actually create a modified copy". Which shows that my comment was wrong!

Jun 04 '24 07:06 mayer79

Updated.

Aug 18 '24 08:08 david-cortes