DIABLO does not account for differently ordered variables in test set

Open Ning-L opened this issue 2 years ago • 2 comments

Hi mixOmics team,

Thank you for your hard work on this great package!

I use the DIABLO pipeline to perform my multi-omics analysis. After building my final model with multiblock sPLS-DA, I want to use it to predict new samples.

When using the predict function, I got this error message: Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'. However, all the variables are present in the new data set.

After checking the source code, I found it was due to the order of variables in one block of my new data list, which is not exactly the same as in the training data set. Once I reordered the variables to match the training data set, everything worked.

I think the key point is just to ensure that all variables from the trained model are present in the new data; the order doesn't matter. So the check using all.equal is not appropriate here: https://github.com/mixOmicsTeam/mixOmics/blob/2b6ab0640cddb9329de723c1d194f05ee4db2a59/R/predict.R#L346

I suggest replacing the if statement with the following code:

if (any(unlist(lapply(seq_along(X), function(i) length(setdiff(colnames(X[[i]]), colnames(newdata[[i]]))) > 0))))
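To illustrate the difference between the two kinds of check, here is a minimal base R sketch on toy data. The objects `X` and `newdata` below are made-up stand-ins, not the objects inside predict(), and the sketch assumes the existing check compares column names position by position, as an all.equal-style comparison does:

```r
# Toy stand-ins for one training block and one reordered test block (hypothetical data)
X <- list(mirna = matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c"))))
newdata <- list(mirna = X$mirna[, c("c", "a", "b")])

# Positional comparison: reports a mismatch although no variable is missing
positional_ok <- isTRUE(all.equal(colnames(X$mirna), colnames(newdata$mirna)))

# Set-based comparison (the condition suggested above):
# TRUE only if some training variable is absent from the new data
any_missing <- any(unlist(lapply(seq_along(X), function(i) {
  length(setdiff(colnames(X[[i]]), colnames(newdata[[i]]))) > 0
})))

print(positional_ok)  # FALSE: the names differ by position
print(any_missing)    # FALSE: every training variable is present
```

With the set-based condition, the error would only be raised when a variable is genuinely missing.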

Best, Lijiao

Ning-L • Mar 16 '22 16:03

Actually, I found that the order of variables does matter here, because a later step scales the new data based on the training data, as follows: https://github.com/mixOmicsTeam/mixOmics/blob/2b6ab0640cddb9329de723c1d194f05ee4db2a59/R/predict.R#L385-L389

So if all variables are present in the new data, just in a different order than in the training set, I think we can just add a step to sort them, such as 32e9ac66b7c657d30a5f276f24478e5193e13b9f
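A small base R sketch of why the order matters for such a scaling step, and how sorting resolves it. This is not the code from predict.R: it uses sweep() as a stand-in for any centring that applies stored training statistics to columns by position, and all object names are hypothetical:

```r
# Hypothetical training block and its per-variable means
X_train <- matrix(c(1, 3, 10, 30, 100, 300), nrow = 2,
                  dimnames = list(NULL, c("a", "b", "c")))
train_means <- colMeans(X_train)  # a = 2, b = 20, c = 200

# New samples with the same variables in a different order
new_block <- X_train[, c("c", "a", "b")]

# sweep() pairs statistics with columns by position, not by name,
# so column "c" is centred with the mean of "a", and so on
centred_wrong <- sweep(new_block, 2, train_means)

# Reorder the columns to the training order first, then centre
new_sorted <- new_block[, colnames(X_train), drop = FALSE]
centred_right <- sweep(new_sorted, 2, train_means)

print(centred_wrong)
print(centred_right)
```

Indexing by `colnames(X_train)` both sorts the columns and fails with a "subscript out of bounds" error if a training variable is absent, so it can only succeed when the presence requirement is met.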

Ning-L • Mar 17 '22 18:03

For consistency, using the template to describe the bug

🐞 Describe the bug:

When using the predict() function on a DIABLO model, if one or more of the test dataframes is supplied with a variable order that differs from that of the equivalent training dataframe, the following error is raised:

Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'

While the order is important for the algorithm, having differing orders should not prevent the method from running.


🔍 reprex results from reproducible example including sessioninfo():

suppressMessages(library(mixOmics))

data(breast.TCGA) # load in the data

# extract data
X.train = list(mirna = breast.TCGA$data.train$mirna,
               mrna = breast.TCGA$data.train$mrna)

X.test = list(mirna = breast.TCGA$data.test$mirna,
              mrna = breast.TCGA$data.test$mrna)

Y.train = breast.TCGA$data.train$subtype

# use optimal values from the case study on mixOmics.org
optimal.ncomp = 2
optimal.keepX = list(mirna = c(10,5),
                     mrna = c(26, 16))

# set design matrix
design = matrix(0.1, ncol = length(X.train), nrow = length(X.train),
                dimnames = list(names(X.train), names(X.train)))
diag(design) = 0

# generate model
final.diablo.model = block.splsda(X = X.train, Y = Y.train, ncomp = optimal.ncomp, # set the optimised DIABLO model
                                  keepX = optimal.keepX, design = design)
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.


# create new test data with one dataframe being reordered
new.var.order = sample(1:dim(X.test$mirna)[2])
X.test.dup <- X.test
X.test.dup$mirna <- X.test.dup$mirna[, new.var.order]

predict.diablo = predict(final.diablo.model, newdata = X.test)

predict.diablo.reordered = predict(final.diablo.model, newdata = X.test.dup)
#> Error in predict.block.spls(final.diablo.model, newdata = X.test.dup): Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'

Created on 2022-03-21 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 Patched (2021-11-16 r81220)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Australia.1252
#>  ctype    English_Australia.1252
#>  tz       Australia/Sydney
#>  date     2022-03-21
#>  pandoc   2.14.2 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.1.3)
#>  BiocParallel   1.28.3  2021-12-09 [1] Bioconductor
#>  cli            3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.1.2)
#>  corpcor        1.6.10  2021-09-16 [1] CRAN (R 4.1.1)
#>  crayon         1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  DBI            1.1.2   2021-12-20 [1] CRAN (R 4.1.3)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr          1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
#>  ellipse        0.4.2   2020-05-27 [1] CRAN (R 4.1.2)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi          1.0.2   2022-01-14 [1] CRAN (R 4.1.2)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics       0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  ggplot2      * 3.3.5   2021-06-25 [1] CRAN (R 4.1.2)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.1.2)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.1.2)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
#>  igraph         1.2.11  2022-01-04 [1] CRAN (R 4.1.2)
#>  knitr          1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lattice      * 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  magrittr       2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
#>  MASS         * 7.3-54  2021-05-03 [2] CRAN (R 4.1.2)
#>  Matrix         1.3-4   2021-06-01 [2] CRAN (R 4.1.2)
#>  matrixStats    0.61.0  2021-09-17 [1] CRAN (R 4.1.2)
#>  mixOmics     * 6.18.1  2021-11-18 [1] Bioconductor (R 4.1.2)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.2)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  plyr           1.8.6   2020-03-03 [1] CRAN (R 4.1.2)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
#>  R.cache        0.15.0  2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3    1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo           1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils        2.11.0  2021-09-26 [1] CRAN (R 4.1.2)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  rARPACK        0.11-0  2016-03-10 [1] CRAN (R 4.1.2)
#>  RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 4.1.1)
#>  Rcpp           1.0.8.2 2022-03-11 [1] CRAN (R 4.1.2)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.1.2)
#>  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown      2.13    2022-03-10 [1] CRAN (R 4.1.3)
#>  RSpectra       0.16-0  2019-12-01 [1] CRAN (R 4.1.2)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  scales         1.1.1   2020-05-11 [1] CRAN (R 4.1.2)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
#>  styler         1.7.0   2022-03-13 [1] CRAN (R 4.1.2)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun           0.30    2022-03-02 [1] CRAN (R 4.1.2)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/Work/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2patched/library
#> 
#> ------------------------------------------------------------------------------

🤔 Expected behavior:

Error should not be raised. predict() function should handle this case and be able to produce predictions.


💡 Possible solution:

Sort the columns of each test dataframe so that the variable order matches that of the corresponding training dataframe.
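Until this is handled inside predict(), the same sorting can be done on the user side before calling it. The helper below is a hypothetical sketch in base R (`align_blocks` is not a mixOmics function), shown on toy matrices standing in for X.train and X.test.dup:

```r
# Hypothetical helper (not part of mixOmics): reorder each test block's columns
# to match the corresponding training block, failing if a variable is missing
align_blocks <- function(train, test) {
  stopifnot(all(names(train) %in% names(test)))
  lapply(stats::setNames(names(train), names(train)), function(b) {
    missing_vars <- setdiff(colnames(train[[b]]), colnames(test[[b]]))
    if (length(missing_vars) > 0) {
      stop("Block '", b, "' is missing variables: ",
           paste(missing_vars, collapse = ", "))
    }
    test[[b]][, colnames(train[[b]]), drop = FALSE]
  })
}

# Toy data standing in for X.train / X.test.dup
train <- list(mirna = matrix(0, 2, 3, dimnames = list(NULL, c("m1", "m2", "m3"))),
              mrna  = matrix(0, 2, 2, dimnames = list(NULL, c("g1", "g2"))))
test <- list(mirna = matrix(1:6, 2, 3, dimnames = list(NULL, c("m3", "m1", "m2"))),
             mrna  = matrix(1:4, 2, 2, dimnames = list(NULL, c("g1", "g2"))))

aligned <- align_blocks(train, test)
print(colnames(aligned$mirna))
```

With the reprex above, the equivalent call would be predict(final.diablo.model, newdata = align_blocks(X.train, X.test.dup)); that call has not been run here, so treat it as a suggestion rather than a verified fix.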

Max-Bladen • Mar 20 '22 23:03