Erroneous behaviour when colData is data.frame

Open TuomasBorman opened this issue 1 year ago • 0 comments

Hello,

@catiapacifico opened an issue in our framework, which builds on SummarizedExperiment: https://github.com/microbiome/OMA/issues/202

This is quite a rare case, but I thought it worth letting you know, since the possible solution is relatively easy (if any modification is even needed, since the input of colData should be a DataFrame).

The problem is that when the rownames of colData are in numeric ID format (i.e., "1000", "1001", ...) and the input is a data.frame, these rownames are not checked against the colnames of the assay. This can lead to metadata being assigned to the wrong sample. (The same thing also happens for rowData.)

This behaviour is caused by the DataFrame conversion on lines https://github.com/Bioconductor/SummarizedExperiment/blob/8df97720354bdeecaf947f574dddd09da02d55b2/R/RangedSummarizedExperiment-class.R#L111 and https://github.com/Bioconductor/SummarizedExperiment/blob/8df97720354bdeecaf947f574dddd09da02d55b2/R/RangedSummarizedExperiment-class.R#L138 for colData and rowData, respectively.

When a data.frame is converted into a DataFrame, numeric rownames are dropped:

library(S4Vectors)
# Create data.frames
df <- data.frame(matrix(1:9, nrow = 3))
df2 <- df
# By default the rowname is numeric ID
rownames(df)
[1] "1" "2" "3"
# Add rownames
rownames(df) <- paste0("row", 1:3)
rownames(df2) <- 1:3
# Show rownames
rownames(df)
[1] "row1" "row2" "row3"
rownames(df2)
[1] "1" "2" "3"
# Convert into DataFrame with explicit row.names
DF <- DataFrame(df, row.names = rownames(df))
DF2 <- DataFrame(df2, row.names = rownames(df2))
# Show rownames (rownames are preserved)
rownames(DF)
[1] "row1" "row2" "row3"
rownames(DF2)
[1] "1" "2" "3"
# Convert the original data.frames into DataFrame without row.names
DF <- DataFrame(df)
DF2 <- DataFrame(df2)
# Show rownames (numeric rownames are dropped)
rownames(DF)
[1] "row1" "row2" "row3"
rownames(DF2)
NULL
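
As a possible workaround on the user side, the row names can be passed explicitly during the conversion, as in the transcript above. The following is only a minimal sketch: `as_DataFrame_keep_rownames()` is a hypothetical helper name, not something provided by S4Vectors, and the sample IDs are made up for illustration.

```r
library(S4Vectors)

# Hypothetical helper (not part of S4Vectors): convert a data.frame into a
# DataFrame and pass the row names explicitly so that numeric IDs are kept.
as_DataFrame_keep_rownames <- function(x) {
    stopifnot(is.data.frame(x))
    DataFrame(x, row.names = rownames(x))
}

# Illustrative metadata with numeric sample IDs as rownames
df2 <- data.frame(Treatment = c("X", "X", "Y"), row.names = 1000:1002)

DF2 <- as_DataFrame_keep_rownames(df2)
rownames(DF2)
```

Note that this sketch also turns the default rownames ("1", "2", ...) of a data.frame into explicit rownames, which may not always be wanted.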

Here is an example of what can happen when a SummarizedExperiment is constructed:

library(SummarizedExperiment)
# assay data
counts <- rbind(rep(0, 3), matrix(1:9, nrow = 3))
colnames(counts) <- paste0("sample", 1:3)
# col data
colData <- data.frame(Treatment=c("X", "X", "Y"),row.names=colnames(counts))
colData <- colData[c(2,3,1,4), , drop =  FALSE ]
# col data with numeric rownames, reordered
colData2 <- data.frame(Treatment=c("X", "X", "Y"), row.names = 1:3)
colData2 <- colData2[c(2,3,1), , drop = FALSE ]
# Does not work because the correspondence is checked
se <- SummarizedExperiment(assays=SimpleList(counts=counts), colData=colData)
Error in validObject(.Object) : 
  invalid class “SummarizedExperiment” object: 
    nb of cols in 'assay' (3) must equal nb of rows in 'colData' (4)
# Works, but the sample order is wrong between colData and assay
se <- SummarizedExperiment(assays=SimpleList(counts=counts), colData=colData2)
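
Until the conversion is changed, one defensive option is to align colData with the assay by name before calling the constructor. This is only a sketch under the assumption that the assay colnames and the colData rownames carry the same numeric IDs (as in the linked OMA issue); the IDs below are made up for illustration.

```r
library(SummarizedExperiment)

# Assay whose colnames are numeric IDs (illustrative values)
counts <- matrix(1:9, nrow = 3)
colnames(counts) <- c("1000", "1001", "1002")

# colData as a data.frame with numeric rownames, in a different order
colData2 <- data.frame(Treatment = c("X", "X", "Y"), row.names = 1000:1002)
colData2 <- colData2[c(2, 3, 1), , drop = FALSE]

# Reorder colData so that its rows follow the assay columns
idx <- match(colnames(counts), rownames(colData2))
if (anyNA(idx)) stop("some samples have no entry in colData")
colData2 <- colData2[idx, , drop = FALSE]

# Pass the row names explicitly so that they survive the conversion
colData2 <- DataFrame(colData2, row.names = rownames(colData2))

se <- SummarizedExperiment(assays = SimpleList(counts = counts),
                           colData = colData2)
stopifnot(identical(colnames(se), rownames(colData(se))))
```
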

Session info

R Under development (unstable) (2022-11-24 r83383)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 21

Matrix products: default
BLAS: /opt/R/devel/lib/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Helsinki
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] scater_1.27.2 scuttle_1.9.2 forcats_0.5.2 stringr_1.4.1
[5] dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1
[9] tibble_3.1.8 tidyverse_1.3.2 tidySummarizedExperiment_1.9.2 patchwork_1.1.2
[13] miaViz_1.7.0 ggraph_2.1.0 ggplot2_3.4.0 mia_1.5.17
[17] testthat_3.1.5 MultiAssayExperiment_1.25.1 TreeSummarizedExperiment_2.7.0 Biostrings_2.67.0
[21] XVector_0.39.0 SingleCellExperiment_1.21.0 SummarizedExperiment_1.29.1 Biobase_2.59.0
[25] GenomicRanges_1.51.1 GenomeInfoDb_1.35.5 IRanges_2.33.0 S4Vectors_0.37.0
[29] BiocGenerics_0.45.0 MatrixGenerics_1.11.0 matrixStats_0.63.0

loaded via a namespace (and not attached):
[1] splines_4.3.0 later_1.3.0 bitops_1.0-7 ggplotify_0.1.0 cellranger_1.1.0
[6] polyclip_1.10-4 reprex_2.0.2 DirichletMultinomial_1.41.0 lifecycle_1.0.3 rprojroot_2.0.3
[11] processx_3.8.0 lattice_0.20-45 MASS_7.3-58.1 backports_1.4.1 magrittr_2.0.3
[16] plotly_4.10.1 remotes_2.4.2 httpuv_1.6.6 sessioninfo_1.2.2 pkgbuild_1.3.1
[21] DBI_1.1.3 lubridate_1.9.0 pkgload_1.3.2 zlibbioc_1.45.0 rvest_1.0.3
[26] RCurl_1.98-1.9 yulab.utils_0.0.5 tweenr_2.0.2 GenomeInfoDbData_1.2.9 ggrepel_0.9.2
[31] irlba_2.3.5.1 tidytree_0.4.1 vegan_2.6-4 permute_0.9-7 DelayedMatrixStats_1.21.0
[36] codetools_0.2-18 DelayedArray_0.25.0 xml2_1.3.3 ggforce_0.4.1 tidyselect_1.2.0
[41] aplot_0.1.9 farver_2.1.1 ScaledMatrix_1.7.0 viridis_0.6.2 googledrive_2.0.0
[46] jsonlite_1.8.3 BiocNeighbors_1.17.1 decontam_1.19.0 ellipsis_0.3.2 tidygraph_1.2.2
[51] tools_4.3.0 ggnewscale_0.4.8 treeio_1.23.0 Rcpp_1.0.9 glue_1.6.2
[56] gridExtra_2.3 mgcv_1.8-41 usethis_2.1.6 withr_2.5.0 fastmap_1.1.0
[61] fansi_1.0.3 callr_3.7.3 digest_0.6.30 rsvd_1.0.5 timechange_0.1.1
[66] R6_2.5.1 mime_0.12 gridGraphics_0.5-1 colorspace_2.0-3 RSQLite_2.2.19
[71] googlesheets4_1.0.1 utf8_1.2.2 generics_0.1.3 data.table_1.14.6 DECIPHER_2.27.0
[76] prettyunits_1.1.1 graphlayouts_0.8.4 httr_1.4.4 htmlwidgets_1.5.4 pkgconfig_2.0.3
[81] gtable_0.3.1 blob_1.2.3 brio_1.1.3 htmltools_0.5.3 profvis_0.3.7
[86] scales_1.2.1 ggfun_0.0.9 rstudioapi_0.14 tzdb_0.3.0 reshape2_1.4.4
[91] nlme_3.1-160 cachem_1.0.6 parallel_4.3.0 miniUI_0.1.1.1 vipor_0.4.5
[96] desc_1.4.2 pillar_1.8.1 grid_4.3.0 vctrs_0.5.1 urlchecker_1.0.1
[101] promises_1.2.0.1 BiocSingular_1.15.0 dbplyr_2.2.1 beachmat_2.15.0 xtable_1.8-4
[106] cluster_2.1.4 beeswarm_0.4.0 cli_3.4.1 compiler_4.3.0 rlang_1.0.6
[111] crayon_1.5.2 modelr_0.1.10 ps_1.7.2 plyr_1.8.8 fs_1.5.2
[116] ggbeeswarm_0.6.0 stringi_1.7.8 viridisLite_0.4.1 BiocParallel_1.33.6 assertthat_0.2.1
[121] munsell_0.5.0 lazyeval_0.2.2 devtools_2.4.5 Matrix_1.5-3 hms_1.1.2
[126] sparseMatrixStats_1.11.0 bit64_4.0.5 shiny_1.7.3 haven_2.5.1 gargle_1.2.1
[131] igraph_1.3.5 broom_1.0.1 memoise_2.0.1 ggtree_3.7.1 bit_4.0.5
[136] readxl_1.4.1 ape_5.6-2

-Tuomas

TuomasBorman avatar Nov 29 '22 15:11 TuomasBorman