General performance in big(ish) data

Open 16mc1r opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe. Using UpSetR I could get plots from big(ish) data, with tables of ~5 million x 600. ComplexUpset yields no result (within the time I was willing to wait). Yes, the dimensions are silly, but I cannot change the structure or complexity of the data I get.

Describe the solution you'd like Without deep knowledge of how the interaction sets are computed, I would suggest a solution based on data.table or matrix permutation, or possibly a way to provide pre-computed interaction matrices or interaction sets.

Describe alternatives you've considered Keep using UpSetR, maybe making my own version which lets me deliver pre-computed matrices to plot.

Context (required) ComplexUpset version: 1.3.1

R version details
$platform
[1] "x86_64-w64-mingw32"

$arch
[1] "x86_64"

$os
[1] "mingw32"

$system
[1] "x86_64, mingw32"

$status
[1] ""

$major
[1] "4"

$minor
[1] "0.5"

$year
[1] "2021"

$month
[1] "03"

$day
[1] "31"

$`svn rev`
[1] "80133"

$language
[1] "R"

$version.string
[1] "R version 4.0.5 (2021-03-31)"

$nickname
[1] "Shake and Throw"
R session information
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] odbc_1.3.2        openxlsx_4.2.4    DBI_1.1.1        
 [4] tidylog_1.0.2     skimr_2.1.3       arrow_5.0.0      
 [7] tarchetypes_0.2.1 targets_0.6.0     data.table_1.14.0
[10] aokaux_0.6.1      here_1.0.1        glue_1.4.2       
[13] cyphr_1.1.2       keyring_1.2.0     ggsci_2.9        
[16] naniar_0.6.1      cowplot_1.1.1     Hmisc_4.5-0      
[19] Formula_1.2-4     survival_3.2-10   lattice_0.20-41  
[22] janitor_2.1.0     kableExtra_1.3.4  knitr_1.33       
[25] datapasta_3.1.0   forcats_0.5.1     stringr_1.4.0    
[28] dplyr_1.0.7       purrr_0.3.4       readr_1.4.0      
[31] tidyr_1.1.3       tibble_3.1.2      ggplot2_3.3.5    
[34] tidyverse_1.3.1   pacman_0.5.1     

loaded via a namespace (and not attached):
 [1] colorspace_2.0-2    ellipsis_0.3.2      visdat_0.5.3       
 [4] rprojroot_2.0.2     snakecase_0.11.0    htmlTable_2.2.1    
 [7] base64enc_0.1-3     fs_1.5.0            rstudioapi_0.13    
[10] farver_2.1.0        bit64_4.0.5         fansi_0.4.2        
[13] lubridate_1.7.10    xml2_1.3.2          codetools_0.2-18   
[16] splines_4.0.5       jsonlite_1.7.2      broom_0.7.8        
[19] cluster_2.1.1       dbplyr_2.1.1        png_0.1-7          
[22] compiler_4.0.5      httr_1.4.2          tictoc_1.0.1       
[25] backports_1.2.1     assertthat_0.2.1    Matrix_1.3-2       
[28] cli_3.0.1           htmltools_0.5.1.1   tools_4.0.5        
[31] igraph_1.2.6        gtable_0.3.0        Rcpp_1.0.7         
[34] cellranger_1.1.0    vctrs_0.3.8         svglite_2.0.0      
[37] xfun_0.22           ps_1.6.0            rvest_1.0.0        
[40] lifecycle_1.0.0     scales_1.1.1        clisymbols_1.2.0   
[43] hms_1.1.0           parallel_4.0.5      sodium_1.1         
[46] RColorBrewer_1.1-2  yaml_2.2.1          gridExtra_2.3      
[49] UpSetR_1.4.0        ggplot2movies_0.0.1 rpart_4.1-15       
[52] latticeExtra_0.6-29 stringi_1.6.2       checkmate_2.0.0    
[55] zip_2.1.1           repr_1.1.3          rlang_0.4.11       
[58] pkgconfig_2.0.3     systemfonts_1.0.1   evaluate_0.14      
[61] patchwork_1.1.1     labeling_0.4.2      htmlwidgets_1.5.3  
[64] bit_4.0.4           tidyselect_1.1.1    processx_3.5.1     
[67] plyr_1.8.6          magrittr_2.0.1      R6_2.5.0           
[70] generics_0.1.0      pillar_1.6.1        haven_2.4.1        
[73] foreign_0.8-81      withr_2.4.2         nnet_7.3-15        
[76] modelr_0.1.8        crayon_1.4.1        utf8_1.2.1         
[79] rmarkdown_2.9       jpeg_0.1-8.1        grid_4.0.5         
[82] readxl_1.3.1        blob_1.2.1          callr_3.7.0        
[85] reprex_2.0.0        digest_0.6.27       webshot_0.5.2      
[88] ComplexUpset_1.3.1  munsell_0.5.0       viridisLite_0.4.0  

16mc1r avatar Aug 11 '21 09:08 16mc1r

Would you like to only plot the bars, or would you want to add more components/annotations? What is the time you are willing to wait? See some recent discussion here: https://github.com/krassowski/complex-upset/issues/133#issuecomment-895155554

krassowski avatar Aug 11 '21 12:08 krassowski

Just the bars would be fine; it's just a way to identify relevant subsets. Time: difficult to say, 2 min max? This is for "interactive" exploration. RAM is usually not a problem: the standard is 128 GB, but up to 500 GB are available if necessary.

16mc1r avatar Aug 11 '21 13:08 16mc1r

I would also like to see a performance improvement for bar charts (upset plots). I can wait hours or even days, provided the calculation fits into my memory (up to 300 GB).

tomleung1996 avatar Aug 13 '21 06:08 tomleung1996

We could have a switch to only compute and plot the summary statistics instead of individual points, which should give you a substantial performance and memory-use improvement. I will look into it next week.

krassowski avatar Aug 13 '21 12:08 krassowski

We could have a switch to only compute and plot the summary statistics instead of individual points, which should give you a substantial performance and memory-use improvement. I will look into it next week.

Hi, Krassowski. Hope you had a nice weekend!

I came up with a possible solution to the performance problem but don't know if it would be too hard to implement.

In the original input file (e.g. movies), users could choose to aggregate the samples into combinations of sets themselves. This can be done by adding an extra column holding the "weight" (the number of members) of each combination. For example:

ID  Set1   Set2   Weight
1   TRUE   FALSE  10
2   TRUE   TRUE   2
3   FALSE  TRUE   5

(Rows are distinct)

In this case, users can have more control of their combinations (as well as memory usage) and ComplexUpset is only responsible for plotting.
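A minimal base-R sketch of the pre-aggregation described above, collapsing a one-row-per-element membership table into distinct combinations with a count. The toy data and the names `members`, `set_cols` and `combos` are hypothetical; only `Set1`, `Set2` and `Weight` come from the example table. ComplexUpset 1.3.1 is not assumed to accept such a weight column.

```r
# Toy membership table: one row per element, one logical column per set.
# (Hypothetical data; real input would have millions of rows.)
members <- data.frame(
  Set1 = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  Set2 = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
)
set_cols <- c("Set1", "Set2")

# Give every row a weight of 1 and sum the weights within each distinct
# combination of set memberships.
combos <- aggregate(
  list(Weight = rep(1L, nrow(members))),
  by = members[set_cols],
  FUN = sum
)

# Order the combinations from largest to smallest.
combos <- combos[order(-combos$Weight), ]
rownames(combos) <- NULL
print(combos)
```

The result has one row per observed combination, so its size depends on the number of distinct intersections rather than on the number of input rows.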

I am not sure if this violates the original behavior of ComplexUpset. Please ignore this comment if it involves a heavy workload. I appreciate the effort you have put into this wonderful project and all your help!

Thank you!

tomleung1996 avatar Aug 16 '21 03:08 tomleung1996

I finally managed to get my desired plots. If you only want to show percentages and can calculate the counts yourself, you can generate a much smaller sample with the same distribution to get the exact same upset plot.
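One way this workaround might look in base R is sketched below. It is not necessarily the code that was actually used. The counts, `target_rows`, and the names `combos`, `scaled` and `small` are hypothetical. The counts are picked so the proportions survive scaling exactly; in general, rounding can shift the percentages slightly.

```r
# Pre-computed counts per combination (hypothetical numbers).
combos <- data.frame(
  Set1   = c(TRUE, TRUE, FALSE),
  Set2   = c(FALSE, TRUE, TRUE),
  Weight = c(3000000, 500000, 1500000)
)
set_cols <- c("Set1", "Set2")
target_rows <- 1000  # assumed size of the reduced table

# Scale each count down so the shares stay the same but the total is small.
scaled <- round(combos$Weight / sum(combos$Weight) * target_rows)

# Repeat each combination row 'scaled' times to build the reduced table.
small <- combos[rep(seq_len(nrow(combos)), times = scaled), set_cols]
rownames(small) <- NULL

print(nrow(small))
print(colMeans(small))  # share of rows belonging to each set

# The reduced table can then be plotted as usual, e.g. (not run here):
# ComplexUpset::upset(small, set_cols)
```

Only relative sizes are preserved, so absolute counts shown on the bars would have to be replaced by percentages or by the pre-computed numbers.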

tomleung1996 avatar Aug 18 '21 02:08 tomleung1996