fgsea icon indicating copy to clipboard operation
fgsea copied to clipboard

collapsePathways() errors in padj threshold-dependent fashion after successful fgsea() run

Open ajw2329 opened this issue 6 years ago • 7 comments

Hello,

First of all, thanks for writing/maintaining such a fantastic package! I have recently encountered an issue with the collapsePathways() function in fgsea 1.6.0, in which I get the following error after a successful fgsea() run: "Error in FUN(X[[i]], ...) : GSEA statistic is not defined when all genes are selected"

I am running collapsePathways() as described in the vignette (see below) though filtering my input fgseaRes df such that padj < 0.05 rather than 0.01. On some datasets this has worked fine but on others I get the above error. Interestingly, lowering the padj threshold down to 0.01 alleviates the problem. Any help understanding this would be much appreciated!

I've added some code below showing my calls, error messages, and sessionInfo(). The .gmt file was downloaded from MSigDB. I'd also be happy to email a file containing the gene ranking for proper reproducibility if that would be helpful.

Thanks again! Andrew

head(ranked_genes)
   KLHL1   SLC6A5     LBX1    LRP1B    GLRA2   TRIM67 
4021.458 3393.911 3320.055 2859.702 2844.379 2825.953
bp_go <- gmtPathways("/home/andrew/projects/daily_log/09292018/c5.bp.v6.2.symbols.gmt")

fgseaRes <- fgsea(pathways = bp_go, 
                  stats = ranked_genes,
                  minSize=15,
                  maxSize=1000,
                  nperm=100000)

fgsea runs normally as far as I can tell - only warning me about some ties:

Warning message: In fgsea(pathways = bp_go, stats = ranked_genes, minSize = 15, maxSize = 1000, : There are ties in the preranked stats (0.18% of the list). The order of those tied genes will be arbitrary, which may produce unexpected results.

The results are expected for my data and give a number of significant categories:

fgseaRes %>% arrange(pval) %>% select(pathway, pval, padj, ES, NES, nMoreExtreme, size) %>% head()
pathway pval padj ES NES nMoreExtreme size
GO_NEUROGENESIS 1.00E-05 0.006736106 0.7743695 1.315848 0 960
GO_CENTRAL_NERVOUS_SYSTEM_DEVELOPMENT 1.00E-05 0.006736106 0.8157949 1.384007 0 615
GO_REGULATION_OF_NERVOUS_SYSTEM_DEVELOPMENT 1.00E-05 0.006736106 0.7999929 1.355705 0 523
GO_HEAD_DEVELOPMENT 1.01E-05 0.006736106 0.8081064 1.369204 0 513
GO_CELL_CELL_SIGNALING 2.04E-05 0.009308852 0.8186953 1.378237 1 344
GO_SYNAPTIC_SIGNALING 2.08E-05 0.009308852 0.8530924 1.420684 1 229
collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.05], 
                                      bp_go, ranked_genes)

Running collapsePathways() as above, though, gives the below error. Running it with padj < 0.01 instead eliminates the error and returns a result (though still producing the tie warnings).

Error in FUN(X[[i]], ...) : 
  GSEA statistic is not defined when all genes are selected
In addition: Warning messages:
1: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.17% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2],  :
  There are ties in the preranked stats (0.1% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
3: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
4: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2],  :
  There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
5: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.17% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
6: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2],  :
  There are ties in the preranked stats (0.29% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
7: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
8: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
9: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2              fgsea_1.6.0                 Rcpp_0.12.18                DESeq2_1.20.0              
 [5] SummarizedExperiment_1.10.1 DelayedArray_0.6.5          BiocParallel_1.14.2         Biobase_2.40.0             
 [9] GenomicRanges_1.32.6        GenomeInfoDb_1.16.0         IRanges_2.14.11             S4Vectors_0.18.3           
[13] BiocGenerics_0.26.0         tximport_1.8.0              matrixStats_0.54.0          genefilter_1.62.0          
[17] ballgown_2.12.0             forcats_0.3.0               stringr_1.3.1               dplyr_0.7.6                
[21] purrr_0.2.5                 readr_1.1.1                 tidyr_0.8.1                 tibble_1.4.2               
[25] ggplot2_3.0.0               tidyverse_1.2.1            

loaded via a namespace (and not attached):
 [1] nlme_3.1-137             bitops_1.0-6             lubridate_1.7.4          bit64_0.9-7              RColorBrewer_1.1-2      
 [6] httr_1.3.1               tools_3.5.1              backports_1.1.2          R6_2.2.2                 rpart_4.1-13            
[11] Hmisc_4.1-1              DBI_1.0.0                lazyeval_0.2.1           mgcv_1.8-24              colorspace_1.3-2        
[16] nnet_7.3-12              withr_2.1.2              gridExtra_2.3            tidyselect_0.2.4         bit_1.1-14              
[21] compiler_3.5.1           cli_1.0.0                rvest_0.3.2              htmlTable_1.12           xml2_1.2.0              
[26] labeling_0.3             rtracklayer_1.40.6       checkmate_1.8.5          scales_1.0.0             digest_0.6.17           
[31] Rsamtools_1.32.3         foreign_0.8-71           XVector_0.20.0           htmltools_0.3.6          base64enc_0.1-3         
[36] pkgconfig_2.0.2          limma_3.36.3             htmlwidgets_1.2          rlang_0.2.2              readxl_1.1.0            
[41] rstudioapi_0.7           RSQLite_2.1.1            bindr_0.1.1              jsonlite_1.5             acepack_1.4.1           
[46] RCurl_1.95-4.11          magrittr_1.5             GenomeInfoDbData_1.1.0   Formula_1.2-3            Matrix_1.2-14           
[51] munsell_0.5.0            stringi_1.2.4            zlibbioc_1.26.0          plyr_1.8.4               grid_3.5.1              
[56] blob_1.1.1               crayon_1.3.4             lattice_0.20-35          Biostrings_2.48.0        haven_1.1.2             
[61] splines_3.5.1            annotate_1.58.0          hms_0.4.2                locfit_1.5-9.1           knitr_1.20              
[66] pillar_1.3.0             geneplotter_1.58.0       fastmatch_1.1-0          XML_3.98-1.16            glue_1.3.0              
[71] latticeExtra_0.6-28      data.table_1.11.4        modelr_0.1.2             cellranger_1.1.0         gtable_0.2.0            
[76] assertthat_0.2.0         xtable_1.8-3             broom_0.5.0              survival_2.42-6          GenomicAlignments_1.16.0
[81] AnnotationDbi_1.42.1     memoise_1.1.0            cluster_2.0.7-1          sva_3.28.0 

ranked_genes.tsv.gz c5.bp.v6.2.symbols.gmt.gz

ajw2329 avatar Oct 08 '18 23:10 ajw2329

Hi,

Can you, please, attach the data so that I can reproduce the error?

assaron avatar Oct 09 '18 10:10 assaron

Thanks for your reply. I've attached the ranked genes (as a two column TSV with gene in col 1 and score in col 2 ordered by score descending) as well as the exact .gmt file I used

c5.bp.v6.2.symbols.gmt.gz ranked_genes.tsv.gz

.

Thanks again! Andrew

ajw2329 avatar Oct 09 '18 18:10 ajw2329

Thanks, I can reproduce the problem. Quickfix is to remove NA-ranked genes:

ranked_genes <- ranked_genes[!is.na(ranked_genes)]

assaron avatar Oct 10 '18 15:10 assaron

Thanks very much! I'm guessing the reason that lowering the padj threshold to 0.01 worked was because pathways that included genes with NA rank were coincidentally removed ?

ajw2329 avatar Oct 10 '18 19:10 ajw2329

Not sure, haven't dug deeper yet. I guess it's just mishandling of NAs somewhere in the code.

assaron avatar Oct 10 '18 19:10 assaron

Hello all,

I open again this thread because I got the same error in my pipeline. Although it is a different function, probably you can help me. I used the same script in different samples, but I see it in special occasions. Despite the small number of genes, it fits the condition of the argument "minGSSize".

This is the piece of code where I have the error:

>     gseaGO <- gseGO(geneList = foldchanges, 
+                     OrgDb = org.Hs.eg.db, 
+                     ont = 'BP',
+                     nPerm = 1000, 
+                     minGSSize = 20, 
+                     pvalueCutoff = 0.05,
+                     verbose = FALSE) 
Error in FUN(X[[i]], ...) : 
  GSEA statistic is not defined when all genes are selected
In addition: Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize,  :
  We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.

"foldchanges" has no NA:

> foldchanges
    378938     283131       7033       3956       4495        165     353500       4256       3040       7169       6366       2934      10974 
 0.8090422  0.3013649  0.2970719 -0.2551876 -0.2563800 -0.2581390 -0.2674187 -0.2710588 -0.3525822 -0.3597963 -0.3628651 -0.4186843 -0.4382191 
        59      10398       3043       3503       6876       3539       3500       3538       3537       3514       3502 
-0.5312216 -0.6030342 -0.6254000 -0.6547751 -0.6818369 -0.7128064 -0.8510006 -0.8714765 -0.9875951 -1.0781393 -1.1180770 

Here I write the sessionInfo() if it is helpful:


R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS


locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_ES.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doSNOW_1.0.19          snow_0.4-4             iterators_1.0.13       foreach_1.5.1          harmony_0.1.0          Rcpp_1.0.7            
 [7] fossil_0.4.0           shapefiles_0.7         foreign_0.8-81         maps_3.3.0             sp_1.4-5               cowplot_1.1.1         
[13] patchwork_1.1.1        ggpubr_0.4.0           GSEABase_1.52.1        graph_1.68.0           annotate_1.68.0        XML_3.99-0.8          
[19] clusterProfiler_3.18.1 openxlsx_4.2.4         clustree_0.4.3         ggraph_2.0.5           data.table_1.14.2      org.Hs.eg.db_3.12.0   
[25] AnnotationDbi_1.52.0   IRanges_2.24.1         S4Vectors_0.28.1       Biobase_2.50.0         BiocGenerics_0.36.1    formattable_0.2.1     
[31] forcats_0.5.1          stringr_1.4.0          purrr_0.3.4            tidyr_1.1.3            tibble_3.1.6           tidyverse_1.3.1       
[37] readr_2.0.1            dplyr_1.0.7            ggplot2_3.3.5          SeuratObject_4.0.2     Seurat_4.0.4           DOSE_3.16.0           

loaded via a namespace (and not attached):
  [1] utf8_1.2.2            reticulate_1.20       tidyselect_1.1.1      RSQLite_2.2.9         htmlwidgets_1.5.4     grid_4.0.3           
  [7] BiocParallel_1.24.1   Rtsne_0.15            scatterpie_0.1.7      munsell_0.5.0         codetools_0.2-18      ica_1.0-2            
 [13] future_1.22.1         miniUI_0.1.1.1        withr_2.4.2           colorspace_2.0-2      GOSemSim_2.16.1       knitr_1.33           
 [19] rstudioapi_0.13       ROCR_1.0-11           ggsignif_0.6.2        tensor_1.5            listenv_0.8.0         labeling_0.4.2       
 [25] polyclip_1.10-0       bit64_4.0.5           farver_2.1.0          downloader_0.4        parallelly_1.27.0     vctrs_0.3.8          
 [31] generics_0.1.0        xfun_0.25             doParallel_1.0.16     R6_2.5.1              graphlayouts_0.7.1    spatstat.utils_2.2-0 
 [37] cachem_1.0.6          fgsea_1.16.0          assertthat_0.2.1      promises_1.2.0.1      scales_1.1.1          enrichplot_1.10.2    
 [43] gtable_0.3.0          globals_0.14.0        goftest_1.2-2         tidygraph_1.2.0       rlang_0.4.12          splines_4.0.3        
 [49] rstatix_0.7.0         lazyeval_0.2.2        checkmate_2.0.0       spatstat.geom_2.2-2   broom_0.7.9           BiocManager_1.30.16  
 [55] reshape2_1.4.4        abind_1.4-5           modelr_0.1.8          backports_1.2.1       httpuv_1.6.2          qvalue_2.22.0        
 [61] tools_4.0.3           ellipsis_0.3.2        spatstat.core_2.3-0   RColorBrewer_1.1-2    ggridges_0.5.3        plyr_1.8.6           
 [67] rpart_4.1-15          deldir_0.2-10         pbapply_1.4-3         viridis_0.6.1         zoo_1.8-9             haven_2.4.3          
 [73] ggrepel_0.9.1         cluster_2.1.2         fs_1.5.0              magrittr_2.0.1        RSpectra_0.16-0       scattermore_0.7      
 [79] DO.db_2.9             lmtest_0.9-38         reprex_2.0.1          RANN_2.6.1            fitdistrplus_1.1-5    matrixStats_0.61.0   
 [85] hms_1.1.0             mime_0.12             evaluate_0.14         xtable_1.8-4          rio_0.5.27            readxl_1.3.1         
 [91] gridExtra_2.3         compiler_4.0.3        KernSmooth_2.23-20    crayon_1.4.1          shadowtext_0.0.8      htmltools_0.5.2      
 [97] ggfun_0.0.3           mgcv_1.8-36           later_1.3.0           tzdb_0.1.2            lubridate_1.7.10      DBI_1.1.2            
[103] tweenr_1.0.2          dbplyr_2.1.1          MASS_7.3-54           car_3.0-11            Matrix_1.3-4          cli_3.0.1            
[109] igraph_1.2.6          pkgconfig_2.0.3       rvcheck_0.1.8         plotly_4.9.4.1        spatstat.sparse_2.0-0 xml2_1.3.2           
[115] rvest_1.0.1           digest_0.6.29         sctransform_0.3.2     RcppAnnoy_0.0.19      spatstat.data_2.1-0   rmarkdown_2.10       
[121] cellranger_1.1.0      leiden_0.3.9          fastmatch_1.1-3       uwot_0.1.10           curl_4.3.2            shiny_1.6.0          
[127] lifecycle_1.0.0       nlme_3.1-153          jsonlite_1.7.2        carData_3.0-4         limma_3.46.0          viridisLite_0.4.0    
[133] fansi_0.5.0           pillar_1.6.2          lattice_0.20-44       fastmap_1.1.0         httr_1.4.2            survival_3.2-13      
[139] GO.db_3.12.1          glue_1.6.0            zip_2.2.0             png_0.1-7             bit_4.0.4             ggforce_0.3.3        
[145] stringi_1.7.4         blob_1.2.2            memoise_2.0.1         irlba_2.3.3           future.apply_1.8.1   

Thanks a lot!

nsdelablancac avatar Jan 12 '22 10:01 nsdelablancac

@nsbc6 you should put all the expressed genes into fgsea, not just a small subset which you apparently have.

assaron avatar Jan 13 '22 11:01 assaron

I am running into the same error "GSEA statistic is not defined when all genes are selected" when performing the following:

fgseaMultilevel(pathways = pathways, stats = prerank.genes, minSize = 20, maxSize = 500, eps = 0, scoreType = "pos", nPermSimple = 1000)

where pathways is a list of 900 lists of genes between 1 and 500 genes (mostly 500), derived from scRNA clusters. This code is run on many different vectors of prerank.genes and works in most cases. However, in specific cases, I get the above error message. Upon closer inspection, prerank.genes has a reasonable number of genes (e.g. N = 34). This issue seems to be restricted to situations where all prerank.genes are in one of the pathways lists, but not in all cases when this is true. There are no NA-ranked genes. Any suggestion on what formally contributes to this error and how to avoid it best? Thanks a lot!

felixradtke avatar Feb 15 '23 10:02 felixradtke

I wouldn't say 34 is reasonable amount of genes for GSEA. But formally, one of your input pathways contain all of these 34 genes, which crashes GSEA, as we can't calculate an enrichment of 34 genes pathway in 34 genes ranking.

I'll add a code to force maxSize to be length(stats)-1 at max. Then there will be no error and this pathway just won't appear in the result.

assaron avatar Feb 15 '23 17:02 assaron

@assaron ha, fair point - reasonable in the specific context I was referring to. Tested in dev version - works perfectly now on my end. Thank you for the extremely quick fix!

felixradtke avatar Feb 15 '23 23:02 felixradtke