fgsea
fgsea copied to clipboard
collapsePathways() errors in padj threshold-dependent fashion after successful fgsea() run
Hello,
First of all, thanks for writing/maintaining such a fantastic package! I have recently encountered an issue with the collapsePathways() function in fgsea 1.6.0, in which I get the following error after a successful fgsea() run: "Error in FUN(X[[i]], ...) : GSEA statistic is not defined when all genes are selected"
I am running collapsePathways() as described in the vignette (see below) though filtering my input fgseaRes df such that padj < 0.05 rather than 0.01. On some datasets this has worked fine but on others I get the above error. Interestingly, lowering the padj threshold down to 0.01 alleviates the problem. Any help understanding this would be much appreciated!
I've added some code below showing my calls, error messages, and sessionInfo(). The .gmt file was downloaded from MSigDB. I'd also be happy to email a file containing the gene ranking for proper reproducibility if that would be helpful.
Thanks again! Andrew
head(ranked_genes)
KLHL1 SLC6A5 LBX1 LRP1B GLRA2 TRIM67
4021.458 3393.911 3320.055 2859.702 2844.379 2825.953
bp_go <- gmtPathways("/home/andrew/projects/daily_log/09292018/c5.bp.v6.2.symbols.gmt")
fgseaRes <- fgsea(pathways = bp_go,
stats = ranked_genes,
minSize=15,
maxSize=1000,
nperm=100000)
fgsea runs normally as far as I can tell - only warning me about some ties:
Warning message: In fgsea(pathways = bp_go, stats = ranked_genes, minSize = 15, maxSize = 1000, : There are ties in the preranked stats (0.18% of the list). The order of those tied genes will be arbitrary, which may produce unexpected results.
The results are expected for my data and give a number of significant categories:
fgseaRes %>% arrange(pval) %>% select(pathway, pval, padj, ES, NES, nMoreExtreme, size) %>% head()
pathway | pval | padj | ES | NES | nMoreExtreme | size |
---|---|---|---|---|---|---|
GO_NEUROGENESIS | 1.00E-05 | 0.006736106 | 0.7743695 | 1.315848 | 0 | 960 |
GO_CENTRAL_NERVOUS_SYSTEM_DEVELOPMENT | 1.00E-05 | 0.006736106 | 0.8157949 | 1.384007 | 0 | 615 |
GO_REGULATION_OF_NERVOUS_SYSTEM_DEVELOPMENT | 1.00E-05 | 0.006736106 | 0.7999929 | 1.355705 | 0 | 523 |
GO_HEAD_DEVELOPMENT | 1.01E-05 | 0.006736106 | 0.8081064 | 1.369204 | 0 | 513 |
GO_CELL_CELL_SIGNALING | 2.04E-05 | 0.009308852 | 0.8186953 | 1.378237 | 1 | 344 |
GO_SYNAPTIC_SIGNALING | 2.08E-05 | 0.009308852 | 0.8530924 | 1.420684 | 1 | 229 |
collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.05],
bp_go, ranked_genes)
Running collapsePathways() as above, though, gives the below error. Running it with padj < 0.01 instead eliminates the error and returns a result (though still producing the tie warnings).
Error in FUN(X[[i]], ...) :
GSEA statistic is not defined when all genes are selected
In addition: Warning messages:
1: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1], :
There are ties in the preranked stats (0.17% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2], :
There are ties in the preranked stats (0.1% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
3: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1], :
There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
4: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2], :
There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
5: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1], :
There are ties in the preranked stats (0.17% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
6: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2], :
There are ties in the preranked stats (0.29% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
7: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1], :
There are ties in the preranked stats (0.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
8: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1], :
There are ties in the preranked stats (0.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
9: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1], :
There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 fgsea_1.6.0 Rcpp_0.12.18 DESeq2_1.20.0
[5] SummarizedExperiment_1.10.1 DelayedArray_0.6.5 BiocParallel_1.14.2 Biobase_2.40.0
[9] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0 IRanges_2.14.11 S4Vectors_0.18.3
[13] BiocGenerics_0.26.0 tximport_1.8.0 matrixStats_0.54.0 genefilter_1.62.0
[17] ballgown_2.12.0 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6
[21] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
[25] ggplot2_3.0.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] nlme_3.1-137 bitops_1.0-6 lubridate_1.7.4 bit64_0.9-7 RColorBrewer_1.1-2
[6] httr_1.3.1 tools_3.5.1 backports_1.1.2 R6_2.2.2 rpart_4.1-13
[11] Hmisc_4.1-1 DBI_1.0.0 lazyeval_0.2.1 mgcv_1.8-24 colorspace_1.3-2
[16] nnet_7.3-12 withr_2.1.2 gridExtra_2.3 tidyselect_0.2.4 bit_1.1-14
[21] compiler_3.5.1 cli_1.0.0 rvest_0.3.2 htmlTable_1.12 xml2_1.2.0
[26] labeling_0.3 rtracklayer_1.40.6 checkmate_1.8.5 scales_1.0.0 digest_0.6.17
[31] Rsamtools_1.32.3 foreign_0.8-71 XVector_0.20.0 htmltools_0.3.6 base64enc_0.1-3
[36] pkgconfig_2.0.2 limma_3.36.3 htmlwidgets_1.2 rlang_0.2.2 readxl_1.1.0
[41] rstudioapi_0.7 RSQLite_2.1.1 bindr_0.1.1 jsonlite_1.5 acepack_1.4.1
[46] RCurl_1.95-4.11 magrittr_1.5 GenomeInfoDbData_1.1.0 Formula_1.2-3 Matrix_1.2-14
[51] munsell_0.5.0 stringi_1.2.4 zlibbioc_1.26.0 plyr_1.8.4 grid_3.5.1
[56] blob_1.1.1 crayon_1.3.4 lattice_0.20-35 Biostrings_2.48.0 haven_1.1.2
[61] splines_3.5.1 annotate_1.58.0 hms_0.4.2 locfit_1.5-9.1 knitr_1.20
[66] pillar_1.3.0 geneplotter_1.58.0 fastmatch_1.1-0 XML_3.98-1.16 glue_1.3.0
[71] latticeExtra_0.6-28 data.table_1.11.4 modelr_0.1.2 cellranger_1.1.0 gtable_0.2.0
[76] assertthat_0.2.0 xtable_1.8-3 broom_0.5.0 survival_2.42-6 GenomicAlignments_1.16.0
[81] AnnotationDbi_1.42.1 memoise_1.1.0 cluster_2.0.7-1 sva_3.28.0
Hi,
Can you, please, attach the data so that I can reproduce the error?
Thanks for your reply. I've attached the ranked genes (as a two column TSV with gene in col 1 and score in col 2 ordered by score descending) as well as the exact .gmt file I used
c5.bp.v6.2.symbols.gmt.gz ranked_genes.tsv.gz
.
Thanks again! Andrew
Thanks, I can reproduce the problem. Quickfix is to remove NA-ranked genes:
ranked_genes <- ranked_genes[!is.na(ranked_genes)]
Thanks very much! I'm guessing the reason that lowering the padj threshold to 0.01 worked was because pathways that included genes with NA rank were coincidentally removed ?
Not sure, haven't dug deeper yet. I guess it's just mishandling of NAs somewhere in the code.
Hello all,
I open again this thread because I got the same error in my pipeline. Although it is a different function, probably you can help me. I used the same script in different samples, but I see it in special occasions. Despite the small number of genes, it fits the condition of the argument "minGSSize".
This is the piece of code where I have the error:
> gseaGO <- gseGO(geneList = foldchanges,
+ OrgDb = org.Hs.eg.db,
+ ont = 'BP',
+ nPerm = 1000,
+ minGSSize = 20,
+ pvalueCutoff = 0.05,
+ verbose = FALSE)
Error in FUN(X[[i]], ...) :
GSEA statistic is not defined when all genes are selected
In addition: Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize, :
We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, :
You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
"foldchanges" has no NA:
> foldchanges
378938 283131 7033 3956 4495 165 353500 4256 3040 7169 6366 2934 10974
0.8090422 0.3013649 0.2970719 -0.2551876 -0.2563800 -0.2581390 -0.2674187 -0.2710588 -0.3525822 -0.3597963 -0.3628651 -0.4186843 -0.4382191
59 10398 3043 3503 6876 3539 3500 3538 3537 3514 3502
-0.5312216 -0.6030342 -0.6254000 -0.6547751 -0.6818369 -0.7128064 -0.8510006 -0.8714765 -0.9875951 -1.0781393 -1.1180770
Here I write the sessionInfo() if it is helpful:
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_ES.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] doSNOW_1.0.19 snow_0.4-4 iterators_1.0.13 foreach_1.5.1 harmony_0.1.0 Rcpp_1.0.7
[7] fossil_0.4.0 shapefiles_0.7 foreign_0.8-81 maps_3.3.0 sp_1.4-5 cowplot_1.1.1
[13] patchwork_1.1.1 ggpubr_0.4.0 GSEABase_1.52.1 graph_1.68.0 annotate_1.68.0 XML_3.99-0.8
[19] clusterProfiler_3.18.1 openxlsx_4.2.4 clustree_0.4.3 ggraph_2.0.5 data.table_1.14.2 org.Hs.eg.db_3.12.0
[25] AnnotationDbi_1.52.0 IRanges_2.24.1 S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.36.1 formattable_0.2.1
[31] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 tidyr_1.1.3 tibble_3.1.6 tidyverse_1.3.1
[37] readr_2.0.1 dplyr_1.0.7 ggplot2_3.3.5 SeuratObject_4.0.2 Seurat_4.0.4 DOSE_3.16.0
loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.20 tidyselect_1.1.1 RSQLite_2.2.9 htmlwidgets_1.5.4 grid_4.0.3
[7] BiocParallel_1.24.1 Rtsne_0.15 scatterpie_0.1.7 munsell_0.5.0 codetools_0.2-18 ica_1.0-2
[13] future_1.22.1 miniUI_0.1.1.1 withr_2.4.2 colorspace_2.0-2 GOSemSim_2.16.1 knitr_1.33
[19] rstudioapi_0.13 ROCR_1.0-11 ggsignif_0.6.2 tensor_1.5 listenv_0.8.0 labeling_0.4.2
[25] polyclip_1.10-0 bit64_4.0.5 farver_2.1.0 downloader_0.4 parallelly_1.27.0 vctrs_0.3.8
[31] generics_0.1.0 xfun_0.25 doParallel_1.0.16 R6_2.5.1 graphlayouts_0.7.1 spatstat.utils_2.2-0
[37] cachem_1.0.6 fgsea_1.16.0 assertthat_0.2.1 promises_1.2.0.1 scales_1.1.1 enrichplot_1.10.2
[43] gtable_0.3.0 globals_0.14.0 goftest_1.2-2 tidygraph_1.2.0 rlang_0.4.12 splines_4.0.3
[49] rstatix_0.7.0 lazyeval_0.2.2 checkmate_2.0.0 spatstat.geom_2.2-2 broom_0.7.9 BiocManager_1.30.16
[55] reshape2_1.4.4 abind_1.4-5 modelr_0.1.8 backports_1.2.1 httpuv_1.6.2 qvalue_2.22.0
[61] tools_4.0.3 ellipsis_0.3.2 spatstat.core_2.3-0 RColorBrewer_1.1-2 ggridges_0.5.3 plyr_1.8.6
[67] rpart_4.1-15 deldir_0.2-10 pbapply_1.4-3 viridis_0.6.1 zoo_1.8-9 haven_2.4.3
[73] ggrepel_0.9.1 cluster_2.1.2 fs_1.5.0 magrittr_2.0.1 RSpectra_0.16-0 scattermore_0.7
[79] DO.db_2.9 lmtest_0.9-38 reprex_2.0.1 RANN_2.6.1 fitdistrplus_1.1-5 matrixStats_0.61.0
[85] hms_1.1.0 mime_0.12 evaluate_0.14 xtable_1.8-4 rio_0.5.27 readxl_1.3.1
[91] gridExtra_2.3 compiler_4.0.3 KernSmooth_2.23-20 crayon_1.4.1 shadowtext_0.0.8 htmltools_0.5.2
[97] ggfun_0.0.3 mgcv_1.8-36 later_1.3.0 tzdb_0.1.2 lubridate_1.7.10 DBI_1.1.2
[103] tweenr_1.0.2 dbplyr_2.1.1 MASS_7.3-54 car_3.0-11 Matrix_1.3-4 cli_3.0.1
[109] igraph_1.2.6 pkgconfig_2.0.3 rvcheck_0.1.8 plotly_4.9.4.1 spatstat.sparse_2.0-0 xml2_1.3.2
[115] rvest_1.0.1 digest_0.6.29 sctransform_0.3.2 RcppAnnoy_0.0.19 spatstat.data_2.1-0 rmarkdown_2.10
[121] cellranger_1.1.0 leiden_0.3.9 fastmatch_1.1-3 uwot_0.1.10 curl_4.3.2 shiny_1.6.0
[127] lifecycle_1.0.0 nlme_3.1-153 jsonlite_1.7.2 carData_3.0-4 limma_3.46.0 viridisLite_0.4.0
[133] fansi_0.5.0 pillar_1.6.2 lattice_0.20-44 fastmap_1.1.0 httr_1.4.2 survival_3.2-13
[139] GO.db_3.12.1 glue_1.6.0 zip_2.2.0 png_0.1-7 bit_4.0.4 ggforce_0.3.3
[145] stringi_1.7.4 blob_1.2.2 memoise_2.0.1 irlba_2.3.3 future.apply_1.8.1
Thanks a lot!
@nsbc6 you should put all the expressed genes into fgsea, not just a small subset which you apparently have.
I am running into the same error "GSEA statistic is not defined when all genes are selected" when performing the following:
fgseaMultilevel(pathways = pathways, stats = prerank.genes, minSize = 20, maxSize = 500, eps = 0, scoreType = "pos", nPermSimple = 1000)
where pathways
is a list of 900 lists of genes between 1 and 500 genes (mostly 500), derived from scRNA clusters. This code is run on many different vectors of prerank.genes
and works in most cases. However, in specific cases, I get the above error message. Upon closer inspection, prerank.genes
has a reasonable number of genes (e.g. N = 34). This issue seems to be restricted to situations where all prerank.genes are in one of the pathways lists, but not in all cases when this is true. There are no NA-ranked genes. Any suggestion on what formally contributes to this error and how to avoid it best? Thanks a lot!
I wouldn't say 34 is reasonable amount of genes for GSEA. But formally, one of your input pathways contain all of these 34 genes, which crashes GSEA, as we can't calculate an enrichment of 34 genes pathway in 34 genes ranking.
I'll add a code to force maxSize to be length(stats)-1
at max. Then there will be no error and this pathway just won't appear in the result.
@assaron ha, fair point - reasonable in the specific context I was referring to. Tested in dev version - works perfectly now on my end. Thank you for the extremely quick fix!