fgsea icon indicating copy to clipboard operation
fgsea copied to clipboard

scRNA-seq: how to create the Ranked list?

Open asmariyaz23 opened this issue 5 years ago • 19 comments

Hello,

I have some clusters with differentially expresses genes generated using Seurat. I also have a highly curated gene list and I wish to perform enrichment analysis using these on each cluster picking the top X genes. I understand the differentially expressed genes in each cluster can be input to your algorithm as a pathway. But I am confused what the ranked list should be? Is it the curated gene list? If yes, could you explain the values to be input into this list?

Thank you, Asma

asmariyaz23 avatar Jul 24 '19 10:07 asmariyaz23

Asma,

I would recommend you to check the documentation about metric in the Broad implementation of GSEA https://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

For many clusters the comparisons to do the metric have to depend of your dataset and your clusters (cell types, cell states). Also in my opinion negative enrichment in in a cell type with scRNAseq data is kinda of meaningless, because this is most likely to be driven by another cell type in the dataset (beside the logical exercise to explain it).

ToledoEM avatar Jul 24 '19 10:07 ToledoEM

Hi @asmariyaz23 ,

In the paper https://www.ahajournals.org/doi/pdf/10.1161/CIRCRESAHA.118.312804 (figure 7) I was checking the consistency of pathway enrichment: bulk RNA-seq vs scRNA-seq. But in my case, differential expression was done one cluster vs other (and not one cluster vs all the other, as it seems in your case) and compared to the same populations in bulk RNA-seq.

image

I tried several techniques to accomplish this, but the most robust was to just use avg_logFC to rank genes. Other approach would be to use -sign(avg_logFC) * log10(p.adj). We once had a conversation with @assaron about which score to use, but we couldn't decide which one is better. I still use plain avg_logFC.

My code was similar to this:

# whole is a Seurat object, I was comparing cluster 4 vs cluster 1
clustering <- whole@ident
cellsIn <- names(clustering[clustering == 4])
cellsOut <- names(clustering[clustering == 1])
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")

ranks <- markers$avg_logFC
names(ranks) <- rownames(markers)

# allPathways is predefined list of signatures
fgseaRes <- fgsea(pathways = allPathways, 
                  stats = ranks,
                  minSize=10,
                  maxSize=500,
                  nperm=1000000)

Important thing in this snippet is to set thresholds to 0 to not artificially exclude genes from the analysis. Running FindMarkers like this (with 0 thresholds) is the closest you can get to classical DE in bulk RNA-seq (however, genes that are not expressed in both groups will be excluded anyway, and this is good).

In case of one vs all other clusters, I must agree with @ToledoEM, negative enrichment won't tell you much information about this particular cluster in most of the cases, but for positive enrichment you can use snippet above.

Cheers, Konstantin

konsolerr avatar Jul 24 '19 11:07 konsolerr

Thank you for your replies. I am a newbie to analysis so excuse my naive questions. I have a curated a disease related gene list based on mutations (purely literature based), and in parallel I ran Seurat on single cell data (of a region related to this disease). Now I want to see if these clusters are enriched for the curated gene list. I have tried the fisher exact test but the odds ratio doesn't look correct, which led me in finding a package which does gene enrichment. My idea was to supply these clusters of DE genes as pathway list to fgsea, but I don't know how I should fit the curated gene list into this equation. Is this something I can do with this package?

asmariyaz23 avatar Jul 24 '19 13:07 asmariyaz23

Reopening, as this question arises regularly

assaron avatar Dec 16 '20 11:12 assaron

Unless you are using -sign(avg_logFC) * log10(p.adj) Instead of markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") Use markers <- FoldChange(...) Then sort by fold change.

Mr-Byun avatar May 22 '23 10:05 Mr-Byun

Why not use the formula for signal-to-noise across single cell transcriptomes from both groups?

Close-your-eyes avatar Aug 22 '23 08:08 Close-your-eyes

@Close-your-eyes I guess you can use that as well if it works for you. Although I would say that logFC can be estimated more directly by the differential expression method.

assaron avatar Aug 22 '23 15:08 assaron

Hello.

What I'm doing is actually comparing post-treatment samples to pre-treatment. I'm separating positive and negative differentially-expressed genes (DEGs) and running fgsea on each. For the DEGs with positive log2FC, I take the GO terms with positive NES. If I want to find out the enriched GO terms in the latter group, should I just take the DEGs that have a negative log2FC, and then flip the ranks' sign to positive? Or should I just keep the negative ranks, but look at the negative enrichment after the GO analysis?

Thanks a lot.

amkhasawneh avatar Oct 10 '23 07:10 amkhasawneh

Hi @asmariyaz23 ,

In the paper https://www.ahajournals.org/doi/pdf/10.1161/CIRCRESAHA.118.312804 (figure 7) I was checking the consistency of pathway enrichment: bulk RNA-seq vs scRNA-seq. But in my case, differential expression was done one cluster vs other (and not one cluster vs all the other, as it seems in your case) and compared to the same populations in bulk RNA-seq.

image

I tried several techniques to accomplish this, but the most robust was to just use avg_logFC to rank genes. Other approach would be to use -sign(avg_logFC) * log10(p.adj). We once had a conversation with @assaron about which score to use, but we couldn't decide which one is better. I still use plain avg_logFC.

My code was similar to this:

# whole is a Seurat object, I was comparing cluster 4 vs cluster 1
clustering <- whole@ident
cellsIn <- names(clustering[clustering == 4])
cellsOut <- names(clustering[clustering == 1])
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")

ranks <- markers$avg_logFC
names(ranks) <- rownames(markers)

# allPathways is predefined list of signatures
fgseaRes <- fgsea(pathways = allPathways, 
                  stats = ranks,
                  minSize=10,
                  maxSize=500,
                  nperm=1000000)

Important thing in this snippet is to set thresholds to 0 to not artificially exclude genes from the analysis. Running FindMarkers like this (with 0 thresholds) is the closest you can get to classical DE in bulk RNA-seq (however, genes that are not expressed in both groups will be excluded anyway, and this is good).

In case of one vs all other clusters, I must agree with @ToledoEM, negative enrichment won't tell you much information about this particular cluster in most of the cases, but for positive enrichment you can use snippet above.

Cheers, Konstantin

Dear all,

By doing markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") you will leave all genes not expressed in any cell of both clusters, and I think this is not what we want.

Don't you think it would be better to put some min.pct, for example the 0.01 (the default in Seurat5) or 0.1 to remove from the analysis genes not expressed in any of the two groups of cells? As far as I know, we should remove these genes before using MAST.

Moreover, if we use markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") all genes with 0 counts in both groups will tie in the ranking. And one more thing, I've detected some strange behaviour in Seurat5 for the log2FC of non-expressed genes:

> fchanges <- FoldChange(seurat_t, ident.1 = "cluster1", ident.2 = "cluster2")
> head(fchanges)
        avg_log2FC pct.1 pct.2
Xkr4    -2.6394103 0.000 0.000
Gm37381 -2.6394103 0.000 0.000
Mrpl15   1.0763732 0.235 0.077
Lypla1   0.7276294 0.185 0.077
Tcea1   -0.1477601 0.383 0.308
Rgs20   -2.6394103 0.000 0.000

As you can see, for Xkr4, for example, it seems to me that the avg_log2FC is wrong, since there is no expression in any of the groups:

> sum(seurat_t_subset[["RNA"]]$data["Xkr4", ])
[1] 0

For the fgsea, if we use the FindMarkers strategy and then the -sign(avg_logFC) * log10(p.adj) method to rank the genes, then I suppose the effect will be minimum since log10(1) is zero, and the whole value will turn to zero:

> markers <- FindMarkers(
        seurat_t, ident.1 = "cluster1", ident.2 = "cluster2",
        test.use = "MAST",
        logfc.threshold = 0, min.pct = 0
    )
> markers["Xkr4", ]
   p_val avg_log2FC pct.1 pct.2 p_val_adj
Xkr4    1    -2.6394103            0    0        1

tmontserrat avatar Jun 28 '24 06:06 tmontserrat