Seurat5 default parameters of FindMarkers generate a huge amount of genes

Open diala-ar opened this issue 1 year ago • 4 comments

Thanks for the great package! I understand that the default parameters of FindMarkers changed in Seurat5, in particular logfc.threshold (from 0.25 to 0.1) and min.pct (from 0.1 to 0.01). I also understand that the pseudocount is now applied at the group level instead of the cell level, which returns much higher fold changes. With the new defaults, FindMarkers generates a huge number of differentially expressed genes (~9k out of 14k), so about 64% of the genes are labelled as differentially expressed, which is very unexpected. If we do not have more than 3 samples per condition to perform a pseudobulk analysis, which parameters do you recommend? Should we go with logfc.threshold = 1 and min.pct = 0.1? Any help is really appreciated! Thanks.

diala-ar, Feb 12 '24 17:02

This seems strange to me. Have you checked any marker genes that you expect to be expressed at a similar level in the same cell populations? Do you expect these samples to be very different technically or biologically?

It might be reasonable to filter more strictly (you could try logfc.threshold = 0.25 to see if this helps, or raise it further as needed).

mhkowalski, Feb 16 '24 21:02

My understanding is that the Seurat FindMarkers() output contains the results for all genes tested for differential expression, not only the significantly differentially expressed genes. So if you specify a cutoff of 0.25, FindMarkers will only test genes with a log2 fold change of at least 0.25, and the results will be stored in the output data frame regardless of significance. You can check this by plotting the distributions of p values and adjusted p values as histograms. To get the significant DEGs, you should filter for genes with a Bonferroni-corrected p value (p_val_adj in the output data frame) less than 0.05. For DEG testing, the Bonferroni correction counts the initial log2 fold change thresholding as a test in itself, so the correction is based on all genes in the dataset.
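As a rough illustration of the filtering step described above, here is a minimal sketch in Python (not Seurat code). It assumes that the adjusted p value is the raw p value multiplied by the total number of genes in the dataset, capped at 1. The gene names, p values, and the helper `bonferroni_adjust` are made up for illustration.

```python
def bonferroni_adjust(p_values, n_total_genes):
    """Bonferroni-adjust raw p values against ALL genes in the dataset,
    not only the genes that passed the fold-change pre-filter (assumption)."""
    return [min(p * n_total_genes, 1.0) for p in p_values]


# Hypothetical raw p values for three tested genes in a 14,000-gene dataset.
tested = {"GeneA": 1e-9, "GeneB": 2e-6, "GeneC": 1e-3}
n_total_genes = 14_000

adjusted = dict(zip(tested, bonferroni_adjust(tested.values(), n_total_genes)))

# Keep only genes whose adjusted p value is below 0.05.
significant = [gene for gene, p_adj in adjusted.items() if p_adj < 0.05]

print(adjusted)
print(significant)
```

In this toy example GeneC is returned in the results table but does not survive the correction, which is the distinction between "tested" and "significant" made above.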

ChristopherStephens21, Feb 22 '24 04:02

Thanks @mhkowalski and @ChristopherStephens21 for your responses. After double-checking with immunologists, they confirmed that 45% of genes, if not more, could be differentially expressed between CD8 naive and effector cells.

@mhkowalski, to answer your question: yes, genes that should be similarly expressed in both groups are.

@ChristopherStephens21, the numbers of DEGs that I mentioned are after filtering out genes with p_val_adj > 0.05.

For general purposes, so that people are not surprised by the higher number of DEGs in Seurat5, I reran the same analysis using Seurat 4.2.2 and compared the number of DEGs identified by Seurat V4.2.2 with that identified by Seurat V5.0.1. I found that Seurat5 generated between 2 and 4.5 times more DEGs than Seurat4.

diala-ar, Feb 29 '24 14:02

@diala-ar I have the same problem between Seurat v4 and Seurat v5. I got only around 100 DEGs in Seurat v4, so I could not proceed to the next analysis step. Now I get around 800 DEGs in Seurat v5. Can I trust the results from Seurat v5? How can I explain this difference? Thank you very much!

Shuresearcher, Apr 20 '24 17:04

As a reminder, FindMarkers outputs all genes that are tested for DE, not just the genes that are DE.

In addition, in Seurat v5 we changed the default pseudo-count to be calculated at the cluster level rather than the cell level (as described here https://satijalab.org/seurat/articles/announcements.html). This will increase the number of genes tested, and the number of DE genes returned.
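To make the effect of the pseudocount change concrete, here is a simplified Python sketch (not Seurat's actual code). It assumes the older convention adds a pseudocount of 1 to each group's mean expression, while the newer convention adds it once to each group's summed expression before dividing by the number of cells. The function names, expression values, and group sizes are hypothetical.

```python
import math


def log2fc_cell_level(mean1, mean2, pseudocount=1.0):
    """Pseudocount added to each group's mean expression (assumed older style)."""
    return math.log2(mean1 + pseudocount) - math.log2(mean2 + pseudocount)


def log2fc_group_level(mean1, n1, mean2, n2, pseudocount=1.0):
    """Pseudocount added once to each group's summed expression, then
    divided by the number of cells in the group (assumed newer style)."""
    group1 = (mean1 * n1 + pseudocount) / n1
    group2 = (mean2 * n2 + pseudocount) / n2
    return math.log2(group1) - math.log2(group2)


# A lowly expressed gene: mean 0.1 in one group, 0.01 in the other,
# with 1,000 cells per group (made-up numbers).
old_fc = log2fc_cell_level(0.1, 0.01)
new_fc = log2fc_group_level(0.1, 1000, 0.01, 1000)

print(f"cell-level pseudocount:  log2FC = {old_fc:.2f}")   # roughly 0.12
print(f"group-level pseudocount: log2FC = {new_fc:.2f}")   # roughly 3.20
```

Under these assumptions the same gene falls below a 0.25 threshold with the cell-level pseudocount but easily clears a 0.1 threshold with the group-level one, which is consistent with more genes being tested and returned.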

The downside of the cluster-level pseudocount is that very lowly expressed genes (which may be detected in only a small number of cells) can have very large log-FC values (e.g. a gene that is expressed at 0.1 vs 0.01 has a 10-fold change). The calculation is correct, but biologists should be wary about the importance of these results. We recommend looking at p-value, logFC, and percent detected difference when deciding which DE gene sets to follow up on.
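One way to combine the three criteria above is sketched below in Python (not Seurat code). The column names `p_val_adj`, `avg_log2FC`, `pct.1`, and `pct.2` follow the FindMarkers output data frame. The helper `triage_markers`, the cutoffs, and the example rows are hypothetical and should be tuned to the dataset.

```python
def triage_markers(rows, max_p_adj=0.05, min_abs_log2fc=1.0, min_pct_diff=0.1):
    """Keep genes that pass all three criteria; the cutoffs are illustrative."""
    kept = []
    for row in rows:
        # Difference in the fraction of cells detecting the gene in each group.
        pct_diff = abs(row["pct.1"] - row["pct.2"])
        if (row["p_val_adj"] < max_p_adj
                and abs(row["avg_log2FC"]) >= min_abs_log2fc
                and pct_diff >= min_pct_diff):
            kept.append(row["gene"])
    return kept


# Made-up results: GeneB has a large fold change but is detected in very
# few cells in either group; GeneC is not significant after adjustment.
rows = [
    {"gene": "GeneA", "p_val_adj": 1e-20, "avg_log2FC": 2.5, "pct.1": 0.80, "pct.2": 0.20},
    {"gene": "GeneB", "p_val_adj": 1e-4, "avg_log2FC": 3.2, "pct.1": 0.03, "pct.2": 0.01},
    {"gene": "GeneC", "p_val_adj": 0.30, "avg_log2FC": 1.5, "pct.1": 0.60, "pct.2": 0.30},
]

print(triage_markers(rows))
```

Here GeneB is dropped despite its large log-FC because it is barely detected in either group, which is the kind of lowly expressed gene the caution above refers to.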

rsatija, Jun 24 '24 19:06