ecotyper
Tutorial 4 / scRNA Discovery / When to adjust Jaccard matrix p-value cutoff?
We have an scRNA-seq dataset where we've had difficulty discovering ecotypes with EcoTyper when using a Jaccard p-value cutoff of 0.05. I'm considering loosening that cutoff. In the documentation for Tutorial 4, y'all wrote:
When the number of samples in the scRNA-seq dataset is small, such as in the current example, we recommend this filter is disabled (p-value cutoff = 1), to avoid over-filtering the jaccard matrix. However, we encourage users to set this cutoff to lower values (e.g. 0.05), if the discovery scRNA-seq dataset contains a number of samples large enough to reliably evaluate the significance of overlaps.
The CRC example dataset is about 14,000 cells x 20,000 genes.
When would you tighten the p-value cutoff to 0.05? (What would you consider a large dataset?)
In our dataset, we have 40,000 cells x ~30,000 genes, with the following numbers of cells per cell type:
2251 B.cells
11449 CD4.T.cells
343 CD8.T.cells
4998 Endothelial.cells
4246 Epithelial.cells
6619 Fibroblasts
4491 Monocytes.and.Macrophages
2888 NK.cells
670 PCs
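These totals are pooled across samples, so a per-sample breakdown may be more informative. Below is a minimal sketch for tallying it, assuming an annotation table with `CellType` and `Sample` columns. The column names, the toy rows, and the `summarize_annotation` helper are assumptions for illustration, not part of EcoTyper.

```python
from collections import Counter, defaultdict


def summarize_annotation(rows):
    """Count samples overall and cells per cell type per sample.

    `rows` is an iterable of dicts with 'CellType' and 'Sample' keys
    (assumed column names; adjust to match your annotation file).
    """
    samples = set()
    per_type = defaultdict(Counter)  # cell type -> Counter of sample -> n cells
    for row in rows:
        samples.add(row["Sample"])
        per_type[row["CellType"]][row["Sample"]] += 1

    summary = {}
    for cell_type, counts in per_type.items():
        summary[cell_type] = {
            "total_cells": sum(counts.values()),
            "n_samples": len(counts),  # samples with at least one such cell
            "min_cells_per_sample": min(counts.values()),
        }
    return len(samples), summary


# Toy annotation rows (made up purely to show the shape of the input).
rows = [
    {"ID": "c1", "CellType": "B.cells", "Sample": "S1"},
    {"ID": "c2", "CellType": "B.cells", "Sample": "S1"},
    {"ID": "c3", "CellType": "B.cells", "Sample": "S2"},
    {"ID": "c4", "CellType": "PCs", "Sample": "S2"},
]

n_samples, summary = summarize_annotation(rows)
print("samples:", n_samples)
for cell_type, stats in summary.items():
    print(cell_type, stats)
```

In practice the rows could come from `csv.DictReader` over a tab-delimited annotation file.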
Hi,
Thanks for your interest in EcoTyper. Dataset size has two components that are relevant for EcoTyper: the number of cells per cell type and the number of samples the cells come from. The former influences the estimation of cell state representation in a given sample, with larger numbers likely leading to more robust estimates. The latter is probably more relevant to your question, as it determines the number of data points across which occurrence patterns are studied when defining ecotypes. With too few samples, it is harder to find significantly coordinated interactions (p < 0.05). We do not have a strict recommendation for this threshold, but I would guess that a minimum of 15-25 samples could qualify as a large dataset. I hope this helps.
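To illustrate why the number of samples matters, here is a minimal sketch that assumes the significance of an overlap is assessed with a one-sided hypergeometric test on the samples in which two cell states occur. That choice of test is an assumption made for illustration, not necessarily the exact procedure EcoTyper uses, and `overlap_pvalue` and `jaccard` are hypothetical helpers.

```python
from math import comb


def jaccard(a, b, overlap):
    """Jaccard index of two sample sets of sizes a and b sharing `overlap` samples."""
    return overlap / (a + b - overlap)


def overlap_pvalue(n_samples, a, b, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    n_samples: total number of samples in the dataset
    a, b:      number of samples in which state A and state B occur
    overlap:   number of samples in which both occur
    """
    total = comb(n_samples, b)
    upper = min(a, b)
    tail = sum(
        comb(a, k) * comb(n_samples - a, b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Two states that co-occur perfectly (Jaccard index = 1) in half of the samples.
for n in (4, 6, 20):
    half = n // 2
    p = overlap_pvalue(n, half, half, half)
    print(n, jaccard(half, half, half), p)
# n = 4  -> p = 1/6, so even a perfect overlap cannot pass a 0.05 cutoff
# n = 6  -> p = 1/20
# n = 20 -> p = 1/184756
```

Under this assumption, a dataset with only a handful of samples cannot yield a small p-value even when two states always appear together, which is consistent with disabling the filter (cutoff = 1) for small datasets.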
Best, The EcoTyper team