
Tutorial 4 / scRNA Discovery / When to adjust Jaccard matrix p-value cutoff?

Open semenko opened this issue 1 year ago • 1 comments

We have an scRNA-seq dataset in which we've had difficulty discovering ecotypes with EcoTyper at a Jaccard p-value cutoff of 0.05. I'm considering loosening the cutoff. In the documentation for Tutorial 4, y'all wrote:

When the number of samples in the scRNA-seq dataset is small, such as in the current example, we recommend this filter is disabled (p-value cutoff = 1), to avoid over-filtering the jaccard matrix. However, we encourage users to set this cutoff to lower values (e.g. 0.05), if the discovery scRNA-seq dataset contains a number of samples large enough to reliably evaluate the significance of overlaps.
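As a rough illustration of why the quoted passage ties the cutoff to sample count, here is a minimal Python sketch. It assumes each cell state is represented by the set of samples it is assigned to, and that overlap significance is a one-sided hypergeometric test. The function names are hypothetical and this is not EcoTyper's own code, so treat it as an approximation of the idea rather than the actual implementation.

```python
from math import comb


def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of sample IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_pvalue(a: set, b: set, n_samples: int) -> float:
    """One-sided hypergeometric p-value: P(overlap >= observed).

    Assumption: both states are drawn from the same pool of n_samples samples.
    """
    k = len(a & b)            # observed overlap
    K, n = len(a), len(b)     # number of samples carrying each state
    total = comb(n_samples, n)
    tail = sum(
        comb(K, i) * comb(n_samples - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / total


# Two states that co-occur perfectly in half of the samples.
small_a = small_b = {"S1", "S2", "S3"}            # 3 of 6 samples
large_a = large_b = {f"S{i}" for i in range(10)}  # 10 of 20 samples

j_small = jaccard(small_a, small_b)
p_small = overlap_pvalue(small_a, small_b, n_samples=6)
p_large = overlap_pvalue(large_a, large_b, n_samples=20)

print(f"6 samples:  Jaccard={j_small:.2f}, p={p_small:.3g}")
print(f"20 samples: Jaccard={jaccard(large_a, large_b):.2f}, p={p_large:.3g}")
```

Under these assumptions, even a perfect overlap across 6 samples only reaches p = 0.05, so a strict `p < 0.05` filter would discard it, while the same pattern across 20 samples is clearly significant.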

The CRC example dataset is about 14,000 cells x 20,000 genes.

When would you tighten the p-value cutoff to 0.05? (What would you consider a large dataset?)

Our dataset has 40,000 cells x ~30,000 genes, with the following number of cells per cell type:

   2251 B.cells
  11449 CD4.T.cells
    343 CD8.T.cells
   4998 Endothelial.cells
   4246 Epithelial.cells
   6619 Fibroblasts
   4491 Monocytes.and.Macrophages
   2888 NK.cells
    670 PCs

semenko, Aug 03 '22 16:08

Hi,

Thanks for your interest in EcoTyper. Dataset size has two components that are relevant for EcoTyper: the number of cells per cell type and the number of samples the cells come from. The former influences the estimation of cell state representation in a given sample, with larger numbers likely leading to more robust estimates. The latter is probably even more relevant to your question, because it determines the number of data points across which the occurrence patterns are studied when defining ecotypes. Having too few samples can make it harder to find significantly coordinated interactions (p < 0.05). We do not have a strict recommendation for this threshold, but I would guess that a minimum of 15-25 samples could qualify as a large dataset. I hope this helps.
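To check both components on your own data, something like the following Python sketch could be used. The annotation rows, the `summarize` helper, and the `min_samples=15` default are all hypothetical. The default is just the lower end of the rough 15-25 guess above, not an official EcoTyper threshold.

```python
from collections import Counter, defaultdict

# Hypothetical annotation rows: (cell_id, sample_id, cell_type).
annotation = [
    ("c1", "S1", "B.cells"),
    ("c2", "S1", "Fibroblasts"),
    ("c3", "S2", "B.cells"),
    ("c4", "S2", "B.cells"),
    ("c5", "S2", "NK.cells"),
]


def summarize(rows, min_samples=15):
    """Summarize the two size components: samples and cells per cell type."""
    samples = {sample for _, sample, _ in rows}
    cells_per_type = Counter(cell_type for _, _, cell_type in rows)

    # Number of distinct samples in which each cell type is observed.
    samples_per_type = defaultdict(set)
    for _, sample, cell_type in rows:
        samples_per_type[cell_type].add(sample)

    return {
        "n_samples": len(samples),
        "cells_per_type": dict(cells_per_type),
        "samples_per_type": {t: len(s) for t, s in samples_per_type.items()},
        # Rough heuristic only; see the hedged 15-25 sample guess above.
        "enough_samples_for_strict_cutoff": len(samples) >= min_samples,
    }


summary = summarize(annotation)
for key, value in summary.items():
    print(key, value)
```

If `n_samples` falls well below the heuristic, keeping the filter disabled (p-value cutoff = 1) as in Tutorial 4 is probably the safer choice.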

Best, The EcoTyper team

BALuca, Aug 05 '22 21:08