scrublet
Unimodal distributions, log transformation, and homogeneous data sets
Good afternoon,
Thanks for making this package @swolock! It runs great (and fast).
I have a few questions regarding the 'quest for bimodality' of the doublet scores for the simulated doublets, and log transformation of the data. Most of my attempts at adjusting the parameters do not yield a bimodal distribution. The attempts that came closest to bimodal either gave a wildly inaccurate estimated overall doublet rate or a uselessly low detected doublet rate. So as not to bury the lede: I've been getting the best results by ignoring the 'quest for bimodality', setting log_transform=True in scrub_doublets, and manually setting the call_doublets threshold. 'Best results' means that the estimated overall doublet rate is consistent with the expected rate and the detected doublet rate is appreciable. These settings also result in decent overlap with DoubletFinder and DoubletDetection.
Example code of 'best results':
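import scrublet as scr

# Expected doublet rate scales with cell count: 0.8% per 1,000 cells (rows of counts_matrix)
scrub = scr.Scrublet(counts_matrix,
                     expected_doublet_rate=0.008 * counts_matrix.shape[0] / 1000,
                     sim_doublet_ratio=2)
# First pass: simulate doublets, score cells, and let Scrublet pick the threshold
temp_scores, temp_doublets = scrub.scrub_doublets(min_counts=2,
                                                  min_cells=3,
                                                  min_gene_variability_pctl=60,
                                                  n_prin_comps=30,
                                                  log_transform=True,
                                                  mean_center=True,
                                                  normalize_variance=True,
                                                  synthetic_doublet_umi_subsampling=1)
# Second pass: override the automatic threshold with a manual one
temp_doublets = scrub.call_doublets(threshold=0.3)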
The output:
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.53
Detected doublet rate = 4.2%
Estimated detectable doublet fraction = 40.0%
Overall doublet rate:
Expected = 21.0%
Estimated = 10.6%
Elapsed time: 45.1 seconds
Detected doublet rate = 15.6%
Estimated detectable doublet fraction = 80.8%
Overall doublet rate:
Expected = 21.0%
Estimated = 19.3%
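As a sanity check on the two blocks of output above, the numbers are consistent with the estimated overall doublet rate being the detected doublet rate divided by the estimated detectable doublet fraction. That relationship is my reading of how Scrublet reports these values, so treat it as an assumption rather than a statement about the implementation. The function name below is hypothetical, and the inputs are simply copied from the log.

```python
# Values copied from the log output above, converted from percentages to fractions.
runs = {
    "automatic threshold (0.53)": {"detected": 0.042, "detectable": 0.400, "reported": 0.106},
    "manual threshold (0.3)": {"detected": 0.156, "detectable": 0.808, "reported": 0.193},
}


def overall_doublet_rate(detected, detectable):
    """Assumed relationship: overall rate = detected rate / detectable fraction."""
    return detected / detectable


for label, r in runs.items():
    estimate = overall_doublet_rate(r["detected"], r["detectable"])
    # Small differences from the reported value come from rounding in the printed log.
    print(f"{label}: computed {estimate:.1%}, reported {r['reported']:.1%}")
```

Lowering the threshold from 0.53 to 0.3 raises both the detected rate and the detectable fraction, and their ratio moves much closer to the expected 21.0%.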
The resulting histograms for one sample:
And embeddings for one sample:
Initial overlap with DoubletFinder and DoubletDetection for all samples:
My data set is of Drosophila neuronal and glial cells. The data sets have a very high expected doublet rate (~25%, due to overloading) as well as a high level of contamination from ambient/background RNA. As you mentioned in https://github.com/swolock/scrublet/issues/3, there is likely not as much complexity in the data as Scrublet expects/assumes (and the ambient/background RNA does not help).
So, my questions are:
- Do you see an issue with optimizing for the estimated doublet rate and detected doublet rate while ignoring whether the doublet scores of the simulated doublets are unimodal?
- Why is the default to not log transform the data prior to PCA? And do you foresee any issues with setting log_transform=True in scrub_doublets?
- How would you advise changing the default parameters for a much more homogeneous sample?
- Can you comment on how heterogeneous Scrublet expects/assumes cell types to be? And should I not expect to see a bimodal distribution when specifically comparing different neuronal populations?
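To make the first question concrete, here is a minimal sketch of what "optimizing for the estimated doublet rate" could look like as a threshold sweep. It is not Scrublet's own threshold selection. The helper names sweep_thresholds and closest_to_expected are hypothetical, and the sketch assumes the overall rate is the detected rate divided by the detectable fraction. With a real run, the two score arrays would be scrub.doublet_scores_obs_ and scrub.doublet_scores_sim_; here they are toy values so the block runs on its own.

```python
import numpy as np


def sweep_thresholds(scores_obs, scores_sim, thresholds):
    """Return (threshold, detected_rate, detectable_fraction, overall_rate) per threshold."""
    scores_obs = np.asarray(scores_obs, dtype=float)
    scores_sim = np.asarray(scores_sim, dtype=float)
    rows = []
    for t in thresholds:
        detected = float(np.mean(scores_obs > t))    # fraction of observed cells called doublets
        detectable = float(np.mean(scores_sim > t))  # fraction of simulated doublets recovered
        # Assumed relationship; undefined when no simulated doublet passes the threshold.
        overall = detected / detectable if detectable > 0 else float("nan")
        rows.append((float(t), detected, detectable, overall))
    return rows


def closest_to_expected(rows, expected_rate):
    """Pick the row whose overall rate is nearest the expected doublet rate."""
    valid = [r for r in rows if not np.isnan(r[3])]
    return min(valid, key=lambda r: abs(r[3] - expected_rate))


# Toy scores standing in for real Scrublet output (illustration only).
toy_obs = [0.1, 0.2, 0.4, 0.6]
toy_sim = [0.2, 0.5, 0.7, 0.9]
rows = sweep_thresholds(toy_obs, toy_sim, thresholds=[0.3, 0.5, 0.95])
best = closest_to_expected(rows, expected_rate=0.6)
print(f"threshold={best[0]}, detected={best[1]:.2f}, "
      f"detectable={best[2]:.2f}, overall={best[3]:.2f}")
```

A sweep like this matches the expected rate by construction, so agreement between the expected and estimated rates alone does not show that the right cells are being called. That is why the overlap with the other tools matters as an independent check.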
Thanks again for your efforts.
Aaron.