UMI deduplication is not done for calculating nuclear fraction

Open hsuknowledge opened this issue 7 months ago • 1 comments

In the DropletQC source code, I see that nuclear fraction tags are tallied over all read entries regardless of UMI, so duplicate reads of the same molecule are each counted. The Velocyto implementation in STARsolo, by contrast, works on unique molecules and sorts them based on how many detected reads map to exon and intron regions. Here's the reference code: https://github.com/alexdobin/STAR/blob/b1edc1208d91a53bf40ebae8669f71d50b994851/source/SoloFeature_countVelocyto.cpp#L102
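To make the difference concrete, here is a minimal Python sketch of the two ways of counting. It is not the DropletQC or STARsolo code: the `(barcode, umi, region)` tuples, the `"exon"`/`"intron"` labels, and the rule "a molecule is unspliced if any of its reads is intronic" are simplifying assumptions for illustration only, and the actual STARsolo logic at the link above is more involved.

```python
from collections import defaultdict

def read_level_fraction(reads):
    """Intronic fraction per barcode, counting every read (no UMI deduplication)."""
    intron = defaultdict(int)
    total = defaultdict(int)
    for barcode, _umi, region in reads:
        total[barcode] += 1
        if region == "intron":
            intron[barcode] += 1
    return {b: intron[b] / total[b] for b in total}

def umi_level_fraction(reads):
    """Unspliced fraction per barcode, counting each (barcode, UMI) once.

    Assumption (not STARsolo's exact rule): a molecule counts as unspliced
    if at least one of its reads is intronic.
    """
    regions = defaultdict(set)
    for barcode, umi, region in reads:
        regions[(barcode, umi)].add(region)
    unspliced = defaultdict(int)
    total = defaultdict(int)
    for (barcode, _umi), seen in regions.items():
        total[barcode] += 1
        if "intron" in seen:
            unspliced[barcode] += 1
    return {b: unspliced[b] / total[b] for b in total}

# Toy data: one barcode, three molecules, one of them read three times.
reads = [
    ("AAAC", "U1", "intron"),
    ("AAAC", "U1", "intron"),
    ("AAAC", "U1", "intron"),
    ("AAAC", "U2", "exon"),
    ("AAAC", "U3", "exon"),
]

by_reads = read_level_fraction(reads)  # 3 intronic reads out of 5 reads
by_umis = umi_level_fraction(reads)    # 1 unspliced molecule out of 3 molecules
```

In this toy case the heavily duplicated intronic molecule pulls the read-level value well above the molecule-level one, and the molecule-level value can only be a multiple of 1/3.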

As a result, the nuclear fraction calculated from STARsolo Velocyto output looks like this when I set a low enough UMI threshold (30 in this case). It is discretized at the bottom, because with only a few dozen UMIs per barcode the fraction can take only a limited set of values:

In contrast, this is the nuclear fraction calculated from CellRanger BAM tags with nuclear_fraction_tags():

The discretized points from the STARsolo Velocyto output prevent the density estimation algorithm from working as intended.

Aug 25 '25 09:08 hsuknowledge

From our own data (analyzed via STARsolo Velocyto), we find that on the log-log plot we can draw a straight line that divides the two groups. Incidentally, it has a slope of exactly -1, which translates to an exponential decay curve on the log-linear plot. I am not sure whether this generalizes to other datasets.

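For that dividing curve, xy = constant. Since Nuclear Fraction = Unspliced / UMI_sum, the product Nuclear Fraction × UMI_sum is simply Unspliced, so the dividing curve corresponds to a constant unspliced count N. We can therefore plot log(Nuclear Fraction) against log(Unspliced) instead, or log(Unspliced) against log(UMI), as here. This is interesting because it suggests that empty droplets are cleanly separated as the barcodes with fewer than N unspliced counts.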
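As a quick sanity check of that equivalence, here is a small Python sketch. The threshold value 20 and the toy counts are made up for illustration, not taken from our data, and the log form assumes at least one unspliced count per barcode.

```python
import math

def below_log_log_line(nuclear_fraction, umi_sum, n_threshold):
    """Slope -1 line in log-log space: log10(NF) + log10(UMI) < log10(N)."""
    return math.log10(nuclear_fraction) + math.log10(umi_sum) < math.log10(n_threshold)

def below_unspliced_cutoff(unspliced, n_threshold):
    """The same rule written directly on the unspliced count."""
    return unspliced < n_threshold

N = 20  # hypothetical cutoff, for illustration only

# Toy (unspliced, UMI_sum) pairs, chosen away from the boundary
# to avoid floating-point ties in the log form.
barcodes = [(5, 100), (40, 200), (12, 30), (300, 1000)]

results = []
for unspliced, umi_sum in barcodes:
    nuclear_fraction = unspliced / umi_sum
    results.append((
        below_log_log_line(nuclear_fraction, umi_sum, N),
        below_unspliced_cutoff(unspliced, N),
    ))
```

Both rules give the same answer for every pair, which is just the statement that the slope -1 line is a plain cutoff on the unspliced count.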

Update

For most datasets I tried, when I plot log(unspliced) against log(UMI), I see a gap that is not vertical but can still be expressed as a linear formula.
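A gap like that could be turned into a classification rule along these lines. This is only a sketch: the slope, intercept, and pseudocount below are hypothetical placeholders that would have to be read off the plot for each sample, and none of this is part of DropletQC.

```python
import math

def side_of_line(unspliced, umi_sum, slope, intercept, pseudocount=1.0):
    """Report which side of log10(unspliced) = slope * log10(UMI) + intercept
    a barcode falls on.

    The pseudocount (an assumption) keeps barcodes with zero unspliced
    counts on the plot instead of failing on log10(0).
    """
    x = math.log10(umi_sum + pseudocount)
    y = math.log10(unspliced + pseudocount)
    return "above" if y > slope * x + intercept else "below"

# Hypothetical line parameters and toy barcodes, for illustration only.
SLOPE, INTERCEPT = 0.5, 0.0
droplets = {
    "cell_like": (400, 2000),  # (unspliced, UMI_sum)
    "empty_like": (3, 150),
}

calls = {
    name: side_of_line(unspliced, umi_sum, SLOPE, INTERCEPT)
    for name, (unspliced, umi_sum) in droplets.items()
}
```

With a slope of 0 this reduces to a fixed cutoff on the unspliced count, so the constant-N rule is a special case of the same formula.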

Here is one for sample 10X13_4 from La Manno et al. 2021. I also drew a second line to indicate the upper limit at each UMI level.

Here is one for sample 10X66_4 from La Manno et al. 2021. In this sample, more red blood cells were captured (leftmost group), as we can see from profiling the curious cluster (Seurat cluster 2). Otherwise, it still shows that viable cells form a group that can be visually separated from empty droplets and red blood cells. I also include the EmptyDrops classification result for comparison. Generally, EmptyDrops calls cells in the red blood cell region, calls some false positives among what DropletQC shows as empty droplets, and sometimes fails to detect cells on the right-hand side, depending on the lower parameter it uses to calculate the ambient profile to test against.

Aug 25 '25 15:08 hsuknowledge