UMI deduplication is not done for calculating nuclear fraction
In the DropletQC source code, I see that nuclear fraction tags are tallied from all read entries regardless of UMI. In the Velocyto implementation in STARsolo, by contrast, unique molecules are sorted based on how many of their detected reads map to exon and intron regions. Here's the reference code: https://github.com/alexdobin/STAR/blob/b1edc1208d91a53bf40ebae8669f71d50b994851/source/SoloFeature_countVelocyto.cpp#L102
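To make the difference concrete, here is a minimal Python sketch (not the DropletQC or STARsolo code). It assumes each read is a hypothetical `(barcode, umi, region)` tuple with `region` being `"exon"` or `"intron"`, and it uses a simplified rule where a molecule counts as unspliced if any of its reads is intronic. The real Velocyto rules are more involved.

```python
from collections import defaultdict


def read_level_fraction(reads):
    """Nuclear fraction per barcode, counting every read (no UMI collapsing)."""
    counts = defaultdict(lambda: [0, 0])  # barcode -> [intronic reads, all reads]
    for barcode, _umi, region in reads:
        counts[barcode][0] += region == "intron"
        counts[barcode][1] += 1
    return {bc: intronic / total for bc, (intronic, total) in counts.items()}


def umi_level_fraction(reads):
    """Nuclear fraction per barcode after collapsing reads into molecules.

    Assumption (simplified): a molecule is unspliced if any read is intronic.
    """
    molecules = defaultdict(set)  # (barcode, umi) -> regions seen for that molecule
    for barcode, umi, region in reads:
        molecules[(barcode, umi)].add(region)
    counts = defaultdict(lambda: [0, 0])  # barcode -> [unspliced molecules, all molecules]
    for (barcode, _umi), regions in molecules.items():
        counts[barcode][0] += "intron" in regions
        counts[barcode][1] += 1
    return {bc: unspliced / total for bc, (unspliced, total) in counts.items()}


# Toy input: one heavily amplified exonic molecule and one intronic molecule.
toy_reads = [
    ("A", "u1", "exon"),
    ("A", "u1", "exon"),
    ("A", "u1", "exon"),
    ("A", "u2", "intron"),
]
```

On the toy input the two tallies disagree because the exonic molecule is counted three times at the read level. With only a handful of molecules per barcode, the UMI-level fraction can also take only a few distinct values, which would be consistent with the discretized look of low-UMI droplets.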
As a result, nuclear fraction calculated from STARsolo Velocyto output will look like this, discretized at the bottom, as long as I set a low enough UMI threshold (30 in this case):
In contrast, this is nuclear fraction calculated from CellRanger bam tags with nuclear_fraction_tags():
The discretized points from the STARsolo Velocyto output prevent the density estimation algorithm from working as intended.
From our own data (analyzed via STARsolo Velocyto), we find that on the log-log plot, we can draw a straight line to divide the two groups, which incidentally has a perfect slope of -1 and translates to an exponential decay curve on the log-linear plot. Not sure if this generalizes to other datasets, too.
For that dividing curve, xy = constant, where x is the UMI count and y is the nuclear fraction. Since Nuclear Fraction = Unspliced / UMI_sum, that constant is just the unspliced count, so we can instead plot log(Nuclear Fraction) against log(Unspliced), or log(Unspliced) against log(UMI). This is interesting because it suggests that empty droplets are cleanly separated by having fewer than N unspliced counts.
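As a quick sanity check of that equivalence, the sketch below compares the two formulations. It assumes the x axis is the UMI count and the y axis is the nuclear fraction, and the intercept `c` is a hypothetical value, not one fitted to any dataset.

```python
import math


def above_line(umi, unspliced, c):
    """True if the droplet lies above log10(NF) = -log10(UMI) + c."""
    if umi <= 0 or unspliced <= 0:
        return False  # log10 is undefined here; treat as below the line
    nuclear_fraction = unspliced / umi
    return math.log10(nuclear_fraction) > -math.log10(umi) + c


def above_threshold(unspliced, c):
    """Same test written as a plain cutoff on unspliced counts (N = 10**c)."""
    return unspliced > 10 ** c


# Hypothetical intercept: c = 2 corresponds to N = 100 unspliced counts.
examples = [(1000, 150), (5000, 50), (300, 250), (20000, 10)]
agree = all(above_line(u, s, 2) == above_threshold(s, 2) for u, s in examples)
```

Both functions give the same answer for these points, which is just the statement that a slope of -1 on the log-log plot is a fixed cutoff on the unspliced count.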
Update
For most datasets I tried, if I plot log(unspliced) against log(UMI), I see a gap that is not vertical but can still be described by a linear equation.
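One way to apply such a boundary is sketched below. The line is written in the general form a*log10(unspliced) + b*log10(UMI) = c so that it covers a vertical gap as well as a tilted one. The coefficients are hypothetical and would have to be read off the plot for each sample; this is not part of DropletQC.

```python
import math


def is_above_boundary(umi, unspliced, a, b, c):
    """True if a*log10(unspliced) + b*log10(umi) > c.

    a, b and c are hypothetical, hand-picked coefficients for one sample.
    Droplets with zero counts cannot be placed on a log-log plot and are
    treated as falling below the boundary.
    """
    if umi <= 0 or unspliced <= 0:
        return False
    return a * math.log10(unspliced) + b * math.log10(umi) > c


def split_droplets(droplets, a, b, c):
    """Split (barcode, umi, unspliced) tuples into two barcode lists."""
    above, below = [], []
    for barcode, umi, unspliced in droplets:
        target = above if is_above_boundary(umi, unspliced, a, b, c) else below
        target.append(barcode)
    return above, below


# Toy droplets and a made-up tilted boundary (a=1, b=-0.5, c=0.5).
toy = [("cell", 10000, 1000), ("empty", 10000, 100), ("zero", 50, 0)]
above, below = split_droplets(toy, a=1.0, b=-0.5, c=0.5)
```

Setting a=1 and b=0 recovers the plain unspliced-count cutoff, so the slope -1 case is a special case of this form.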
Here is one from 10X13_4 from La Manno et al. 2021. I also draw another line to indicate the upper limit for each UMI level.
Here is one from 10X66_4 from La Manno et al. 2021. In this sample, more red blood cells were captured (leftmost group), as we can see from profiling the curious cluster (Seurat cluster 2). Otherwise, it still shows that viable cells form a group that can be visually separated from empty droplets and red blood cells. I also include the classification result from EmptyDrops for comparison. Generally, EmptyDrops calls cells from the red blood cell region, calls some false positives among the empty droplets visualized with DropletQC, and sometimes fails to detect cells on the right-hand side, depending on the value of the lower parameter, which sets the droplets it uses to estimate the ambient profile to test against.