
Overlap of NUM.SNPS/NUM.READS between different mixed sample results

Open JABioinf opened this issue 3 years ago • 2 comments

Thank you for developing this tool! I've run both demuxlet and freemuxlet (using a 1000g-based common variant VCF as suggested, including the optimizations suggested by https://github.com/aertslab/popscle_helper_tools). I've applied them to different mixed samples (3-genotype mixes), each with a different combination of genotypes (but from the same type of tissue). The demuxlet and freemuxlet results are often consistent with each other, which makes me think that the implementation works. But I noticed that the combinations of NUM.SNPS and NUM.READS per barcode tend to overlap even between completely unrelated samples and popscle runs, which I would not expect.

For instance, when intersecting the freemuxlet results of 2 samples of 17k cells and 15k cells, I observe more than 3k cells from the 1st sample with identical values for NUM.SNPS and NUM.READS in the second sample. This includes barcodes with more than a thousand SNPs considered. Here are the values found in both samples for the barcodes with the highest number of reads considered:

NUM.SNPS  NUM.READS
1618      1797
1626      1765
1555      1764
1256      1455
1247      1419
1166      1329
1195      1289
1192      1280
1244      1258
..

In my opinion, this looks unlikely to happen by chance between unrelated samples.
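One way to make this comparison more precise is to separate two cases: cells whose (NUM.SNPS, NUM.READS) pair merely occurs somewhere in the other sample, and cells where the same barcode carries the same pair in both samples. Below is a minimal Python sketch of that check. It assumes tab-separated result tables with BARCODE, NUM.SNPS and NUM.READS columns; the function names and the `min_reads` cutoff are hypothetical and only serve the illustration.

```python
import csv
import gzip


def load_counts(path):
    """Map each barcode to its (NUM.SNPS, NUM.READS) pair.

    Assumes a tab-separated file (optionally gzipped) with
    BARCODE, NUM.SNPS and NUM.READS columns.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {
            row["BARCODE"]: (int(row["NUM.SNPS"]), int(row["NUM.READS"]))
            for row in reader
        }


def compare_counts(a, b, min_reads=0):
    """Compare two {barcode: (num_snps, num_reads)} mappings.

    Only cells of `a` with at least `min_reads` reads are considered,
    since low counts are the most likely to collide by chance.
    """
    a = {bc: pair for bc, pair in a.items() if pair[1] >= min_reads}
    pairs_in_b = set(b.values())
    return {
        # Cells of sample A that pass the read cutoff.
        "cells_in_a": len(a),
        # Barcode sequences present in both samples.
        "shared_barcodes": len(a.keys() & b.keys()),
        # Cells of A whose pair occurs for any barcode of B.
        "pair_found_in_b": sum(pair in pairs_in_b for pair in a.values()),
        # Cells of A whose pair occurs in B under the same barcode.
        "same_barcode_same_pair": sum(b.get(bc) == pair for bc, pair in a.items()),
    }
```

The two tables would be loaded with `load_counts` from the per-sample freemuxlet result files and passed to `compare_counts`, for example with `min_reads=1000`. If most high-count matches fall under `same_barcode_same_pair`, both runs likely processed the same reads for those barcodes; if they only appear under `pair_found_in_b`, the matches involve different barcodes.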

I am still investigating this observation to identify its cause, and whether it could come from the way I run your software. In the meantime:

  • Have you observed this phenomenon? Is it inherent to the distribution of reads in droplets that the result distributions overlap by chance?
  • If it's not expected, do you have suggestions on how to identify the source of this effect?
  • Can any of the demuxlet/freemuxlet outputs help me determine why droplets carry the same information in unrelated samples?
  • Could this be a version issue with popscle?
  • Or could it be because I filter the VCF and BAM beforehand, which limits the number of reads and SNPs investigated?

Thank you for your help (or for help from anyone else who has an explanation for this).

JABioinf, Nov 08 '22 01:11

I'm not sure why that would happen. What options did you use? Does it correctly use UB and CB tags?
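A quick way to check the tags is to count how many alignment records carry CB and UB. The sketch below is only an illustration: it assumes the records are available as SAM text lines (for example from `samtools view`), and `tag_summary` is a hypothetical helper, not part of popscle.

```python
from collections import Counter


def tag_summary(sam_lines):
    """Count SAM records that carry the CB and UB optional tags."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional fields follow the 11 mandatory SAM columns and
        # have the form TAG:TYPE:VALUE, so the tag is the first 2 characters.
        tags = {field[:2] for field in fields[11:]}
        counts["reads"] += 1
        counts["with_CB"] += "CB" in tags
        counts["with_UB"] += "UB" in tags
        counts["with_both"] += {"CB", "UB"} <= tags
    return counts
```

Calling `tag_summary(sys.stdin)` on a sample of records from the filtered BAM shows what fraction of reads have both tags. If `with_both` is far below `reads`, the pileup may be working with fewer usable reads than expected.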

hyunminkang, Nov 11 '22 13:11

Thanks for your reply. I've used the default options, with the following command:

    popscle dsc-pileup --sam $bamloc --vcf $vcfloc --group-list $barcodeloc --out ${sample}.demux.pileup

with barcodeloc=outs/filtered_feature_bc_matrix/barcodes.tsv.gz taken directly from the Cellranger output (10xGenomics sample) and a filtered BAM file containing only reads that overlap VCF positions and carry a cell barcode. Then:

    popscle demuxlet --plp Demuxlet/pileupfiles/${sample}.demux.pileup --vcf $vcfloc --field PL --out ${sample}_demuxlet

or, for freemuxlet:

    popscle freemuxlet --plp Freemuxlet/pileupfiles/${sample}.pileup --nsample 3 --out ${sample}_freemuxlet

Do you recommend changing any of the default parameters?

So far, I have confirmed the overall accuracy of the demuxlet and freemuxlet deconvolution using single-genotype samples and an artificial mixture of them (merging the FASTQ files in CellRanger), which suggests that the implementation and results are correct, but I'd still like to understand this observation. Best,

JABioinf, Nov 16 '22 16:11