
Last record missing in barcode corrected BAM file

Open hukai916 opened this issue 3 years ago • 8 comments

Hi developers,

I encountered a mysterious issue when using Pheniqs for barcode correction.

Basically, when performing the correction, I first prepare a CRAM file containing the barcodes to be corrected and a json file according to the tutorial, then run Pheniqs. However, the resulting BAM file has one record fewer than the CRAM, and the missing one always seems to be the last record in the CRAM.

This happens only occasionally. For example, I ran the same code 6 times and captured one such case (bam3):

$ samtools view -c pbmc_500_10p_2_aa.corrected.bam
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam1
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam2
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam3
449999
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam4
450000
$ samtools view -c pbmc_500_10p_2_aa.corrected.bam5
450000
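The manual check above can be scripted when many repeated runs need to be compared. The following is a minimal sketch, not part of Pheniqs: it assumes `samtools` is on the PATH, and the helper names `count_records` and `find_short_runs` are hypothetical. It treats the largest observed count as the expected one, which is only an assumption; comparing against the count of the input CRAM would be stricter.

```python
import subprocess


def count_records(path):
    """Return the record count printed by `samtools view -c PATH`."""
    result = subprocess.run(
        ["samtools", "view", "-c", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def find_short_runs(counts):
    """Given {filename: record count}, return {filename: records missing}.

    Assumption: the largest count seen is the complete one.
    """
    if not counts:
        return {}
    expected = max(counts.values())
    return {name: expected - n for name, n in counts.items() if n < expected}


if __name__ == "__main__":
    import sys

    # Usage: python check_counts.py out.bam out.bam1 out.bam2 ...
    observed = {path: count_records(path) for path in sys.argv[1:]}
    for name, missing in find_short_runs(observed).items():
        print(f"{name}: {missing} record(s) short")
```

With the six counts listed above, `find_short_runs` would report only the bam3 file, one record short.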

Not a big matter, but still annoying. I can provide more info if you would like to look into it.

Thanks,

--Kai

hukai916 avatar Dec 24 '21 17:12 hukai916

I'll need more details, but is it possible that one read fails decoding and either the filter outgoing qc fail or the filter incoming qc fail option is on?

moonwatcher avatar Dec 25 '21 02:12 moonwatcher

What info do you need? I am attaching my test data and code here: https://www.dropbox.com/sh/6idqunobck72q5j/AAClfpkwhusMVDPNx6HU7wSYa?dl=0

Note that bam2 is missing the last record, whereas bam1 and bam3 are correct. Let me know if you need other info. Thanks!
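To confirm which record is missing, the read names of a complete output and a short output can be compared directly. This is a minimal sketch under stated assumptions: both files are plain-text SAM (BAM or CRAM would first need converting, for example with `samtools view -h`), and the function names `read_names` and `missing_records` are hypothetical.

```python
from collections import Counter


def read_names(sam_path):
    """Return the QNAME of every alignment record in a plain-text SAM file."""
    names = []
    with open(sam_path) as handle:
        for line in handle:
            # Header lines start with "@"; a QNAME never does.
            if line.startswith("@") or not line.strip():
                continue
            names.append(line.rstrip("\n").split("\t", 1)[0])
    return names


def missing_records(reference_path, candidate_path):
    """Return {QNAME: how many records with that name are absent from candidate}.

    Counting occurrences (rather than using a set) keeps paired reads,
    which share a QNAME, from hiding a single missing segment.
    """
    difference = Counter(read_names(reference_path))
    difference.subtract(Counter(read_names(candidate_path)))
    return {name: n for name, n in difference.items() if n > 0}
```

Comparing the keys of `missing_records(good, short)` with `read_names(good)[-1]` shows whether the dropped record really is the last one.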

hukai916 avatar Dec 25 '21 04:12 hukai916

@hukai916 so all bam files in the dropbox folder are outputs from the cram file? Can you please post the json config file you use and which version of pheniqs you are using?

log_decode2.txt seems to be the one showing a missing record: 450000 are reported in incoming and only 449999 are reported in sample. It being the last record does point to some thread synchronization issue, but obviously an extremely rare one, since I have never encountered it. How many cores is this running on?

moonwatcher avatar Dec 25 '21 18:12 moonwatcher

Hi, all bams are created from the same cram using the same pheniqs command. The missing-last-record cases occur occasionally, in roughly 10%-20% of runs, no matter how many cores I use. I will give more info after the holidays.

hukai916 avatar Dec 26 '21 01:12 hukai916

Can you please post pbmc_500_5p_3_aa.json and I will try and replicate it.

L.

moonwatcher avatar Dec 26 '21 01:12 moonwatcher

Hi L., I have uploaded the json file to the same folder. The Pheniqs version I used is 2.1.0. @moonwatcher

--Kai

hukai916 avatar Dec 28 '21 04:12 hukai916

Hello Kai

Just a quick update: I have been trying to reproduce this but have so far been unsuccessful. I tried both with a build of pheniqs 2.1.0 and with head.

A side comment: with an extension of ".bam1", pheniqs does not detect that you actually wanted BAM encoding, so it reverts to the default SAM output. This is why the files you provided are bigger than I expected: they are actually uncompressed, simple text in SAM format. You can override the default output format with --format bam.

Are you observing this on MacOS or Linux? What exact platform? It might also be useful to know which dependencies are installed. Did you build the binary yourself or install it from Conda?

It is possible this is related to one of the dependencies. Since this is so tricky to replicate, would you be willing to test on a custom build? I can provide instructions if you are interested.
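A quick way to check what encoding a file really has, regardless of its extension, is to look at its first bytes. The sketch below is not part of Pheniqs and the name `sniff_format` is hypothetical. It relies on two facts: BAM is stored in BGZF blocks, which begin with the gzip magic bytes `1f 8b`, and CRAM files begin with the ASCII bytes `CRAM`. Anything else is assumed here to be text, which for these outputs would mean SAM.

```python
GZIP_MAGIC = b"\x1f\x8b"  # BGZF (and therefore BAM) starts with the gzip magic
CRAM_MAGIC = b"CRAM"      # CRAM file definition starts with these four bytes


def sniff_format(path):
    """Guess the container of an alignment file from its leading bytes.

    Returns "bgzf" (BAM or other gzip-compatible data), "cram",
    or "text" (assumed to be uncompressed SAM).
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(GZIP_MAGIC):
        return "bgzf"
    if head == CRAM_MAGIC:
        return "cram"
    return "text"
```

For the files discussed here, an output named with a ".bam1" extension that is actually SAM would come back as "text".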

moonwatcher avatar Jan 10 '22 18:01 moonwatcher

Thanks L.,

I installed Pheniqs with conda in an Ubuntu Xenial container (Docker). The container contains solely miniconda, Pheniqs, pysam, and SAMtools. The container can be pulled with:

docker pull hukai916/pheniqs_xenial:0.2

I can perform more tests, pls share instructions. Thanks!

--Kai

hukai916 avatar Jan 11 '22 01:01 hukai916