parsnp icon indicating copy to clipboard operation
parsnp copied to clipboard

parsnpAligner.log sequence lengths discrepancies

Open AbigailShockey opened this issue 1 year ago • 3 comments

Hello Parsnp devs,

I noticed that the sequence lengths reported in the parsnpAligner.log file do not match the actual sequence lengths of the fasta files provided to Parsnp. The values are larger than expected. How are these values calculated, and do these inflated sequence lengths affect the cluster coverages reported in the parsnpAligner.log file?

I've attached an example of a log file demonstrating this phenomenon. For this run, I used the -r ! option to randomly choose a reference. Sample1, which has a sequence length of 3987309 bps, was chosen as a reference. The sequence length was correctly reported for Sequence 1 (Sample1.fa.ref, line 3) but incorrectly reported as 4035049 bps for Sequence 6 (Sample1.fa, line 28). The other sequence lengths are also incorrectly reported.

I appreciate any assistance on this,

Best, Abigail

parsnpAligner.log

AbigailShockey avatar Jan 17 '25 21:01 AbigailShockey

@skoren @treangen I'm not quite sure who to tag, so I apologize if this is incorrect. We are also encountering this issue in our pipeline. I was curious if you had any suggestions. Thank you!

evagunawan avatar Feb 27 '25 20:02 evagunawan

Hi @evagunawan and @AbigailShockey,

You are correct that (1) those lengths may be incorrect and that (2) it may affect the cluster coverage calculation. TLDR, the more contigs you have in a query sequence, the more impact this issue will have. This should be a straightforward fix, though, and we should be able to address it in the next release. Thanks for pointing it out to us!

Internally, Parsnp handles multi-contig query sequences by concatenating them, using blocks of Ns to separate each contig. If you have many short contigs at the end, this will lead to a relatively larger artificial increase in the reported "genome size" and therefore an artificial decrease in the "cluster coverage." The reference sequence doesn't need this treatment as the query sequences are matched against it, so that is why you see the correct length for the reference.

bkille avatar Feb 27 '25 21:02 bkille

Ah, thank you so much for responding, that is super helpful to know! Do we have a rough estimate of when the next release will be? I just want to make sure that I get our pipeline updated with the newest version as soon as possible. Thank you so much for looking into this!

evagunawan avatar Feb 27 '25 21:02 evagunawan

Hi @evagunawan,

My apologies for the delay. I'm cutting a new release today with this issue fixed (you can see the referenced commit above for details). It should be on Bioconda sometime in the next few days. Please reopen the issue if you still see problems with the reported sequence lengths.

-Bryce

bkille avatar Apr 29 '25 21:04 bkille