parsnpAligner.log sequence lengths discrepancies
Hello Parsnp devs,
I noticed that the sequence lengths reported in the parsnpAligner.log file do not match the actual sequence lengths of the fasta files provided to Parsnp. The values are larger than expected. How are these values calculated, and do these inflated sequence lengths affect the cluster coverages reported in the parsnpAligner.log file?
I've attached an example of a log file demonstrating this phenomenon. For this run, I used the -r ! option to randomly choose a reference. Sample1, which has a sequence length of 3987309 bps, was chosen as a reference. The sequence length was correctly reported for Sequence 1 (Sample1.fa.ref, line 3) but incorrectly reported as 4035049 bps for Sequence 6 (Sample1.fa, line 28). The other sequence lengths are also incorrectly reported.
I appreciate any assistance on this,
Best, Abigail
@skoren @treangen I'm not quite sure who to tag, so I apologize if this is incorrect. We are also encountering this issue in our pipeline. I was curious if you had any suggestions. Thank you!
Hi @evagunawan and @AbigailShockey,
You are correct that (1) those lengths may be incorrect and that (2) it may affect the cluster coverage calculation. TLDR, the more contigs you have in a query sequence, the more impact this issue will have. This should be a straightforward fix, though, and we should be able to address it in the next release. Thanks for pointing it out to us!
Internally, Parsnp handles multi-contig query sequences by concatenating them, using blocks of Ns to separate each contig. If you have many short contigs at the end, this will lead to a relatively larger artificial increase in the reported "genome size" and therefore an artificial decrease in the "cluster coverage." The reference sequence doesn't need this treatment as the query sequences are matched against it, so that is why you see the correct length for the reference.
Ah, thank you so much for responding, that is super helpful to know! Do we have a rough estimate of when the next release will be? I just want to make sure that I get our pipeline updated with the newest version as soon as possible. Thank you so much for looking into this!
Hi @evagunawan,
My apologies for the delay. I'm cutting a new release today with this issue fixed (you can see the referenced commit above for details). It should be on Bioconda sometime in the next few days. Please reopen the issue if you still see problems with the reported sequence lengths.
-Bryce