parsnp One nucleotide shift for every SNP Positions

Hello,

I ran parsnp version 2.1.1 and 2.0.6, but both of them found wrong locations for SNP positions. For example, it should be 12730 but these versions found 12731. I made the VCF file from .ggr file by this code: harvesttools -i parsnp.ggr -V parsnp.vcf. I, also, ran parsnp version 1.5.4 and 1.5.3, but although they showed the correct posiotions, they did not make the .mblocks file and I need the SNP matrix for my project.

I appriciate your help.

Best, Mahsa

Dec 10 '24 22:12 DrMahsaRahnama76

Hi,

The issue is although you mentioned "use 0-based start-end coords for .xmfa. Fixes https://github.com/marbl/parsnp/issues/169", still the output of pasnp 2.1.1 is one nucleotide shift for every SNP Position. I tried to use parsnp 2.1.1 for 6 isolates from NCBI ( SRR863393, SRR863392, SRR1264965, SRR1514990, SRR1586566, and SRR1586589) with the reference sequence of NZ_CP016581.1. I got the VCF file from parsnp.ggr and the positions are like 12731 21098 34097 62195 157793 164691 218493 ...

instead of: 12730 21097 34096 62194 70249 157792 164690 218492 ...

In addition, I need the exact positions of bases in parsnp.snps.mblocks file which is used to build parsnp.tree file. The VCF file I got showes 189 positions, while parsnp.snps.mblocks includes 120 bases for each isolate. Can you please tell me how I can have access to the exact positions of bases that are used to build parsnp.tree?

Thank you so much. Best, Mahsa

Dec 12 '24 19:12 DrMahsaRahnama76

Hi @DrMahsaRahnama76 ,

Yes, the fix will apply to the next release of Parsnp (2.1.2). However, this issue unfortunately may be with HarvestTools, which expects the xmfa file to be 0-indexed when it should be 1-indexed. I'll reopen this for now until it is fixed.

As far as the mblocks file goes, the reason for the 120 vs 189 difference is that the SNPs used to build the tree are a filtered subset (for example, no indels). There's currently no way to recover the coordinates from the mblocks file alone.

Best, Bryce

Dec 12 '24 19:12 bkille

Could you update parsnp in bioconda from 2.1.1 to 2.1.2, please?

Jan 08 '25 17:01 fish2022Jul

Hi,

do you have a time window for when this issue with HarvestTools may be adressed and fixed? Unfortunately an upgrade to parsnp 2.1.2 did not fix the issue mentioned above.

Also, the results we gathered from parsnp2 not only differ from the results gathered with parsnp1, but most of the SNPs also get flagged as homoplastic SNPs instead of unique ones. Maybe this issue is connected?

Kind regards, Kadsae

Jan 14 '25 14:01 Kadsae

@fish2022Jul, Parsnp v2.1.2 is on conda now. Please let me know if you run into any issues.

@Kadsae, to clarify, you are seeing off by one errors in v2.1.2 as well? The coordinates in 2.1.2 should be shifted (corrected) relative to other Parsnp 2 releases:

==> vcf-coords-2.1.1/parsnp.vcf <==
#CHROM  POS     ID      REF     ALT
England1.split.fna.ref  926     CTCAAGTTTG.GTTACTTCTG   G       A
England1.split.fna.ref  1518    TGGTTTAATG.CGTTGCGTGA   C       T
England1.split.fna.ref  2395    GTTTGTGTGC.GTTCCTGACT   G       A
England1.split.fna.ref  3136    TCTGGTGCCC.ACGACACGTA   A       G

==> vcf-coords-2.1.2/parsnp.vcf <==
#CHROM  POS     ID      REF     ALT
England1.split.fna.ref  925     CTCAAGTTTG.GTTACTTCTG   G       A
England1.split.fna.ref  1517    TGGTTTAATG.CGTTGCGTGA   C       T
England1.split.fna.ref  2394    GTTTGTGTGC.GTTCCTGACT   G       A
England1.split.fna.ref  3135    TCTGGTGCCC.ACGACACGTA   A       G

Also, I've create a new issue (#171) regarding your second question.

Jan 16 '25 21:01 bkille

Hi,

I ran version 2.1.2, but the problem still exists. Can you please tell me which version of Harvesttools you used? You mentioned in your last comment that the positions are only different and ALT are the same, but in my results, I got the wrong ALT according to their wrong positions. For example, it showed 12731 with ALT G instead of 12730 with ALT A.

Thank you. Best, Mahsa

Jan 17 '25 16:01 DrMahsaRahnama76

Hi @bkille,

thank you for the fast responses and the eagerness to solve my issue(s).

For some more clarification: We have an inhouse database with around 800 bacterial genomes. For testing I ran the parsnp command with standard parameters on this database with parsnp versions 1.6.2, 2.1.1 and 2.1.2.

Command: parsnp -p $THREADS -v --vcf -C 1000 -e -u -d $GENOMES -r $REF -c -o $OUTDIR

For v1.6.2 I got 9719 SNPs reported, in comparison to published SNPs at the same positions. For v2.1.1 I got 5998 SNPs reported, in comparison to published SNPs exactly off by one (+1). For v2.1.2 I got 6300 SNPs reported, in comparison to published SNPs exactly off by one (+1).

During the v2.1.2 run, one partition failed - therefore 50 genomes got excluded from the analysis.

So yes, it seems the nucleotide shift is still there.

Best, Kadsae

Jan 20 '25 12:01 Kadsae

Hi @Kadsae,

This is very strange. Other than some error reporting improvements, the only change between v2.1.1 and v2.1.2 is the use of the ##FormatVersion Mauve header instead of ##FormatVersion Parsnp in the xmfa output file. This is what signals Harvest Tools to treat the xmfa as 0-indexed. I'm surprised to see you run into a partition failure with one version but not another.

With respect to the nucleotide shift issue, I created separate conda envs for v1.6.2, v2.1.1, and v2.1.2 and ran on example data. The coordinates of v2.1.2 match v1.6.2 and are indeed -1 relative to v2.1.1.

>$ tail example-out-v*/*.vcf -n 5 | cut -f1-5
==> example-out-v1.6.2/parsnp.vcf <==
gi|471258596|gb|KC164505.2|     29722   CTAAGGAGCA.TCGTGTGCAA   T       G
gi|471258596|gb|KC164505.2|     29742   GGTACTCAGC.GCACTCGCAC   G       A
gi|471258596|gb|KC164505.2|     29804   TGATTAGTGT.CACTCAAAGT   C       T
gi|471258596|gb|KC164505.2|     29844   TTGTGTTTGG.TAACCCCATC   T       C
gi|471258596|gb|KC164505.2|     29847   TGTTTGGTAA.CCCCATCTCA   C       T

==> example-out-v2.1.1/parsnp.vcf <==
England1.fna.ref        29723   CTAAGGAGCA.TCGTGTGCAA   T       G
England1.fna.ref        29743   GGTACTCAGC.GCACTCGCAC   G       A
England1.fna.ref        29805   TGATTAGTGT.CACTCAAAGT   C       T
England1.fna.ref        29845   TTGTGTTTGG.TAACCCCATC   T       C
England1.fna.ref        29848   TGTTTGGTAA.CCCCATCTCA   C       T

==> example-out-v2.1.2/parsnp.vcf <==
England1.fna.ref        29722   CTAAGGAGCA.TCGTGTGCAA   T       G
England1.fna.ref        29742   GGTACTCAGC.GCACTCGCAC   G       A
England1.fna.ref        29804   TGATTAGTGT.CACTCAAAGT   C       T
England1.fna.ref        29844   TTGTGTTTGG.TAACCCCATC   T       C
England1.fna.ref        29847   TGTTTGGTAA.CCCCATCTCA   C       T

Are you using the version from Bioconda, or a locally built version? My only guess is that if you are using a locally built Parsnp install, the parsnp_core binary hasn't been rebuilt in order to output XMFA files w/ the Mauve header, other than that its not immediately clear what could be causing this discrepancy...

Please feel free to email me if you'd prefer to discuss more offline.

Jan 23 '25 05:01 bkille

Hi @bkille,

I did indeed use a locally built version for my last tests. This time I created new conda envs and used the bioconda versions of parsnp v1.6.2 and parsnp 2.1.2 respectively.

Using the same dataset as last time (800 bacterial genomes), I got the same results as last time. Surprisingly, I did not run into a partition failure this time.

I also did another test and tried to experiment with the parameter partition-size. First, I created a subset of 50 of our 800 bacterial genomes, then I ran v1.6.2 and v2.1.2 - once with the flag -no-partition and once with the flag -min-partition-size set to 2 partitions.

To my surprise, the results from v1.6.2 and v2.1.2 without partitioning match! The run with partitioning again showed SNPs off by +1 to the other runs. So it seems that the error with the nucleotide shifts has something to do with the partitioning of parsnp2 for bigger datasets.

Another interesting aspect is that v1.6.2 reported 2703 SNPs, v2.1.2 w/o partitioning 2786 SNPs and v2.1.2 w/ partitioning 2211 SNPs. You mentioned in #171 your findings showed that parsnp2 should find more SNPs than parsnp1 and without the partitioning rule this statement seems to be correct.

Lastly, I also checked the status of the SNPs across all three runs and for v1.6.2 and v2.1.2 w/o partitioning close to all of the SNPs are unique, while the run of v2.1.2 w/ partitioning again shows the problem reported in #171 (many homoplastic SNPs).

Summarized, I think both problems, this one and #171, are connected and have something to do with the partitioning-flag.

I hope I could help a little. If you want more info or discuss more offline feel free to email me as well.

All the best, Kadsae

Jan 23 '25 09:01 Kadsae

Hi @Kadsae

I finally got around to working on this and wanted to say thank you for the details! You were right that the issue was specific to partitioning. The partitioned XMFA file header still had the outdated #FormatVersion Parsnp v1.1. It has now been updated to #FormatVersion Mauve and the off-by-one errors have been resolved. Please reopen if you still notice any issues and thank you again.

-Bryce

Apr 29 '25 21:04 bkille