Different pairwise SNP distances from snippy-core depending on inputs

Open nthomson50 opened this issue 3 years ago • 1 comment

Hi,

I am getting strange results from snippy-core. I ran contigs from 100 isolates through snippy-multi (so all against the same reference) and then generated a distance matrix from the core.aln file using snp-dists. So far so good. However, I then wanted to construct a phylogenomic tree, so I ran snippy individually on some extra reference strains and outgroups, with the same reference as before. I then ran everything through snippy-core again and found that it gave me different pairwise SNP distances from the previous run, so I tried some other combinations.

As an example:

  1. In my original run, there were 124 SNPs between Sample021 and Sample043.
  2. When I added the reference strains and outgroups, there were no SNPs between the same two strains.
  3. When I repeated run 1 but changed the --ref from 'Sample001/ref.fa' to my original ref.gbk file, the SNP distance between Sample021 and Sample043 was 25.
  4. When I used only four of my isolates, including 021 and 043, the number of SNPs was 119.

I know that running different groups of samples will change what is defined as 'core', but why do I get different results when using a subset of the original group, and when using the original ref.gbk rather than snippy's .fa version of that file?

nthomson50 • Feb 05 '21 23:02

@nthomson50 running snippy with different sets of samples with the same reference will very likely change the core, and therefore the number of SNPs (as you suggest).

ref.fa is just a conversion of the ref.gbk to FASTA. So, I am not sure how number 3 could happen without a change in the samples. I think we would need more information to troubleshoot. Can you confirm that the sequence data in Sample001/ref.fa is in fact the same as the sequence in ref.gbk?
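One way to check this is sketched below in plain Python, with no dependency on snippy itself. It assumes a standard GenBank layout (`LOCUS` line, `ORIGIN` block, `//` terminator) and compares records in file order while ignoring their names; the helper names `read_fasta`, `read_genbank` and `same_sequences` are made up for this example.

```python
import re

def read_fasta(text):
    """Return {record_id: sequence} from FASTA text, sequences uppercased."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Record id is the first word after '>'
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.upper())
    return {k: "".join(v) for k, v in seqs.items()}

def read_genbank(text):
    """Return {locus_name: sequence} from GenBank text (ORIGIN blocks only)."""
    seqs, name, in_origin = {}, None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
            in_origin = False
        elif line.startswith("ORIGIN"):
            in_origin = True
            seqs[name] = []
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines look like '   61 acgtacgtac acgt...'; keep letters only
            seqs[name].append(re.sub(r"[^A-Za-z]", "", line).upper())
    return {k: "".join(v) for k, v in seqs.items()}

def same_sequences(fasta_text, genbank_text):
    """True if both files hold the same sequences in the same order (names ignored)."""
    fasta = list(read_fasta(fasta_text).values())
    genbank = list(read_genbank(genbank_text).values())
    return fasta == genbank
```

Reading `Sample001/ref.fa` and `ref.gbk` into strings and passing them to `same_sequences` should return `True` if the FASTA really is a straight conversion. A `False` would point to a different or modified reference, which could explain result 3.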

As to number 2, I suspect one or more of your outgroup samples are too distant, which would mean the core genome is effectively zero.
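To illustrate why the sample set matters, here is a toy model in Python. It is a simplification and not snippy-core's actual code: it assumes a site counts as "core" only when every sample has an unambiguous A/C/G/T call there, and that pairwise distances are counted over core sites only. The sequences, and the names `Sample050` and `Outgroup`, are invented for the example.

```python
def core_columns(alignment):
    """Indices of columns where every sample has an A/C/G/T call (assumed core rule)."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0]))
            if all(s[i] in "ACGT" for s in seqs)]

def core_distance(alignment, a, b):
    """Number of core columns at which samples a and b differ."""
    return sum(alignment[a][i] != alignment[b][i]
               for i in core_columns(alignment))

# Toy whole-genome alignment: '-' or 'N' marks a position with no usable call.
aln = {
    "Sample021": "ACGTACGT",
    "Sample043": "ATGTACCA",
    "Sample050": "ACGTAC-T",   # hypothetical third isolate with one uncalled site
}

# The same pair of samples, measured against three different sample sets.
pair_only = {k: aln[k] for k in ("Sample021", "Sample043")}
with_outgroup = dict(aln, Outgroup="NNNNNNNN")  # hypothetical, nothing aligned

for label, data in [("pair only", pair_only),
                    ("three isolates", aln),
                    ("plus outgroup", with_outgroup)]:
    print(label, core_distance(data, "Sample021", "Sample043"))
```

In this toy, the distance between the same two samples shrinks each time a sample with missing positions is added, and it drops to zero once a sample with no aligned positions removes every column from the core.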

andersgs • Apr 06 '21 23:04