ENH: VCF output

Open dpark01 opened this issue 2 years ago • 1 comments

The nextclade CLI currently provides a nice tabular output file describing, for each genome, the observed non-synonymous SNPs and indels. However:

  • it does not describe synonymous SNPs
  • it has separate fields for SNPs vs indels
  • (most importantly) it does not provide information about missingness: absence of variant calls here cannot be interpreted as a reference allele

If I were to propose / request a new output file that described synonymous variants, distinguished missing data from reference alleles, etc., then (aside from transposing the table) it starts to sound a lot like a VCF file. And if the nextclade CLI could produce a VCF or gVCF output file, its output would become parseable by a bunch of other command-line tools and Python/Perl/C APIs for free.

dpark01 avatar Feb 02 '22 18:02 dpark01

Thanks for the suggestion.

I'm not sure if you're aware that all the information needed to reconstruct the input and alignment fasta is available from the following fields:

  • substitutions
  • insertions
  • deletions
  • ambiguous
  • missing

and hence in theory, it's possible to transform the output into VCF (without aa mutation info).
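
To make the idea concrete, here is a minimal sketch of the substitution part of such a conversion. It rests on assumptions not confirmed in this thread: that the `substitutions` field is a comma-separated list of tokens of the form `<ref><position><alt>` (for example `C241T`) with 1-based positions, and that a chromosome name is supplied by the caller. The function name `substitutions_to_vcf_records` and the default `chrom` value are hypothetical. Insertions, deletions, ambiguous and missing ranges are not handled here; deletions in particular would need the reference sequence to obtain the anchor base that VCF requires.

```python
import re

# Assumed token format: reference base, 1-based position, alternate base (e.g. "C241T").
SUB_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def substitutions_to_vcf_records(substitutions, chrom="ref"):
    """Turn a comma-separated substitutions string into VCF data lines.

    Hypothetical helper: only plain A/C/G/T substitutions are supported,
    and ID, QUAL, FILTER and INFO are left as missing values (".").
    """
    records = []
    for token in substitutions.split(","):
        token = token.strip()
        if not token:
            continue  # an empty field means no substitutions
        match = SUB_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognised substitution token: {token!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        records.append((chrom, pos, ".", ref, alt, ".", ".", "."))
    # VCF data lines are expected in ascending position order within a contig.
    records.sort(key=lambda rec: rec[1])
    return ["\t".join(str(col) for col in rec) for rec in records]
```

For example, `substitutions_to_vcf_records("A23403G,C241T", chrom="MN908947")` yields two tab-separated lines, with position 241 first. Note that this sketch says nothing about missingness: positions absent from the result may be reference calls or no-calls, which is exactly the distinction a gVCF-style output would need to draw from the `missing` field.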

By "not describing synonymous SNPs", do you mean that we don't connect nucleotide and amino acid mutations in the output files? Is that part of a VCF file?

corneliusroemer avatar Feb 02 '22 19:02 corneliusroemer