[ancestral] reference seq may be reference or inferred tree root

Open jameshadfield opened this issue 2 years ago • 2 comments

Current Behavior

augur ancestral exports a "reference" sequence via <JSON>.reference.nuc. Depending on how the command is invoked, this is:

  1. For VCF inputs, this is simply the FASTA supplied via --vcf-reference
  2. For FASTA inputs, if --root-sequence (FASTA/GenBank) is provided then we read that (and error if we can't)
  3. For FASTA inputs without --root-sequence, the exported reference is the inferred sequence at the root node. Salient code:

https://github.com/nextstrain/augur/blob/d35f8382f0c6640d7b3ffde78f41544e157f48b7/augur/ancestral.py#L157-L160
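The three cases above can be sketched as a single selection function. This is only an illustration of the behaviour described in this issue, not the code in ancestral.py: the function name and its parameters are hypothetical, and the sequences are passed as plain strings rather than read from files.

```python
from typing import Optional


def current_reference_nuc(
    vcf_reference: Optional[str],
    root_sequence: Optional[str],
    inferred_root: str,
) -> str:
    """Hypothetical helper mirroring the three cases listed in this issue."""
    if vcf_reference is not None:
        # Case 1: VCF input, the reference is the --vcf-reference FASTA.
        return vcf_reference
    if root_sequence is not None:
        # Case 2: FASTA input with --root-sequence provided.
        return root_sequence
    # Case 3: FASTA input without --root-sequence, so the "reference"
    # is really the inferred sequence at the tree root.
    return inferred_root
```

The point of the sketch is that all three branches land in the same key, so a consumer of the JSON cannot tell a provided reference from an inferred root.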

Note that the relationship between <JSON>.reference.nuc and the sequence at the root-node is correct in each of those cases. This is important as it implies the (JSON nuc) reference sequence is appropriate to use in the context of a nextclade dataset. See the following tests, respectively:

  1. cram test
  2. cram test
  3. cram test

See also the related issue #1361.

Expected behavior

We should clearly distinguish between reference & root-sequence in the JSON key names.

How to reproduce

See above cram tests

Possible solution

Only write <JSON>.reference in cases 1 & 2, where we know that it actually corresponds to a provided reference. For case 3 there can be no inferred mutations on the root node, so export can just use the sequence attached to the root rather than node_data.reference as it does now.
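A minimal sketch of what this could look like, assuming a simplified node-data layout with a top-level "reference" key and a per-node "sequence" field. The function names, the root node name and the dictionary layout are hypothetical and are not augur's actual implementation.

```python
from typing import Optional


def build_node_data(
    root_name: str,
    inferred_root: str,
    provided_reference: Optional[str] = None,
) -> dict:
    """Hypothetical writer: only emit "reference" when one was provided."""
    node_data = {"nodes": {root_name: {"sequence": inferred_root}}}
    if provided_reference is not None:
        # Cases 1 & 2: a reference was supplied by the user.
        node_data["reference"] = {"nuc": provided_reference}
    return node_data


def root_sequence_for_export(node_data: dict, root_name: str) -> str:
    """Hypothetical reader on the export side."""
    if "reference" in node_data:
        return node_data["reference"]["nuc"]
    # Case 3: no reference was written, so use the sequence attached
    # to the root node directly.
    return node_data["nodes"][root_name]["sequence"]
```

With this shape, the presence of the "reference" key itself tells downstream tools whether the sequence was provided or inferred.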

Your environment: if running Nextstrain locally

augur 23.1.1

jameshadfield, Dec 18 '23 03:12