[ancestral] reference seq may be reference or inferred tree root
Current Behavior
augur ancestral exports a "reference" sequence via <JSON>.reference.nuc. Depending on the usage this is:
- For VCF inputs this is simply the FASTA input --vcf-reference
- For FASTA inputs, if --root-sequence (FASTA/GenBank) is provided then we read that (and error if we can't)
- For FASTA inputs without --root-sequence then the root sequence is the inferred sequence at the root node.
Salient code:
https://github.com/nextstrain/augur/blob/d35f8382f0c6640d7b3ffde78f41544e157f48b7/augur/ancestral.py#L157-L160
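For readers who don't want to follow the link, the three cases above can be sketched roughly as follows. This is an illustration, not the code in augur/ancestral.py: the function name choose_reference, its parameters, and the provenance labels are all hypothetical.

```python
from typing import Optional, Tuple


def choose_reference(
    input_kind: str,               # hypothetical: "vcf" or "fasta"
    vcf_reference: Optional[str],  # sequence read from --vcf-reference
    root_sequence: Optional[str],  # sequence read from --root-sequence
    inferred_root: str,            # sequence inferred at the root node
) -> Tuple[str, str]:
    """Return (sequence, provenance) for what ends up in <JSON>.reference.nuc."""
    if input_kind == "vcf":
        # Case 1: VCF input, the reference is the provided FASTA.
        if vcf_reference is None:
            raise ValueError("VCF input requires a reference sequence")
        return vcf_reference, "vcf-reference"
    if root_sequence is not None:
        # Case 2: FASTA input with an explicit root sequence.
        return root_sequence, "root-sequence"
    # Case 3: FASTA input without a root sequence; fall back to the inferred root.
    return inferred_root, "inferred-root"
```

The provenance label makes the ambiguity explicit: in the third case the value stored under "reference" was never supplied by the user.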
Note that the relationship between <JSON>.reference.nuc and the sequence at the root-node is correct in each of those cases. This is important as it implies the (JSON nuc) reference sequence is appropriate to use in the context of a nextclade dataset. See the following tests, respectively:
See also related issue #1361.
Expected behavior
We should clearly distinguish between reference & root-sequence in the JSON key names.
How to reproduce
See above cram tests
Possible solution
Only write <JSON>.reference in the first two cases, where we know that it actually corresponds to a provided reference. In the third case there can be no inferred mutations on the root node, so export can just use the sequence attached to the root rather than simply node_data.reference as it does now.
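A minimal sketch of what this could look like, assuming a node-data layout with a top-level "nodes" mapping whose entries carry a "sequence" key. Both function names are hypothetical and neither is existing augur code.

```python
from typing import Dict, Optional


def reference_entry(provided_reference: Optional[str]) -> Dict[str, Dict[str, str]]:
    """Top-level keys to merge into the node-data JSON (ancestral side).

    "reference" is written only when a reference was actually provided
    (--vcf-reference or --root-sequence); otherwise nothing is written.
    """
    if provided_reference is None:
        return {}
    return {"reference": {"nuc": provided_reference}}


def root_sequence_for_export(node_data: dict, root_name: str) -> str:
    """Pick the root sequence on the export side.

    Prefer a provided reference; without one, the root node carries no
    mutations, so the sequence attached to it can be used directly.
    Assumes node_data["nodes"][root_name]["sequence"] exists.
    """
    reference = node_data.get("reference")
    if reference is not None:
        return reference["nuc"]
    return node_data["nodes"][root_name]["sequence"]
```

With this split, a consumer such as a nextclade dataset builder could treat the presence of "reference" as a guarantee that the sequence was user-supplied.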
Your environment: if running Nextstrain locally
augur 23.1.1