augur clades only partially assigns clade information
Current Behavior
When running the augur clades command, the JSON file produced contains only a partial list of assigned clades, with the remaining nodes showing as "unassigned". When using the --reference option, all branches are set to "unassigned".
Expected behavior
All branches should be assigned the correct clade.
How to reproduce
I'm using the following docker container: quay.io/biocontainers/augur:22.0.2--pyhdfd78af_0
With the following command: augur clades --tree kilifi_H3N2_new_docker_timetree.nwk --mutations kilifi_H3N2_new_docker_nt_muts.json kilifi_H3N2_new_docker_aa_muts.json --clades clades_h3n2_ha.tsv --output-node-data test_clades.json
Here are all the input and output files: augur_clade_input_output.zip
with the test_clades.json having the following content:
{
"branches": {
"NODE_0000006": {
"labels": {
"clade": "3C.2a"
}
},
"SRR11445940_A_HA_H3": {
"labels": {
"clade": "3C.2a1"
}
}
},
"generated_by": {
"program": "augur",
"version": "22.0.2"
},
"nodes": {
"100734_A_HA_H3": {
"clade_membership": "unassigned"
},
"100954_A_HA_H3": {
"clade_membership": "unassigned"
},
"109275_A_HA_H3": {
"clade_membership": "unassigned"
},
"109292_A_HA_H3": {
"clade_membership": "unassigned"
},
"109342_A_HA_H3": {
"clade_membership": "unassigned"
},
"109562_A_HA_H3": {
"clade_membership": "unassigned"
},
"109630_A_HA_H3": {
"clade_membership": "unassigned"
},
"109974_A_HA_H3": {
"clade_membership": "unassigned"
},
"110108_A_HA_H3": {
"clade_membership": "unassigned"
},
"115485_A_HA_H3": {
"clade_membership": "unassigned"
},
"115722_A_HA_H3": {
"clade_membership": "unassigned"
},
"115833_A_HA_H3": {
"clade_membership": "unassigned"
},
"115863_A_HA_H3": {
"clade_membership": "unassigned"
},
"116143_A_HA_H3": {
"clade_membership": "unassigned"
},
"116165_A_HA_H3": {
"clade_membership": "unassigned"
},
"116225_A_HA_H3": {
"clade_membership": "unassigned"
},
"116281_A_HA_H3": {
"clade_membership": "unassigned"
},
"116354_A_HA_H3": {
"clade_membership": "unassigned"
},
"116389_A_HA_H3": {
"clade_membership": "unassigned"
},
"124408_A_HA_H3": {
"clade_membership": "unassigned"
},
"124728_A_HA_H3": {
"clade_membership": "3C.2a"
},
"133124_A_HA_H3": {
"clade_membership": "3C.2a"
},
"133619_A_HA_H3": {
"clade_membership": "3C.2a"
},
"134526_A_HA_H3": {
"clade_membership": "3C.2a"
},
"134927_A_HA_H3": {
"clade_membership": "3C.2a"
},
"135010_A_HA_H3": {
"clade_membership": "3C.2a"
},
"135156_A_HA_H3": {
"clade_membership": "3C.2a"
},
"135379_A_HA_H3": {
"clade_membership": "3C.2a"
},
"135553_A_HA_H3": {
"clade_membership": "3C.2a"
},
"135676_A_HA_H3": {
"clade_membership": "3C.2a"
},
"92804_A_HA_H3": {
"clade_membership": "unassigned"
},
"93547_A_HA_H3": {
"clade_membership": "unassigned"
},
"94414_A_HA_H3": {
"clade_membership": "unassigned"
},
"99056_A_HA_H3": {
"clade_membership": "unassigned"
},
"NODE_0000000": {
"clade_membership": "unassigned"
},
"NODE_0000002": {
"clade_membership": "unassigned"
},
"NODE_0000003": {
"clade_membership": "unassigned"
},
"NODE_0000005": {
"clade_membership": "unassigned"
},
"NODE_0000006": {
"clade_membership": "3C.2a"
},
"NODE_0000007": {
"clade_membership": "3C.2a"
},
"NODE_0000008": {
"clade_membership": "3C.2a"
},
"NODE_0000010": {
"clade_membership": "3C.2a"
},
"NODE_0000011": {
"clade_membership": "3C.2a"
},
"NODE_0000012": {
"clade_membership": "3C.2a"
},
"NODE_0000013": {
"clade_membership": "3C.2a"
},
"NODE_0000016": {
"clade_membership": "3C.2a"
},
"NODE_0000017": {
"clade_membership": "3C.2a"
},
"NODE_0000018": {
"clade_membership": "unassigned"
},
"NODE_0000019": {
"clade_membership": "unassigned"
},
"NODE_0000020": {
"clade_membership": "unassigned"
},
"NODE_0000021": {
"clade_membership": "unassigned"
},
"NODE_0000023": {
"clade_membership": "unassigned"
},
"NODE_0000025": {
"clade_membership": "unassigned"
},
"NODE_0000028": {
"clade_membership": "unassigned"
},
"NODE_0000029": {
"clade_membership": "unassigned"
},
"NODE_0000030": {
"clade_membership": "unassigned"
},
"NODE_0000032": {
"clade_membership": "unassigned"
},
"NODE_0000033": {
"clade_membership": "unassigned"
},
"NODE_0000034": {
"clade_membership": "unassigned"
},
"NODE_0000035": {
"clade_membership": "unassigned"
},
"SRR11445892_A_HA_H3": {
"clade_membership": "3C.2a"
},
"SRR11445940_A_HA_H3": {
"clade_membership": "3C.2a1"
},
"SRR11445941_A_HA_H3": {
"clade_membership": "3C.2a"
},
"SRR13443360_A_HA_H3": {
"clade_membership": "unassigned"
}
}
}
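To see at a glance how many nodes ended up in each clade, a short script along these lines may help. It is only a sketch: the inline sample copies a few entries from the output above, and the helper name tally_clades is made up for illustration.

```python
import json
from collections import Counter


def tally_clades(node_data):
    """Count how many nodes carry each clade_membership value."""
    return Counter(
        entry.get("clade_membership", "missing")
        for entry in node_data.get("nodes", {}).values()
    )


# Small inline sample in the same shape as test_clades.json
sample = json.loads("""
{
  "nodes": {
    "100734_A_HA_H3": {"clade_membership": "unassigned"},
    "124728_A_HA_H3": {"clade_membership": "3C.2a"},
    "SRR11445940_A_HA_H3": {"clade_membership": "3C.2a1"},
    "NODE_0000000": {"clade_membership": "unassigned"}
  }
}
""")

counts = tally_clades(sample)
for clade, n in counts.most_common():
    print(f"{clade}: {n}")
```

To run it on the real output, load the file with json.load(open("test_clades.json")) in place of the inline sample.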
Hi @cimendes,
This is expected behavior of augur clades when the node does not have the amino acid and nucleotide mutations that match your clade definitions.
I suspect you need to update the coordinates within clades_h3n2_ha.tsv.
Currently, it is an exact copy of the H3N2 clades.tsv from the seasonal-flu repo, which was created based on the seasonal-flu repo's reference.fasta and genemap.gff.
If you look at the seasonal-flu's genemap.gff, it has different start/end coordinates than the coordinates listed for your reference in reference_h3n2_ha.gb.
Also note that the --reference option is not a supported feature yet. You should have seen this warning when you tried to use this option.
That said, it is unexpected that using the --reference option affected your output. That sounds like a bug that should be fixed!
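To illustrate why a coordinate mismatch leaves nodes "unassigned", here is a toy sketch of allele-based matching. It is a simplification and not augur's actual code: it ignores the tree and inheritance entirely, and the gene name, sequences, clade name, and the function matching_clades are all invented for the example.

```python
def matching_clades(definitions, sequences):
    """Return clades whose defining alleles are all present in one node.

    definitions: {clade: [(gene, site, alt), ...]} with 1-based sites
    sequences:   {gene: sequence string} for a single node
    """
    matched = []
    for clade, alleles in definitions.items():
        if all(
            gene in sequences
            and 0 < site <= len(sequences[gene])
            and sequences[gene][site - 1] == alt
            for gene, site, alt in alleles
        ):
            matched.append(clade)
    return matched


# Hypothetical clade defined by two amino acids in a made-up gene "HA1"
definitions = {"toy_clade": [("HA1", 3, "K"), ("HA1", 5, "T")]}

# Same protein, numbered the way the clade definitions expect
aligned = {"HA1": "MAKLTQ"}
# Same protein, but numbered from a start position two residues earlier
shifted = {"HA1": "XXMAKLTQ"}

print(matching_clades(definitions, aligned))  # ['toy_clade']
print(matching_clades(definitions, shifted))  # []
```

The second call returns nothing even though the protein is the same, because every defining site now points at a different residue. That is the kind of mismatch to look for between the clade definitions and your own reference annotation.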
Just coming back to this issue:
- The samples we have are older H3N2s (2009-2015), and are just for training purposes. We wanted a good study, with raw reads available and some metadata.
- Here is a sample HA sequence: 109342_HA.fasta.zip
- From an explanation by @corneliusroemer, these older sequences should get the clade "unassigned", which is what happens when I use the Nextclade web version with the reference "CY163680".
- However, when I use the reference "EPI1857216", I get a 3C clade for that sample, which appears to be incorrect, as the original paper reports clade 7.
- Shouldn't both references give the same clade output, or in which cases should one be used over the other?
Hi @jrotieno, the issue you are running into is slightly different. Nextclade uses a different algorithm for clade assignment that is separate from the augur clades command.
As noted in the Clade assignment section:
Nextclade assigns the clade of the nearest reference node found during the Phylogenetic placement step.
Since the two references use different reference trees, they could potentially assign different clades to the same sample.
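As a rough illustration of why the reference tree matters, here is a toy nearest-node lookup. It is not Nextclade's implementation: the distance measure (size of the symmetric difference of mutation sets), the mutation names, the clade labels, and both trees are assumptions made up for the example.

```python
def nearest_node_clade(query_muts, reference_nodes):
    """Return the clade of the reference node with the fewest differing mutations."""
    def distance(node):
        # Mutations present in one set but not the other
        return len(set(query_muts) ^ set(node["muts"]))

    return min(reference_nodes, key=distance)["clade"]


# Invented query sample and two invented reference trees
query = {"A10G", "C25T"}

tree_a = [
    {"name": "n1", "muts": {"A10G"}, "clade": "unassigned"},
    {"name": "n2", "muts": {"A10G", "C25T", "G40A"}, "clade": "unassigned"},
]
tree_b = [
    {"name": "m1", "muts": {"A10G", "C25T"}, "clade": "clade_X"},
    {"name": "m2", "muts": {"T99C"}, "clade": "clade_Y"},
]

print(nearest_node_clade(query, tree_a))  # unassigned
print(nearest_node_clade(query, tree_b))  # clade_X
```

The same query gets a different label depending only on which nodes are available to be "nearest", which is the effect described above.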
in which cases should one be used over the other?
Others will definitely have more insight here, but older samples would require an older reference, since samples are aligned against the reference for mutation calling.