
BUG: export validation unexpectedly fails with `... is not valid under any of the given schemas. Trace: properties - tree - oneOf`

Open corneliusroemer opened this issue 1 year ago • 4 comments

Current Behavior

When making a monkeypox build with custom sequences and spiked metadata, Bryan encountered the following strange export validation error:

Validating produced JSON
Validating schema of 'results/nebraska_hmpxv1/raw_tree.json'...
        ERROR: {'name': 'NODE_0000000', 'node_attrs': {'div': 0, 'num_date': {'value': 2022.267077937976, 'confidence': [2022.2577265025093, 2022.3751526984554]}}, 'branch_attrs': {'mutations': {}}, 'children': [{'name': 'OP171922', 'node_attrs': 
[... skip long JSON ...]
'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-29'}}, 'branch_attrs': {'mutations': {}}}]}]}]}]} is not valid under any of the given schemas. Trace: properties - tree - oneOf
Validation of 'results/nebraska_hmpxv1/raw_tree.json' failed.

When we add --skip-validation to export, the produced tree works exactly as expected and opens fine in Auspice.

Expected behavior

No validation error happens, or if one does, it explains what the problem is. The current message is unintelligible, even to me as a heavy augur user. I looked at the metadata and it looks fine; I cannot see what the issue could be.

How to reproduce

Steps to reproduce the current behavior:

  1. Download files to reproduce: augur_export_validation_bug.tzst.txt
  2. Untar: tar xf augur_export_validation_bug.tzst.txt --directory .
  3. Run the export command:
augur export v2 \
    --tree results/nebraska_hmpxv1/tree.nwk \
    --metadata results/nebraska_hmpxv1/metadata.tsv \
    --node-data results/nebraska_hmpxv1/branch_lengths.json \
        results/nebraska_hmpxv1/nt_muts.json \
        results/nebraska_hmpxv1/aa_muts.json \
    --auspice-config config/bryan_auspice_config_hmpxv1.json \
    --include-root-sequence \
    --output results/nebraska_hmpxv1/raw_tree.json
  4. Observe result

Also, to observe that the tree is actually fine, add --skip-validation and look at the resulting tree in auspice.us

Version: augur 17.1.0

See full log
monkeypox on  master [!?] via 🅒 nextstrain 
❯        augur export v2             --tree results/nebraska_hmpxv1/tree.nwk             --metadata results/nebraska_hmpxv1/metadata.tsv             --node-data results/nebraska_hmpxv1/branch_lengths.json results/nebraska_hmpxv1/nt_muts.json results/nebraska_hmpxv1/aa_muts.json --auspice-config config/bryan_auspice_config_hmpxv1.json             --include-root-sequence             --output results/nebraska_hmpxv1/raw_tree.json
Validating schema of 'results/nebraska_hmpxv1/nt_muts.json'...
Validating schema of 'results/nebraska_hmpxv1/aa_muts.json'...
Validating config file config/bryan_auspice_config_hmpxv1.json against the JSON schema
Validating schema of 'config/bryan_auspice_config_hmpxv1.json'...
WARNING: You asked for a color-by for trait 'GA_CT_fraction', but it has no values on the tree. It has been ignored.

WARNING: You asked for a color-by for trait 'dinuc_context_fraction', but it has no values on the tree. It has been ignored.

WARNING: You asked for a color-by for trait 'recency', but it has no values on the tree. It has been ignored.

Validating produced JSON
Validating schema of 'results/nebraska_hmpxv1/raw_tree.json'...
        ERROR: {'name': 'NODE_0000000', 'node_attrs': {'div': 0, 'num_date': {'value': 2022.267077937976, 'confidence': [2022.2577265025093, 2022.3751526984554]}}, 'branch_attrs': {'mutations': {}}, 'children': [{'name': 'OP171922', 'node_attrs': {'div': 1, 'num_date': {'value': 2022.5328767123287, 'confidence': [2022.5328767123287, 2022.5328767123287]}, 'accession': 'OP171922', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {'nuc': ['C172259T'], 'OPG200': ['S37L']}, 'labels': {'aa': 'OPG200: S37L'}}}, {'name': 'NODE_0000002', 'node_attrs': {'div': 0, 'num_date': {'value': 2022.270431228966, 'confidence': [2022.26070004096, 2022.3734683189246]}}, 'branch_attrs': {'mutations': {}}, 'children': [{'name': 'OP171923', 'node_attrs': {'div': 6, 'num_date': {'value': 2022.535616438356, 'confidence': [2022.535616438356, 2022.535616438356]}, 'accession': 'OP171923', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 
'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.8'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {'nuc': ['G5595A', 'G22643A', 'G63811A', 'G78034A', 'C80111T', 'T119985C'], 'OPG015': ['Q188*'], 'OPG038': ['H173Y'], 'OPG099': ['D124N']}, 'labels': {'aa': 'OPG015: Q188*; OPG038: H173Y; OPG099: D124N'}}}, {'name': 'NODE_0000006', 'node_attrs': {'div': 5, 'num_date': {'value': 2022.423423410786, 'confidence': [2022.3576842650348, 2022.4618766095477]}}, 'branch_attrs': {'mutations': {'nuc': ['G98233A', 'G98455A', 'G98456A', 'G111084A', 'C182950T'], 'OPG117': ['D729N'], 'OPG118': ['G4K'], 'OPG210': ['S532L']}, 'labels': {'aa': 'OPG117: D729N; OPG118: G4K; OPG210: S532L'}}, 'children': [{'name': 'OP171925', 'node_attrs': {'div': 7, 'num_date': {'value': 2022.5630136986301, 'confidence': [2022.5630136986301, 2022.5630136986301]}, 'accession': 'OP171925', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {'nuc': ['C10484T', 'A61974G']}}}, {'name': 'NODE_0000007', 'node_attrs': {'div': 7, 'num_date': {'value': 2022.4808219178083, 'confidence': [2022.4408955734364, 2022.4808219178083]}}, 'branch_attrs': {'mutations': {'nuc': ['G102694A', 'G160484A']}}, 'children': 
[{'name': 'OP171920', 'node_attrs': {'div': 7, 'num_date': {'value': 2022.5054794520547, 'confidence': [2022.5054794520547, 2022.5054794520547]}, 'accession': 'OP171920', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP171919', 'node_attrs': {'div': 7, 'num_date': {'value': 2022.4808219178083, 'confidence': [2022.4808219178083, 2022.4808219178083]}, 'accession': 'OP171919', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {}}}]}]}, {'name': 'NODE_0000001', 'node_attrs': {'div': 4, 'num_date': {'value': 2022.53510821763, 'confidence': [2022.5347333887082, 2022.5616584710476]}}, 'branch_attrs': {'mutations': {'nuc': ['C18133T', 'G67611A', 
'G130231A', 'G159277A'], 'OPG185': ['E121K']}, 'labels': {'aa': 'OPG185: E121K'}}, 'children': [{'name': 'OP314960', 'node_attrs': {'div': 5, 'num_date': {'value': 2022.5904109589042, 'confidence': [2022.5904109589042, 2022.5904109589042]}, 'accession': 'OP314960', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-29'}}, 'branch_attrs': {'mutations': {'nuc': ['C94099T'], 'OPG113': ['A790V']}, 'labels': {'aa': 'OPG113: A790V'}}}, {'name': 'OP431825', 'node_attrs': {'div': 4, 'num_date': {'value': 2022.5986301369862, 'confidence': [2022.5986301369862, 2022.5986301369862]}, 'accession': 'OP431825', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-14'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP314958', 'node_attrs': {'div': 7, 
'num_date': {'value': 2022.5821917808219, 'confidence': [2022.5821917808219, 2022.5821917808219]}, 'accession': 'OP314958', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-29'}}, 'branch_attrs': {'mutations': {'nuc': ['C169036T', 'C179853T', 'C189459T']}}}, {'name': 'MPXV22/human/USA-NE-127', 'node_attrs': {'div': 4, 'num_date': {'value': 2022.5986301369862, 'confidence': [2022.5986301369862, 2022.5986301369862]}, 'accession': 'MPXV22/human/USA-NE-127', 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-09'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP314957', 'node_attrs': {'div': 4, 'num_date': {'value': 2022.5684931506848, 'confidence': [2022.5684931506848, 2022.5684931506848]}, 'accession': 'OP314957', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 
'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-29'}}, 'branch_attrs': {'mutations': {}}}]}, {'name': 'NODE_0000010', 'node_attrs': {'div': 1, 'num_date': {'value': 2022.5328767123287, 'confidence': [2022.4750499078364, 2022.5328767123287]}}, 'branch_attrs': {'mutations': {'nuc': ['G190660A'], 'NBT03_gp174': ['R84K']}, 'labels': {'aa': 'NBT03_gp174: R84K'}}, 'children': [{'name': 'OP171924', 'node_attrs': {'div': 1, 'num_date': {'value': 2022.5630136986301, 'confidence': [2022.5630136986301, 2022.5630136986301]}, 'accession': 'OP171924', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP314959', 'node_attrs': {'div': 2, 'num_date': {'value': 2022.5876712328768, 'confidence': [2022.5876712328768, 2022.5876712328768]}, 'accession': 'OP314959', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., 
Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-29'}}, 'branch_attrs': {'mutations': {'nuc': ['C119884T'], 'OPG137': ['P95S']}, 'labels': {'aa': 'OPG137: P95S'}}}, {'name': 'OP431826', 'node_attrs': {'div': 1, 'num_date': {'value': 2022.6123287671232, 'confidence': [2022.6123287671232, 2022.6123287671232]}, 'accession': 'OP431826', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-14'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'MPXV22/human/USA-NE-154', 'node_attrs': {'div': 1, 'num_date': {'value': 2022.6123287671232, 'confidence': [2022.6123287671232, 2022.6123287671232]}, 'accession': 'MPXV22/human/USA-NE-154', 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-09'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP171921', 
'node_attrs': {'div': 1, 'num_date': {'value': 2022.5328767123287, 'confidence': [2022.5328767123287, 2022.5328767123287]}, 'accession': 'OP171921', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-08'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'NODE_0000015', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.609589041096, 'confidence': [2022.55718007742, 2022.609589041096]}}, 'branch_attrs': {'mutations': {'nuc': ['G55133A', 'C64426T'], 'OPG074': ['R665C']}, 'labels': {'aa': 'OPG074: R665C'}}, 'children': [{'name': 'OP431827', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.6205479452055, 'confidence': [2022.6205479452055, 2022.6205479452055]}, 'accession': 'OP431827', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': 
'2022-09-14'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'MPXV22/human/USA-NE-199', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.6342465753426, 'confidence': [2022.6342465753426, 2022.6342465753426]}, 'accession': 'MPXV22/human/USA-NE-199', 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-09'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'MPXV22/human/USA-NE-181', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.6260273972603, 'confidence': [2022.6260273972603, 2022.6260273972603]}, 'accession': 'MPXV22/human/USA-NE-181', 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-09'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP431829', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.6342465753426, 'confidence': [2022.6342465753426, 2022.6342465753426]}, 'accession': 'OP431829', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 
'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-14'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'MPXV22/human/USA-NE-176', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.6205479452055, 'confidence': [2022.6205479452055, 2022.6205479452055]}, 'accession': 'MPXV22/human/USA-NE-176', 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-09'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP431828', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.6260273972603, 'confidence': [2022.6260273972603, 2022.6260273972603]}, 'accession': 'OP431828', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-09-14'}}, 'branch_attrs': {'mutations': {}}}, {'name': 'OP314961', 'node_attrs': {'div': 3, 'num_date': {'value': 2022.609589041096, 'confidence': [2022.609589041096, 2022.609589041096]}, 'accession': 'OP314961', 'author': {'author': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.', 'value': 'Tegomoh,B., Cross,S.T., Chapman,R.C., Bernhard,K., McCutchen,E.L., Fauver,J.R., Pratt,C.B., 
Warden,D.E., Iwen,P.C., Donahue,M., Wiley,M.R.'}, 'Outbreak_Associated_2022': {'value': 'Yes'}, 'outbreak': {'value': 'hMPXV-1'}, 'country': {'value': 'USA'}, 'region': {'value': 'North America'}, 'sample_collection_year': {'value': 2022}, 'lineage': {'value': 'B.1.3'}, 'host': {'value': 'Homo sapiens'}, 'division': {'value': 'Nebraska'}, 'Site': {'value': 'Nebraska-USA'}, 'date_submitted': {'value': '2022-08-29'}}, 'branch_attrs': {'mutations': {}}}]}]}]}]} is not valid under any of the given schemas. Trace: properties - tree - oneOf
Validation of 'results/nebraska_hmpxv1/raw_tree.json' failed.

------------------------
Validation of results/nebraska_hmpxv1/raw_tree.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. 
------------------------

I've gotten rid of unnecessary node data - as that was not the root cause. That's why there are validation warnings - but they don't have anything to do with the bug here.

corneliusroemer avatar Sep 16 '22 18:09 corneliusroemer

The validation errors are unfortunately poor here because of bad support for the JSON Schema oneOf type in the Python library we're using to validate. The actual errors don't percolate up in a useful way, so it produces a generic error. If I modify the schema to work around that

diff --git a/augur/data/schema-export-v2.json b/augur/data/schema-export-v2.json
index edf5b740..28191a3e 100644
--- a/augur/data/schema-export-v2.json
+++ b/augur/data/schema-export-v2.json
@@ -323,17 +323,7 @@
                 }
             }
         },
-        "tree": {
-            "description": "One or more phylogenies using a nested JSON structure",
-            "oneOf": [
-                {"$ref": "#/$defs/tree"},
-                {
-                    "type": "array",
-                    "minItems": 1,
-                    "items": {"$ref": "#/$defs/tree"}
-                }
-            ]
-        }
+        "tree": {"$ref": "#/$defs/tree"}
     },
     "$defs": {
         "tree": {

to see more specifically what's wrong, I get:

Validating schema of 'results/nebraska_hmpxv1/raw_tree.json'...
        ERROR: 'MPXV22/human/USA-NE-127' does not match '^[0-9A-Za-z-_.]+$'. Trace: ... - properties - node_attrs - properties - accession - pattern
        ERROR: 'MPXV22/human/USA-NE-154' does not match '^[0-9A-Za-z-_.]+$'. Trace: ... - properties - node_attrs - properties - accession - pattern
        ERROR: 'MPXV22/human/USA-NE-176' does not match '^[0-9A-Za-z-_.]+$'. Trace: ... - properties - node_attrs - properties - accession - pattern
        ERROR: 'MPXV22/human/USA-NE-181' does not match '^[0-9A-Za-z-_.]+$'. Trace: ... - properties - node_attrs - properties - accession - pattern
        ERROR: 'MPXV22/human/USA-NE-199' does not match '^[0-9A-Za-z-_.]+$'. Trace: ... - properties - node_attrs - properties - accession - pattern
Validation of 'results/nebraska_hmpxv1/raw_tree.json' failed.

I think there's a discussion to be had for how much we should or shouldn't be trying to constrain property values, but putting that aside for now, this is erroring because the accession field isn't actually an accession.
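To see why only the five locally named samples trip this rule, the pattern from the error output can be tried directly. The snippet below is a minimal sketch using only Python's standard `re` module; the helper name `invalid_accessions` is made up for illustration and is not part of augur.

```python
import re

# Pattern copied from the validation output above. Inside the character
# class, '-', '_' and '.' are literals, so '/' is not an allowed character.
ACCESSION_PATTERN = re.compile(r"^[0-9A-Za-z-_.]+$")


def invalid_accessions(values):
    """Return the values that do not match the accession pattern."""
    return [v for v in values if not ACCESSION_PATTERN.search(v)]


if __name__ == "__main__":
    samples = ["OP171922", "MPXV22/human/USA-NE-127"]
    # Only the strain-style name containing '/' is reported.
    print(invalid_accessions(samples))
```

Under this pattern a GenBank-style value such as `OP171922` passes, while a strain name used as the accession fails on its slashes.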

The tree should also be included in the tarball

For anyone playing the home game, it wasn't, so I re-ran the augur export v2 command above. No big deal!

tsibley avatar Sep 16 '22 19:09 tsibley

Thanks for investigating @tsibley

If we cannot communicate something to the user, we shouldn't throw an error in validation on something like "accession".

That just asks for user headaches for no reason. I think the status quo is not acceptable.

Two options: a) we report helpful error details that allow user debugging, or b) we remove validation whose errors cannot be reported, especially when nothing would break.

Looks like we should switch the Python library for validation given that the information is there in theory.

Then we can be tight with validation and add a note that one can skip it using --skip-validation.

This issue has cost Bryan hours so I feel quite strongly about it.

corneliusroemer avatar Sep 16 '22 20:09 corneliusroemer

@corneliusroemer I broadly agree. The validation aspect of augur export is well-intentioned but I think poor (and here, actively bad) in practice. It could be better and actually helpful, but that will take work. We should stop any harm first if improving it will take too much time.

Stepping back a bit: augur export still produces the output file when validation fails, right? So the blocking issue is that, once embedded in a workflow, the validation failure causes the whole workflow to abort? Separate from improving the validation, I wonder if augur export's default behaviour should be to perform validation and emit error messages (which, again, we can improve separately) but not exit with an error solely because validation failed. Other validation modes could be optional, like the existing skip or strict behaviour, e.g.

--validation=skip|none   # aliased as --skip-validation
--validation=error       # like current behaviour
--validation=warn        # new default behaviour I describe above
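A rough sketch of how such a mode switch could behave, using Python's standard `argparse`. Nothing here is augur's actual CLI: `build_parser`, `finish`, and the `export-sketch` program name are hypothetical, and the `warn` default is the proposal above, not current behaviour.

```python
import argparse
import sys


def build_parser():
    """Hypothetical parser for the proposed --validation modes."""
    parser = argparse.ArgumentParser(prog="export-sketch")
    parser.add_argument(
        "--validation",
        choices=["error", "warn", "skip", "none"],
        default="warn",  # proposed new default
    )
    # Existing flag kept as an alias for --validation=skip.
    parser.add_argument(
        "--skip-validation",
        dest="validation",
        action="store_const",
        const="skip",
        default="warn",
    )
    return parser


def finish(mode, errors):
    """Report validation errors and return the exit code for the given mode."""
    if mode in ("skip", "none"):
        return 0  # validation was not requested, so nothing is reported
    label = "ERROR" if mode == "error" else "WARNING"
    for message in errors:
        print(f"{label}: {message}", file=sys.stderr)
    # Only the strict mode turns validation problems into a failing exit code.
    return 1 if (errors and mode == "error") else 0


if __name__ == "__main__":
    args = build_parser().parse_args(["--validation=warn"])
    sys.exit(finish(args.validation, ["accession failed pattern validation"]))
```

With this shape, a workflow would still see the messages on stderr in `warn` mode, but would only abort when `--validation=error` is requested explicitly.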

tsibley avatar Sep 16 '22 23:09 tsibley

Just wanted to leave this here: https://github.com/python-jsonschema/jsonschema/issues/977

Seems like other people have similar issues with oneOf, there may be workarounds.

I don't know what errors validation catches. I'd worry about things slipping through if switching it totally off.

But maybe that worry is unfounded.

If we could figure out which errors are really bad and should be fatal, and which we can let through, that would be great. But I fear the schema only knows valid or invalid.

I'll think about it more, have a look at the schema and what it would catch.

corneliusroemer avatar Sep 16 '22 23:09 corneliusroemer

With #1134, the terrible validation error message in the issue description above becomes:

Validating schema of '/home/tom/Downloads/results/nebraska_hmpxv1/raw_tree.json'...
  .tree {"name": "NODE_0000000", "node_attrs": {"div": 0…} failed oneOf validation for [{"$ref": "#/$defs/tree"}, {"type": "array", "minItems": 1, "items": {"$ref": "#/$defs/tree"}}]
    validation for arm 0: {"$ref": "#/$defs/tree"}
      .tree.children[…].node_attrs.accession "MPXV22/human/USA-NE-199" failed pattern validation for "^[0-9A-Za-z-_.]+$"
      .tree.children[…].node_attrs.accession "MPXV22/human/USA-NE-181" failed pattern validation for "^[0-9A-Za-z-_.]+$"
      .tree.children[…].node_attrs.accession "MPXV22/human/USA-NE-176" failed pattern validation for "^[0-9A-Za-z-_.]+$"
      .tree.children[…].node_attrs.accession "MPXV22/human/USA-NE-127" failed pattern validation for "^[0-9A-Za-z-_.]+$"
      .tree.children[…].node_attrs.accession "MPXV22/human/USA-NE-154" failed pattern validation for "^[0-9A-Za-z-_.]+$"
    validation for arm 1: {"type": "array", "minItems": 1, "items": {"$ref": "#/$defs/tree"}}
      .tree {"name": "NODE_0000000", "node_attrs": {"div": 0…} failed type validation for "array"
FATAL ERROR: Validation of '/home/tom/Downloads/results/nebraska_hmpxv1/raw_tree.json' failed.

A big improvement in my eyes! Note that I've not manually elided any of the error message as @corneliusroemer had to do with [... skip long JSON ...] in the original error.
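The nesting in this output (one block of sub-errors per `oneOf` arm) can be illustrated with a toy validator. This is a sketch only, not the code from #1134 and not the jsonschema library: `validate`, `NODE`, and `SCHEMA` are hypothetical names, the schema is a cut-down, non-recursive stand-in for the real export schema, and only `type`, `pattern`, `properties`, `items`, and `oneOf` are handled.

```python
import re

# Mapping from the few JSON Schema type names this sketch supports.
TYPES = {"object": dict, "array": list, "string": str}

# Simplified stand-ins for the real schema (assumption: no "$ref" recursion).
NODE = {
    "type": "object",
    "properties": {
        "accession": {"type": "string", "pattern": "^[0-9A-Za-z-_.]+$"},
    },
}
SCHEMA = {"oneOf": [NODE, {"type": "array", "items": NODE}]}


def validate(value, schema, path="$"):
    """Return a list of error lines; an empty list means the value is valid."""
    errors = []
    py_type = TYPES.get(schema.get("type"))
    if py_type and not isinstance(value, py_type):
        return [f"{path} failed type validation for {schema['type']!r}"]
    if "pattern" in schema and isinstance(value, str):
        if not re.search(schema["pattern"], value):
            errors.append(
                f"{path} {value!r} failed pattern validation "
                f"for {schema['pattern']!r}"
            )
    if isinstance(value, dict):
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    if "oneOf" in schema:
        # Keep the errors of every arm instead of collapsing them into one
        # generic "not valid under any of the given schemas" message.
        per_arm = [validate(value, arm, path) for arm in schema["oneOf"]]
        if sum(1 for arm_errors in per_arm if not arm_errors) != 1:
            errors.append(f"{path} failed oneOf validation")
            for n, arm_errors in enumerate(per_arm):
                errors.append(f"  validation for arm {n}:")
                errors.extend(f"    {e}" for e in arm_errors)
    return errors


if __name__ == "__main__":
    for line in validate({"accession": "MPXV22/human/USA-NE-127"}, SCHEMA):
        print(line)
```

The point of the design is that the failing arm's own errors (here the `pattern` failure) are printed beneath the `oneOf` failure, so the reader can tell which arm was the intended one and what went wrong inside it.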

There are aspects of this I think we could improve further, but I want to get this initial improvement out first.

tsibley avatar Jan 24 '23 18:01 tsibley