nextclade_data Add yellow fever virus dataset

Built from nextclade workflow in yellow fever repo.

Aug 02 '24 20:08 genehack

Thanks @genehack

I allowed myself to resolve the merge conflict in minimizer_index.json (by deleting the file and rebuilding).

The dataset can be tested in Nextclade Web like this:

https://clades.nextstrain.org/?dataset-server=gh:@yellow-fever-dataset@/data_output&dataset-name=nextstrain/yellow-fever/prM-E

(This will also update when you push new changes, once the bot has pushed the rebuild, but make sure to refresh the browser bypassing the cache, or clear the cache.)

The 2000+ example sequences are pushing the limits a little, and many of them don't align, producing bright red rows in the table. This sometimes scares new users when they try to run the examples. It would be nice to select a smaller subset. Example sequences are for demo purposes only, and are sometimes convenient for quick testing and debugging (when the software is modified). A few dozen or a hundred is a typical count: just enough to show off what clades we've got, some interesting science cases, etc.

Otherwise, technically looks good to me.

I will let the science team review the scientific aspect of the dataset - this is the hardest part.

Aug 02 '24 21:08 ivan-aksamentov

I allowed myself to resolve the merge conflict in minimizer_index.json (by deleting the file and rebuilding).

Thanks!

The 2000+ example sequences are pushing the limits a little, and many of them don't align, producing bright red rows in the table. This sometimes scares new users when they try to run the examples. It would be nice to select a smaller subset.

Ah, I think I misunderstood the purpose of sequences.fasta — what I have included is the entire NCBI Dataset for yellow fever — I can certainly go through and try to filter it down to ~50 representative sequences instead.

Otherwise, technically looks good to me.

🙌

I will let the science team review the scientific aspect of the dataset - this is the hardest part.

Sure — I do have some updates to push, and I will provide a cut down set of examples; likely this will be early next week.

Aug 02 '24 22:08 genehack

@genehack

Ah, I think I misunderstood the purpose of sequences.fasta — what I have included is the entire NCBI Dataset for yellow fever

Ideally, the set of examples and the set of samples that end up on the reference tree should not overlap. If they do, the examples will be placed onto the tree exactly where they already are. (It's like training a machine learning model and then testing it on the training data: the results will be amazing, but they do not show how well the model performs in reality.)
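A quick way to check for this kind of overlap is to compare sequence IDs between the two FASTA files. The following is only a minimal sketch: the function names and the toy records are made up for illustration, and it assumes the ID is the first whitespace-delimited token of each header line, which may not match how a given workflow names its records.

```python
from __future__ import annotations


def fasta_ids(fasta_text: str) -> set[str]:
    """Return the sequence IDs (first token of each '>' header) in FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def overlapping_ids(examples_fasta: str, tree_fasta: str) -> set[str]:
    """IDs present both in the example set and in the reference-tree set."""
    return fasta_ids(examples_fasta) & fasta_ids(tree_fasta)


# Toy records (hypothetical IDs), standing in for the contents of the
# example sequences file and the sequences used to build the tree.
examples = ">EX1 some description\nACGT\n>EX2\nACGT\n"
tree_samples = ">REF1\nACGT\n>EX2\nACGT\n"

shared = overlapping_ids(examples, tree_samples)
print(sorted(shared))  # an empty list would mean the two sets are disjoint
```

In practice the two strings would come from reading the real files, e.g. with `pathlib.Path(...).read_text()`.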

Aug 02 '24 23:08 ivan-aksamentov

Ideally, the set of examples and the set of samples that end up on the reference tree should not overlap.

I have force-pushed a new dataset with a sequences.fasta file containing 32 full-length yellow fever virus genomes that have no overlap with the 122 samples used to build the Nextclade tree.

Aug 05 '24 17:08 genehack

(I had to remove data_output/minimizer_index.json, then merge master, and then rebuild to resolve the merge conflict again. Please don't forget to pull if you add more changes.)

~One thing I noticed is that each example sequence is 672 nucs long and then there's a 600+ nuc insertion at the 5' end and a 9000+ nuc insertion at the 3' end. This means that the reference is 672 nucs long, but the example sequences are much longer. Not by a little bit, but like 15 times longer. This is not something I've seen before.~

Another interesting peculiarity I noticed is that even though there are up to 99 nuc mutations in the "South America" clades, they result in at most 7 aa mutations, meaning that most nuc mutations are silent.
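To illustrate what "silent" means here, the sketch below counts nucleotide and amino acid differences between two toy coding sequences using the standard genetic code. It is not how Nextclade computes mutations. It assumes equal-length, in-frame, gap-free, uppercase sequences, and the sequences themselves are made up.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate an in-frame nucleotide sequence, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def count_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy sequences (not real yellow fever data).
reference = "ATGCTGAAAGGT"  # M L K G
query = "ATGCTAAAGGAT"      # M L K D: two silent changes, one amino acid change

nuc_diffs = count_differences(reference, query)
aa_diffs = count_differences(translate(reference), translate(query))
print(nuc_diffs, "nucleotide differences,", aa_diffs, "amino acid differences")
```

Two of the three nucleotide changes leave the encoded amino acid unchanged, which is the pattern described above on a much larger scale.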

Don't know much about this virus though. Maybe that's expected. But it won't hurt to check with the experts.

Aug 06 '24 01:08 ivan-aksamentov

Ah, I guess this answers my first concern :)

These two papers, collectively, define 7 distinct yellow fever virus genotypes based on a 670 nucleotide region of the yellow fever virus genome (bases 641-1310), called the prM-E region. This dataset can be used to assign genotypes to any sequence that includes at least 500 bp of the prM-E region, including whole genome sequences. Sequence data beyond the prM-E region will be reported as an insertion in the Nextclade output.
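The coordinate arithmetic in that description can be sketched as follows. This is only an illustration of the 641-1310 window and the 500 bp threshold quoted above, not how Nextclade itself decides whether a sequence can be assigned (Nextclade aligns the sequence to the reference). The function names are hypothetical, and the start and end positions are assumed to be 1-based, inclusive coordinates on the full genome.

```python
# Values taken from the README text quoted above.
PRM_E_START = 641   # 1-based, inclusive
PRM_E_END = 1310    # 1-based, inclusive
MIN_COVERED = 500   # minimum number of prM-E bases a sequence should include


def prm_e_overlap(seq_start: int, seq_end: int) -> int:
    """Number of prM-E bases covered by a sequence spanning seq_start..seq_end."""
    lo = max(seq_start, PRM_E_START)
    hi = min(seq_end, PRM_E_END)
    return max(0, hi - lo + 1)


def covers_enough_prm_e(seq_start: int, seq_end: int) -> bool:
    """True if the sequence covers at least MIN_COVERED bases of the region."""
    return prm_e_overlap(seq_start, seq_end) >= MIN_COVERED


region_length = PRM_E_END - PRM_E_START + 1
print(region_length)                    # length of the prM-E window
print(covers_enough_prm_e(700, 1250))   # partial amplicon inside the window
print(covers_enough_prm_e(900, 1310))   # amplicon covering only the 3' part
```

A whole genome spans the entire window, so everything outside positions 641-1310 is what shows up as insertions at either end.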

Aug 06 '24 01:08 ivan-aksamentov

(I had to remove data_output/minimizer_index.json, then merge master, and then rebuild to resolve the merge conflict again. Please don't forget to pull if you add more changes.)

Thanks for fixing that up; sorry, I'm not used to needing to propagate changes from the remote back to my branch — branch is set to remote tracking now, so I'll get any new changes.

~One thing I noticed is that each example sequence is 672 nucs long and then there's a 600+ nuc insertion at the 5' end and a 9000+ nuc insertion at the 3' end. This means that the reference is 672 nucs long, but the example sequences are much longer. Not by a little bit, but like 15 times longer. This is not something I've seen before.~

Yeah, as you noted later -- the genotypes are defined in the literature on the basis of this short region towards the 5' end; the example sequences are all full genomes (because the first Nextstrain dataset I'm trying to build is for the full genome, so those were the sequences I had handy).

Another interesting peculiarity I noticed is that even though there are up to 99 nuc mutations in the "South America" clades, they result in at most 7 aa mutations, meaning that most nuc mutations are silent.

It's clear from the literature (I can dig up exact cites if you're interested) that the nucleotide level divergence is much higher than the amino acid level; it's not clearly understood why but it's been noted in several papers I've read.

Aug 06 '24 16:08 genehack

The Readme right now reads like a workflow readme for workflow devs - instead it should be a readme for dataset users. Have a look at the other existing datasets to see the type of information usually included.

That's because it was the workflow readme — not sure how that happened, but I have provided the correct version.

Is there a reason you only use the prM gene? Why not whole genome? Is YFV so recombinant that a full genome tree would be misleading?

Because the 2 papers that originally established the genotypes that are the basis of the clades only looked at that prM-E region. It seems to be a very similar situation to the N450 region of measles, in that it's a frequent sequencing target for probably historical reasons.

The clades are probably defined based on just a short subset of the genome because whole genome sequences weren't available due to expense. For Nextclade, there's usually no reason to not use a full tree.

The whole genome sequences are not systematically annotated with genotypes in the NCBI dataset. The purpose of building this Nextclade dataset from the reference sequences from the two papers linked in the README is so we can use it to assign genotypes to the full genome tree that's in Nextstrain staging. (N.b., that version is the previous build and still uses the geographically-based genotype names, not the clades that this dataset has been updated to use.)

Aug 22 '24 17:08 genehack

@corneliusroemer I think this is ready to merge — could you take a look, please.

Sep 12 '24 17:09 genehack

@corneliusroemer I think this is ready to merge — could you take a look, please.

@corneliusroemer I expect you're quite busy with Pathoplexus, but another look at this new dataset would be appreciated.

Sep 26 '24 17:09 genehack

Thanks for the ping, I'll have a look!

Sep 26 '24 18:09 corneliusroemer

Thanks for the ping, I'll have a look!

Hi @corneliusroemer any chance to have that look yet?

Oct 15 '24 21:10 genehack

Looks good, save a few more tweaks to parameters/typos

Thanks, tweaks applied and pushed.

I will plan to merge this Tuesday the 22nd in the absence of further feedback.

Oct 17 '24 17:10 genehack