WIP: New clades subcommand that works like traits, using labeled tips rather than clades.tsv

Open corneliusroemer opened this issue 2 years ago • 3 comments

Currently, assigning clades to internal nodes requires a clades.tsv that lists the defining mutations for each clade. In many cases this isn't a natural way to define clades: people often define them through representative strains instead.

This draft PR is an attempt to offer an alternative clades command that does clade assignment more like trait inference/ancestral reconstruction: from a set of labeled tips to internal nodes and unlabeled tips.

Such a command will be particularly useful for bootstrapping reference trees for Nextclade datasets. However, there are many more use cases, e.g. getting lineages onto internal nodes in ncov.

Currently, the implementation is a stripped-down version of traits with the confidence/entropy/model output removed. However, this is an implementation detail that will probably change, so it's best not to focus on that part.

It would be great to get feedback in general, but in particular on the following question: how should this command be included? Should it be a new subcommand, as it is right now under the placeholder name clades2 (to be replaced by a better name), or should we put the functionality inside augur clades and gate it behind a --mode?

My gut preference is to make a new subcommand, as the inputs to the command are quite different: a metadata.tsv and a metadata column name instead of a nuc_mutations.json and a clades.tsv. That said, we might also want to avoid a proliferation of new subcommands.

Some limitations of the current implementation:

  • clades can be non-monophyletic
  • hierarchy information is not taken into account (i.e. the internal node at a junction of A, A.1.1 and A.1.2 will not be A.1 but one of those three). In theory, hierarchy could be taken into account, but this can be added later
  • the current implementation cannot deal with more than 300 clades; this limitation is easily removed by using ancestral reconstruction (e.g. parsimony) instead of the current mugration model
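
To make the parsimony alternative concrete, here is a minimal sketch of Fitch-style reconstruction from a partial set of labeled tips. It is not what the PR or TreeTime implements; `Node` and `fitch_assign` are hypothetical names, the tree is a toy structure rather than a Bio.Phylo object, and the handling of unlabeled tips and polytomies (intersection of all informative children, else union) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """Toy tree node (hypothetical; stands in for a real tree library)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    candidates: Set[str] = field(default_factory=set)  # Fitch state set
    clade: Optional[str] = None                        # final assignment


def fitch_assign(root: Node, tip_labels: Dict[str, str]) -> None:
    """Assign a clade to every node given labels for some of the tips."""

    def up(node: Node) -> None:
        # Post-order pass: build the candidate set of each node.
        if not node.children:
            label = tip_labels.get(node.name)
            node.candidates = {label} if label else set()
            return
        informative = []
        for child in node.children:
            up(child)
            if child.candidates:  # unlabeled subtrees carry no information
                informative.append(child.candidates)
        if not informative:
            node.candidates = set()
            return
        common = set.intersection(*informative)
        node.candidates = common if common else set.union(*informative)

    def down(node: Node, parent_clade: Optional[str]) -> None:
        # Pre-order pass: pick one state per node, preferring the parent's.
        if parent_clade in node.candidates:
            node.clade = parent_clade
        elif node.candidates:
            node.clade = min(node.candidates)  # arbitrary, deterministic tie-break
        else:
            node.clade = parent_clade  # fully unlabeled subtree inherits
        for child in node.children:
            down(child, node.clade)

    up(root)
    down(root, None)


# Usage: t2 is unlabeled and should inherit "A" from its neighbourhood.
t1, t2, t3, t4 = Node("t1"), Node("t2"), Node("t3"), Node("t4")
n1, n2 = Node("n1", [t1, t2]), Node("n2", [t3, t4])
root = Node("root", [n1, n2])
fitch_assign(root, {"t1": "A", "t3": "B", "t4": "B"})
print({n.name: n.clade for n in (root, n1, n2, t1, t2, t3, t4)})
```

Nothing here caps the number of distinct labels. Two of the limitations listed above still apply to this sketch: clades can come out non-monophyletic, and the root tie between A and B is broken arbitrarily rather than by any hierarchy. The recursion is also naive and would need to be made iterative for very deep trees.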

corneliusroemer, Oct 26 '23 17:10

@corneliusroemer Can you say a little more about why the current augur traits command isn't a good solution to the problem? From this PR, it looks like some of the major differences between the proposed new command (clades2), clades, and traits are:

  • clades2 provides an option to define the output attribute name that traits lacks and that clades provides through --membership-name
  • neither clades2 nor traits provide branch attribute annotations that clades provides on the first internal node for each distinct clade
  • neither clades2 nor traits provide an argument to set clade labels that clades provides through the --label-name argument

Maybe another question is how important it is for the proposed new interface to exactly match the functionality provided by clades or traits. I could see value in providing confidence values for clade assignments in the same way that traits provides. I also see value in providing branch attributes in Auspice, so users get human-readable branch labels.

huddlej, Oct 26 '23 18:10

Thanks for the good questions @huddlej!

It's true that one could use traits to do what the command does in the current state of the PR, but we will eventually want to use a different treetime function under the hood (ancestral reconstruction rather than mugration) and that means we can't use traits anymore - unless one were to make clades inference essentially a separate traits mode altogether.

Adding branch label functionality makes sense, but it's not essential for the main use cases I've thought of. The same goes for confidence: it would be nice to have if possible, but depending on the algorithm used for inference, we might not get it (e.g. Fitch/parsimony won't give you confidence).

corneliusroemer, Oct 26 '23 18:10

Thanks, @corneliusroemer. I see now how different this logic needs to be from augur traits. We recently discussed using Nextclade to assign clades in the seasonal flu workflows, too, which would require the kind of functionality you've proposed.

I agree with @victorlin's assessment in the comments above that placing this new functionality in augur clades conveys the shared objective of the command regardless of the different input modes. Using the same subcommand name also suggests that the command outputs will behave consistently across different modes. For example, annotating branch labels is a key feature of the current augur clades and we will need that feature for flu builds. On the other hand, "confidence" does not exist as a clades output that people depend on, so I like the idea of not including it in outputs unless it is necessary.

Do you think @victorlin's example interface would meet your needs as a user, @corneliusroemer? Is there anything you'd change from the UI perspective? We could chat about this synchronously any time you'd like, too, if that's easier than GitHub comments...

huddlej, Dec 01 '23 21:12