
DOCS: More detailed phylogenetic placement docs

Open corneliusroemer opened this issue 2 years ago • 1 comments

I've noticed that I'm not entirely sure how the phylogenetic placement algorithm works, and the docs don't contain the details I'm interested in.

Open questions:

  • Do we treat tails differently from the rest of the sequence? E.g. the first and last 100 bp? Gut feeling: No
  • How are Ns in the reference sequence handled? In particular, the equation in the docs only mentions Ns in the query sequence. Is there an implicit false assumption that reference sequences never contain Ns?
  • Can the (complicated) equation be simplified or at least explained in a simpler way? It seems not quite obvious what is being minimized.

To document:

  • How are ties resolved? -> The tree is traversed in pre-order and the first node with the smallest distance is taken
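The tie-breaking rule above can be sketched roughly as follows. This is only an illustration, not the Nextclade implementation: the `Node` class, its precomputed `distance` field, and the `find_nearest` helper are hypothetical names made up for this example. The point is that a strict `<` comparison during a pre-order walk keeps the first node that reaches the minimum.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `distance` is assumed to be precomputed."""
    name: str
    distance: int
    children: List["Node"] = field(default_factory=list)


def preorder(node: Node) -> Iterator[Node]:
    """Yield the node itself first, then each subtree from left to right."""
    yield node
    for child in node.children:
        yield from preorder(child)


def find_nearest(root: Node) -> Optional[Node]:
    """Return the first node in pre-order that has the smallest distance."""
    best: Optional[Node] = None
    for node in preorder(root):
        # Strict '<' means a later node with an equal distance never
        # replaces an earlier one, so ties go to the first node visited.
        if best is None or node.distance < best.distance:
            best = node
    return best


# Three nodes tie at distance 1; pre-order visits root, A, A1, B.
tree = Node("root", 3, [Node("A", 1, [Node("A1", 1)]), Node("B", 1)])
print(find_nearest(tree).name)  # prints "A"
```

One consequence of this rule, under these assumptions, is that a parent wins over its equally distant descendants, and an earlier sibling wins over a later one.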

The relevant code is, for example, here: https://github.com/nextstrain/nextclade/blob/c9ba8c26f60dba45c366cefa43953cbc5fd785c0/packages/nextclade/src/tree/treeFindNearestNodes.cpp#L56-L68

corneliusroemer, Mar 01 '22 15:03

Do we treat tails differently from the rest of the sequence? E.g. the first and last 100 bp? Gut feeling: No

No, I don't think so.

How are Ns in the reference sequence handled? In particular, the equation in the docs only mentions Ns in the query sequence. Is there an implicit false assumption that reference sequences never contain Ns?

I don't think we handle that. The reference sequence is expected to be a high-quality, complete sequence.

Can the (complicated) equation be simplified or at least explained in a simpler way? It seems not quite obvious what is being minimized.

The distance measures the (dis)similarity between a reference tree node and a query sequence in terms of mutations, and also tries to factor in missing and ambiguous data.

The formula for the distance you see in the docs is implemented here, just a few lines above what you linked:

https://github.com/nextstrain/nextclade/blob/c9ba8c26f60dba45c366cefa43953cbc5fd785c0/packages/nextclade/src/tree/treeFindNearestNodes.cpp#L48

The comments in this function should help a bit. In essence, it just takes counts of certain events in the sequence and then sums them together in an empirical way. Decisions were made, and it happened to work well in practice.
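To make "counts of certain events, summed together" more concrete, here is a rough sketch of what such a distance could look like. It is not copied from the linked function. The event categories, the way they are combined, the function name `placement_distance`, and the dict/set data layout are all assumptions made for this example, so check them against the actual code before relying on them.

```python
from typing import Dict, Set


def placement_distance(
    node_muts: Dict[int, str],
    query_muts: Dict[int, str],
    query_missing: Set[int],
) -> int:
    """Illustrative distance between a tree node and a query sequence.

    node_muts / query_muts map a position to the mutated nucleotide,
    both relative to the same reference. query_missing holds positions
    where the query is N. Assumption: query_muts contains no missing
    positions.
    """
    agree = 0     # same position, same nucleotide in node and query
    disagree = 0  # same position, different nucleotides
    for pos, nuc in query_muts.items():
        if pos in node_muts:
            if node_muts[pos] == nuc:
                agree += 1
            else:
                disagree += 1

    # Node mutations at positions the query did not sequence:
    # we cannot tell whether the query agrees, so they are not counted.
    unknown = sum(1 for pos in node_muts if pos in query_missing)

    # Start from all mutations on both sides, then remove the double
    # counting of shared positions and drop the undetermined ones.
    return len(node_muts) + len(query_muts) - 2 * agree - disagree - unknown


node = {10: "T", 20: "G", 30: "A"}
query = {10: "T", 20: "C", 40: "G"}
print(placement_distance(node, query, query_missing={30}))  # prints 2
```

In the example, position 10 agrees (contributes 0), position 20 differs (1), position 40 is mutated only in the query (1), and position 30 is missing in the query (0), which gives 2.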

None of this is absolute; it was figured out and then refined over time by Richard. As a scientist, this is your field of work, so don't hesitate to experiment, and let me know if you see any improvements there. Richard should be able to give some background.

I wouldn't rule out introducing multiple distance metrics that can be chosen per dataset or with a runtime flag, if that's helpful.

ivan-aksamentov, Mar 07 '22 12:03