auspice icon indicating copy to clipboard operation
auspice copied to clipboard

Zoom to node by node name in the label URL param

Open sidneymbell opened this issue 4 years ago • 30 comments

Context
As I'm writing the weekly situation reports, it would be a huge help to be able to deep link to a view of the tree that's zoomed into any arbitrary node, rather than having to define a specific clade label to highlight some portion of the data. I currently waste a lot of time figuring out how to hack around this :)

Description
Currently, we use deep linking to enable sharing of a specific view of the data. I'd like to add one additional field to this to enable sharing a view that's zoomed to any specific node in the tree. This change doesn't need to be in the UI, just the URL.

Example
I don't know how the zooming is currently implemented in the codebase, but I'd imagine one option could be something like:

1 - I click on a node to zoom into it
2 - Under the hood, that node's name attribute (unique, already included in the JSON) is added to state. Ex: {'node_zoom': 'NODE_0002547'}
3 - That node ID is added to the URL, something like nextstrain.org/ncov?node_zoom=0002547 4 - I then share that URL, and this replicates the app's zoom state for the next viewer.

sidneymbell avatar Apr 01 '20 23:04 sidneymbell

I'm guessing this is at least theoretically possible, given that zooming is available via https://nextstrain.org/ncov/2020-04-01?label=clade:A2a, etc, although this might actually be acting as a filter / would maybe require a flag assigned to each tip. Not sure if this would be a perf issue.

sidneymbell avatar Apr 01 '20 23:04 sidneymbell

As I said on slack, I really want to avoid this, but I do have an alternative. Comments below adapted from slack messages:

I designed zoom not to be able to zoom into an internal node via the node name as it changes between builds -- NODE_0002547 will represent different clades. This is dangerous, as the URL hasn't changed & these get shared. (I realise the narrative datasets we are using for nCoV sit-reps will not change, but that's not typical.)

A potential auspice solution would be something along the lines of allowing clade zooming via ?label=mrca(<tipName1>,<tipName2>).

A non auspice solution would be to manually add a branch label to the nodes you want to zoom into for each narrative. Or write a script to label every single internal node for these datasets with a label of "internal_node" which will then be usable via ?label=internal_node:X (be aware this would increase the JSON size, but perhaps not too much with compression). You can choose the label key & value (i.e. it doesn't have to be "clade"), and you could hide it from view.

jameshadfield avatar Apr 01 '20 23:04 jameshadfield

That makes sense, thanks for clarifying @jameshadfield (missed this message, apologies).

Is there no way to make the node names (or another field) unique (e.g., generate them randomly or append a hash of the build date), such that if it's the wrong dataset, it just omits the zoom?

sidneymbell avatar Apr 01 '20 23:04 sidneymbell

Re: alternatives, thanks for thinking through this. However, while these are nice workarounds, I don't think they'd quite meet the need

  • the label=mrca(a,b,c) will very quickly get too long for deep linking if used for large clades
  • I can certainly label the JSON myself, but that would require either (A) changing the JSON for the official Nextstrain build, or (B) creating a separate archive of ncov JSONs and running on a community build, which I'd like to avoid.

I also think this feature would be very widely useful, beyond just my needs for narratives.

sidneymbell avatar Apr 02 '20 00:04 sidneymbell

In a strictly bifurcating & rooted phylogeny, any clade will always be definable as the common ancestor of two nodes. Thus, I think mrca(a,b) should work for nextstrain trees and is generalisable.

jameshadfield avatar Apr 02 '20 00:04 jameshadfield

Ah, yes, you are right. Although again, I think this would be an incredibly useful feature for others.

While I'm very comfortable thinking in terms of the MRCA, many other users are not. Relying upon this kind of implementation, as opposed to just generating the node names slightly differently, would make it very indiscoverable and difficult to use.

On Wed, Apr 1, 2020 at 5:19 PM james hadfield [email protected] wrote:

In a strictly bifurcating & rooted phylogeny, any clade will always be definable as the common ancestor of two nodes. Thus, I think mrca(a,b) should work for nextstrain trees and is generalisable.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextstrain/auspice/issues/1036#issuecomment-607552393, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAIYX4SHCUWDWUHZA7NJODRKPKZXANCNFSM4LZWE7GQ .

sidneymbell avatar Apr 02 '20 00:04 sidneymbell

You are right that it would be widely used, which would result in URLs shared (e.g. twitter) which zoomed into completely different parts of the tree depending on which day you looked!

I understand that my solution would be harder to use than just clicking on a branch, but this is much preferable to the dangers of relying on internal node names. If you want to pursue the ?label=mrca(a,b) solution feel free to create an issue. It's an interesting problem to implement and I imagine the community could come up with something in the short term. I could also envisage a UI feature to simplify the use of this, e.g. shift-clicking on 2 tips in succession zoomed into the clade & updated the URL accordingly.

jameshadfield avatar Apr 02 '20 01:04 jameshadfield

Hey @jameshadfield -- having noodled on this some more, I think I have a compromise that could address the concern ("don't populate twitter with a bunch of mis-zoomed and misleading tree links") and the struggle described in the issue.

1 - In the augur build, we tweak the way that node.name is assigned to include a short prefix that is unique to the date of the build (i.e., hash* the date -- XXXX). This would yield a minor increase in the size of the JSON. 2 - I add node_zoom=XXXX002547 to the URL 3 - Someone visits nextstrain.org/ncov?node_zoom=XXXX002547 4 - If that node is found in the tree, the app's zoom state is replicated for the next viewer
5 - If the build has been rerun the next day, the node will not be found in the tree as the prefix has changed, and that argument is ignored (defaulting to the full tree)

* one option would be to truncate the SHA1 hash or something similar; I'd be happy to help with this part

What are your thoughts? :)

sidneymbell avatar Apr 30 '20 19:04 sidneymbell

I just ran face-first into this issue, creating a Narrative to share internally w/ public health. While I can hack the json for now, I would prefer a more stable and simple solution.

Both the mrca solution and the node_zoom solution require the ability to easily identify an internal node. Having located a desired zoom by browsing in auspice manually, I would like to be able to easily capture the information I need to generate the appropriate zoom for the Narrative. Either a way to see the name of the node or to programmatically dump the relevant leaves for which it is a mrca. (I'm currently navigating an example and having trouble finding the node, as it happens not to have any unique mutations I can grep for.)

Neither node names as currently generated nor mrcas are future-proof, because new trees are being generated with different subsets of the available data on GISAID (or privately). A mrca call may be missing one or both of the leaves in a future build.

A node_zoom where the node names include a hash specific to the run, as @sidneymbell proposed, would work for me, so long as I could easily locate the node name.

batson avatar Apr 30 '20 20:04 batson

@batson -- Would it work for you if the URL was just updated automatically as you browse and zoom on the tree? We currently do this for most other attributes of the viz, such that sharing is copy-paste.

sidneymbell avatar Apr 30 '20 20:04 sidneymbell

@sidneymbell, that would be perfect. give the nodes names you're okay exposing, and then just keep that state in the URL. it's really nice to be able to copy/paste that to share with others (and with my future self).

batson avatar Apr 30 '20 21:04 batson

Currently internal nodes are sequentially named by TreeTime as NODE_%07d starting from NODE_0000000. These are a minimum of ~11~ 12 characters long and of course not unique.

We could very easily generate globally unique node names instead by generating UUIDv4s and representing them in base 62, e.g.:

from bases import Bases
from uuid import uuid4
Bases().toBase62(uuid4().int)

which produces ~22 character long strings like 2Q58jzs6SkqSGcs8I8YuUe.

Augur does appear to rely on the NODE_ prefix in one place, so we could keep the prefix, choose a shorter prefix, or adjust how that code identifies internal nodes. I'm not sure why the input alignment to that code would include internal nodes in the first place, so maybe it's a holdover from dev/prototyping and not actually necessary in practice.

In terms of compat with previous outputs though, probably best to keep the NODE_ prefix, as in NODE_2Q58jzs6SkqSGcs8I8YuUe.

tsibley avatar Apr 30 '20 22:04 tsibley

Oh my, that's a far simpler and more elegant suggestion, Tom :)

(Sidenote: I'm guessing the alignment potentially including internal nodes is related to ancestral state reconstructions)

sidneymbell avatar Apr 30 '20 22:04 sidneymbell

Just bumping the idea that having the name of the selected node for zoom in the URL would be useful--I just unclicked from something in the browser and really had trouble finding the node again. Had it been the URL I could have used the back button.

This might have a conflict with the nice animations moving between states cc @colinmegill . Would those still work if the url was changing?

batson avatar May 02 '20 02:05 batson

Just as a note - the URL changes should be possible - we already do this with nodes labelled as clades, so the functionality is there. However, this would need changes in Treetime (to name internal nodes in a unique way) (and we'd want to be a little cautious here - Treetime has used this naming structure for a quite a while, so we might unintentionally break things for other users - we might want to make this an option one can 'turn on' for unique naming).

We'd also then need to change the functionality for auspice, to treat all internal nodes essentially like clades. Again, we'll need to think this through as we need it to not conflict with clade labels - what do we do if a node has both? We may want to make 'show node labels in URL' a toggleable option, too.

Finally, the URL changes are all processed within auspice rather than as page reloads, so unfortunately they will not work with the back-button functionality! (As they currently do not)

emmahodcroft avatar May 02 '20 10:05 emmahodcroft

Finally, the URL changes are all processed within auspice rather than as page reloads, so unfortunately they will not work with the back-button functionality! (As they currently do not)

Whoops! Thank you for the clarification.

batson avatar May 02 '20 18:05 batson

I think this proposal should work & will be a very useful addition 👍. My recommendations for the implementation:

  • Define a node-name-regex (e.g. NODE_ + n character hash or similar) within the JSON schema as defined in the augur repo. If nodes are in this format it indicates they will not appear in future (or past) trees.
  • Modify augur refine to output internal nodes in this format
  • Implement a check in the JSON parsing within auspice to decide if a tree uses one-time node names as defined above.
  • Clicking on branches in auspice updates the URL via label=node:NODE_<hash>, similar to label names. This would mean enforcing (via the schema) that we can't have a label called node.
  • If the branch clicked on has a label defined (e.g. "clade"), then that should become part of the URL not the node name.
  • On page reloading, if the node is not found in the dataset then the URL query is silently deleted. If it exists, we zoom the tree (leveraging the machinery already there for zoom-via-label).

the URL changes are all processed within auspice ... so unfortunately they will not work with the back-button functionality

This is beacues most of the time we replace rather than push the URL state, but not always -- e.g. dataset changes are all within the same auspice app but create history that is navigable via the browser back button.

jameshadfield avatar May 04 '20 05:05 jameshadfield

Wonderful, thanks so much everyone! I'll take a first pass at implementing some of this and tag people in as I get stuck.

Cheers,

SB

On Sun, May 3, 2020 at 10:25 PM james hadfield [email protected] wrote:

I think this proposal should work & will be a very useful addition 👍. My recommendations for the implementation:

  • Define a node-name-regex (e.g. NODE_ + n character hash or similar) within the JSON schema as defined in the augur repo. If nodes are in this format it indicates they will not appear in future (or past) trees.
  • Modify augur refine to output internal nodes in this format
  • Implement a check in the JSON parsing within auspice to decide if a tree uses one-time node names as defined above.
  • Clicking on branches in auspice updates the URL via label=node:NODE_, similar to label names https://nextstrain.github.io/auspice/advanced-functionality/view-settings#url-query-options. This would mean enforcing (via the schema) that we can't have a label called node.
  • If the branch clicked on has a label defined (e.g. "clade"), then that should become part of the URL not the node name.
  • On page reloading, if the node is not found in the dataset then the URL query is silently deleted. If it exists, we zoom the tree (leveraging the machinery already there for zoom-via-label).

the URL changes are all processed within auspice ... so unfortunately they will not work with the back-button functionality

This is beacues most of the time we replace rather than push the URL state https://github.com/nextstrain/auspice/blob/master/src/middleware/changeURL.js#L229-L233, but not always -- e.g. dataset changes are all within the same auspice app but create history that is navigable via the browser back button.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextstrain/auspice/issues/1036#issuecomment-623263664, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAIYX4RT2QMYJ4WPT67E5DRPZGWPANCNFSM4LZWE7GQ .

sidneymbell avatar May 04 '20 21:05 sidneymbell

So, clearly this fell off my to-do list. Sorry, folks.

I did spend quite a bit of time considering an implementation approach, though. I'd propose the following:

  • In augur: Add a top-level entry to the exported JSON: {dataset_id: <defaults to timestamp>, ...}

  • In some reducer in auspice: state.dataset_id = json.dataset_id ? null

  • When a node is selected:

    • if !node.isclade & dataset_id --> add label=datasetID_nodename to URL
    • else: current behavior
  • When a URL is shared with a label query in it:

    • if clade is specified, follow current clade behavior and remove label query
    • if dataset_id matches and node name is in the tree --> zoom to node
    • else: remove the label query

I should have more bandwidth to revisit this once I return from vacation July 15. I'm happy to prioritize this then. But, if someone else has time to work on this sooner, that'd be amazing.

sidneymbell avatar Jun 27 '20 00:06 sidneymbell

I don't think a dataset id is necessary if we use the method I outlined above for assigning single-use universally unique names to internal nodes. The benefit of this approach is that its transparent to most Nextstrain components, i.e. nothing needs to know about a special dataset id.

On the other hand, adding a single-use dataset id does mean that internal node names don't need to change, however more pieces of the stack then need to know about this new "dataset id" field and how to handle it.

tsibley avatar Jul 01 '20 21:07 tsibley

Yeah for sure. My main thought was to avoid increasing the json file size too much, but if we're not worried about that, I'm all for the uuids :)

On Wed, Jul 1, 2020, 2:36 PM Thomas Sibley [email protected] wrote:

I don't think a dataset id is necessary if we use the method I outlined above for assigning single-use universally unique names to internal nodes. The benefit of this approach is that its transparent to most Nextstrain components, i.e. nothing needs to know about a special dataset id.

On the other hand, adding a single-use dataset id does mean that internal node names don't need to change, however more pieces of the stack then need to know about this new "dataset id" field and how to handle it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextstrain/auspice/issues/1036#issuecomment-652658528, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAIYX6TWTTLT5L7XSFV4LTRZOT4LANCNFSM4LZWE7GQ .

sidneymbell avatar Jul 02 '20 01:07 sidneymbell

My main thought was to avoid increasing the json file size too much, but if we're not worried about that, I'm all for the uuids :)

Ah, good consideration! Some quick back of the envelope math:

The current ncov_global.json contains 2,552 internal nodes (name starts with NODE_). Those names are currently 12 characters long. The new UUID-based names would be ~27 characters long. In UTF-8, those chars are 1 byte each.

  2,552 internal nodes
× (27 byte new names − 12 byte old names)
/ 1,000 bytes per KB
= 38KB increase

Out of a 31MB file (uncompressed), 38KB seems like no big deal. That said, the base 62 numbers are not as compressible. Empirical testing shows a slightly larger inflation:

tom@whunk nextstrain [1] $ du -B KB ncov_global*
31818kB ncov_global.json
31855kB ncov_global_uuids.json
tom@whunk nextstrain [1] $ gzip ncov_global*
tom@whunk nextstrain [1] $ du -B KB ncov_global*
1446kB  ncov_global.json.gz
1491kB  ncov_global_uuids.json.gz

But that still seems ok.

tsibley avatar Jul 02 '20 23:07 tsibley

Thanks everyone for thinking through this. I like the UUID node name strategy. Very clever. I agree with the strategy outlined by James in the May 3 post: https://github.com/nextstrain/auspice/issues/1036#issuecomment-623263664 as the general way to tackle things.

trvrb avatar Jul 02 '20 23:07 trvrb

Hi! I am a bioinformatician who just started working for the department of health in Hawaii a few weeks ago. I have ran a few local nextstrain builds that include data not submitted to GISAID and started writing a few narratives for a larger audience (public health officials, epidemiologists). The zoom feature that Sidney requested in this issue seems essential to communicate observations about particular clusters that are buried in the tree.

I was wondering if anybody had the chance to implement this feature, as suggested by @jameshadfield on April 1st and detailed by @tsibley on Jul 1-2. Looking at the ncov_global.json as it is today on data.nextstrain.org:

                        "name": "NODE_0002463",
                        "node_attrs": {
                          "clade_membership": {
                            "value": "19A"
                          },

it seems the answer is no, or that the feature hasn't made it to production.

I could give it a shot, but I don't want to duplicate any existing effort. Regards, Razvan

razsultana avatar Sep 28 '20 19:09 razsultana

Hi @razsultana -- as far as I'm aware no progress has been made on an implementation. For an immediate solution you could manually label the nodes you wish to zoom to in the tree (see this comment).

jameshadfield avatar Sep 28 '20 20:09 jameshadfield

Thanks, @jameshadfield ! Adding an "internal_node:my_node" label is a very quick fix and it works, but editing a 85MB json file can be tricky (finding the exact branch that you want modified).

In fact, I used an already existing branch attribute - the "mutations": {"nuc":["A20268G"]} one (which is also part of the definition of an emerging clade) - to find that particular branch that I wanted labeled.

I think that this use case to navigate to a particular part of a tree defined by an emerging clade is going to be a very frequent occurrence - so maybe it deserves to be implemented as well (unless I'm missing something and it's already possible to do that).

razsultana avatar Sep 29 '20 01:09 razsultana

In our Switzerland build, we label all branches with mutations to allow useful zooming in most cases:

https://github.com/neherlab/ncov-swiss/blob/master/swiss_profile/localrules.smk#L6

https://github.com/neherlab/ncov-swiss/blob/master/scripts/add_labels.py

rneher avatar Sep 29 '20 09:09 rneher

That is wonderful, Richard! Problem solved.

This is another example of convergent evolution, towards the solution that solved an urgent problem using the least amount of energy :) And in my case the energy was almost none, because of the lateral gene transfer - thank you, Richard!

BTW, can I just say that your software (augur, auspice, nextstrain website) is awesome! I think the science world is light years ahead because of it and this pandemic would have been quite different without it. You guys should be very proud!

razsultana avatar Sep 29 '20 19:09 razsultana

Glad it helped! thanks for your kind words!

rneher avatar Sep 29 '20 19:09 rneher

As discussed here this sounds like it requires an Augur solution, see nextstrain/augur#1068 for a recent duplicate.

corneliusroemer avatar Nov 02 '22 23:11 corneliusroemer