Add automatic detection of virus type

Open ivan-aksamentov opened this issue 3 years ago • 3 comments

If and when support for multiple viruses is added (#213), we might also add automatic detection of the virus type from the sequences. This would, for instance, conveniently fill the virus type selection dropdown.

ivan-aksamentov avatar Nov 05 '20 00:11 ivan-aksamentov

Some additional thoughts from the internal discussion:

The idea is to take the first sequence in the input fasta file, run the seed matching part of the alignment algorithm against every reference sequence, and see which reference produces "the best" seeds. Not sure how well it would work in the presence of different ref sequences, how to define "bestness", or whether the lengths of the viral genomes have any consequence.

Also, if the first sequence in fasta happens to be low quality or otherwise wrong, it may give funny results.
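
To make the "bestness" question more concrete, here is a minimal sketch of one possible definition: the fraction of exact k-mer seeds from the query that occur in each reference. This is not Nextclade's actual seed matching; the function names, the seed length `k`, the spacing `step` and the `min_score` threshold are all assumptions made for illustration. Normalizing by the number of query seeds is one way to reduce the effect of genome length on the score.

```python
from typing import Dict, Optional, Tuple


def kmers(seq: str, k: int) -> set:
    """All overlapping k-mers of a sequence, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seed_score(query: str, reference: str, k: int = 8, step: int = 4) -> float:
    """Fraction of query seeds (k-mers taken every `step` bases) found exactly in the reference."""
    ref_kmers = kmers(reference, k)
    query = query.upper()
    seeds = [query[i:i + k] for i in range(0, len(query) - k + 1, step)]
    # Seeds containing ambiguous bases carry no information, so skip them.
    seeds = [s for s in seeds if "N" not in s]
    if not seeds:
        return 0.0
    return sum(s in ref_kmers for s in seeds) / len(seeds)


def detect_reference(
    query: str,
    references: Dict[str, str],
    k: int = 8,
    step: int = 4,
    min_score: float = 0.5,  # hypothetical threshold, would need tuning on real data
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return the best-scoring reference name (or None if nothing scores well) and all scores."""
    scores = {name: seed_score(query, ref, k, step) for name, ref in references.items()}
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return (best if scores[best] >= min_score else None), scores
```

Returning `None` below the threshold lets the caller fall back to manual selection when the query is low quality or belongs to none of the candidate viruses.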

ivan-aksamentov avatar Aug 27 '21 19:08 ivan-aksamentov

@trvrb noted that this:

the first sequence in fasta happens to be low quality

can be mitigated by taking ~10 sequences at random from the FASTA (instead of just the first sequence) and comparing each to all references.

If I understand correctly, taking more than 1 sequence might also make the choice of a particular ref sequence more robust, even if the ref sequences to choose from are very similar.
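
A rough sketch of the sampling idea, assuming a simple majority vote over the sampled sequences. The scoring function is a stand-in (fraction of exact k-mer matches), not the real seed matching or alignment, and `vote_for_reference`, `n_samples` and `min_score` are hypothetical names and values.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple


def seed_score(query: str, reference: str, k: int = 8) -> float:
    """Stand-in score: fraction of the query's k-mers that occur exactly in the reference."""
    reference = reference.upper()
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    query = query.upper()
    seeds = [query[i:i + k] for i in range(len(query) - k + 1)]
    seeds = [s for s in seeds if "N" not in s]  # ignore ambiguous seeds
    if not seeds:
        return 0.0
    return sum(s in ref_kmers for s in seeds) / len(seeds)


def vote_for_reference(
    sequences: List[str],
    references: Dict[str, str],
    n_samples: int = 10,
    min_score: float = 0.5,  # assumed threshold below which a sequence does not vote
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[str], Counter]:
    """Score a random sample of sequences against all references and take a majority vote."""
    rng = rng or random.Random()
    sample = rng.sample(sequences, min(n_samples, len(sequences)))
    votes: Counter = Counter()
    for seq in sample:
        scores = {name: seed_score(seq, ref) for name, ref in references.items()}
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            votes[best] += 1
    if not votes:
        # None of the sampled sequences matched any reference well enough.
        return None, votes
    return votes.most_common(1)[0][0], votes
```

A single low-quality sequence then costs one vote at most instead of deciding the result on its own.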

@rneher points out that the alignment might be necessary (not only seed matching as I wrote above) to distinguish between similar reference sequences.

ivan-aksamentov avatar Aug 27 '21 19:08 ivan-aksamentov

A couple of UX challenges involved:

What will the workflow look like for the user compared to what we have now?

Do we mandate autodetection? What if the user just wants to pick a dataset manually, especially if the autodetection provides incorrect results?

Do we run the analysis right after autodetection is complete, or do we wait for the user to confirm?

There are a few technical challenges involved as well:

  • The ref sequences of all candidate datasets need to be gathered and downloaded. As datasets change, the autodetection result may change as well, which may or may not be desired.
  • The fasta file needs to be parsed in order to pick the samples. The fasta parser is currently in the C++ code.
  • The fasta file needs to be parsed fully in order to pick the samples uniformly, which might take quite some time for large files.
  • The alignment algorithm is in C++ now as well.
  • Running C++ code requires first downloading and instantiating a WASM module, spawning a WebWorker thread, and marshaling the data over. All that takes a second or two, depending on connection speed and hardware performance. Spawning WASM may take longer than the detection itself, and even longer than picking the dataset manually.
  • An edge case: if we are not lucky, alignment/seed matching can potentially fail for all of the randomly picked sequences.
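
On the uniform sampling point: one option is reservoir sampling, which still reads the whole file but does so in a single streaming pass and keeps only the sampled records in memory. The sketch below is illustrative only; `read_fasta` is a deliberately minimal parser written for this example (not the existing C++ or JS parser), and all names are hypothetical.

```python
import io
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Minimal streaming FASTA reader: yields (name, sequence) one record at a time."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reservoir_sample(
    records: Iterable[Record], n: int, rng: Optional[random.Random] = None
) -> List[Record]:
    """Pick up to n records uniformly at random in one pass (reservoir sampling, Algorithm R)."""
    rng = rng or random.Random()
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            # Record i replaces a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir


# Usage: sample 2 records from an in-memory FASTA.
fasta_text = ">s1\nACGT\nACGT\n>s2\nTTTT\n>s3\nGGGG\n"
picked = reservoir_sample(read_fasta(io.StringIO(fasta_text)), 2)
```

If the full pass is too slow for large files, a cheaper but non-uniform alternative would be to sample only from the first few hundred records. For the edge case where detection fails for every picked sequence, the caller could resample or fall back to manual dataset selection.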

There is definitely some engineering to be done here.

We might blow the dust off the JS implementation of fasta parsing and alignment, which would avoid WASM. But that's basically duplicated code to maintain, and it might be somewhat dated. On the other hand, it does not have to match the actual alignment code or be perfect at all.

Alternatively, we can go with a wait spinner during autodetection, but then reuse the workers and WASM modules for the main analysis, so at least users don't have to wait twice.

ivan-aksamentov avatar Aug 27 '21 19:08 ivan-aksamentov