Refactor API calls within LineageComparisonComponent

Open flaneuse opened this issue 2 years ago • 0 comments

Right now, the API calls made by LineageComparisonComponent (outbreak.info/compare-lineages) return very large payloads, which makes the page slow. Our API backend also often crashes when this endpoint receives too many requests, because of the amount of data being shuttled around.

To create the heatmap on the page, the function getLineagesComparison calls getCharacteristicMutations(apiurl, lineage, 0, true, includeSublineages), which returns all mutations within a lineage, and then filters the result to any mutation which appears in the lineage at a prevalence greater than the frequency threshold (default = 0.75). Querying with frequency = 0 is necessary, because if you set frequency = 0.75 in the query itself, you would be missing data for mutations which exist in the lineage below the threshold:

Incorrect: missing cells for B.1.427 x A67V, B.1.427 x DEL69/70, B.1.427 x T95I, etc., which implies those mutations have not been found in the lineage, as opposed to "have been found, but at low prevalence": (Screenshot 2022-11-18 at 12 25 36 PM)

Correct but super slow, since the frequency=0 query is HUGE: (Screenshot 2022-11-18 at 12 26 07 PM)
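
To make the difference between the two screenshots concrete, here is a minimal Python sketch of the filtering step. It is not the front-end code: `heatmap_cells`, the data shape, and the prevalence numbers are all made up for illustration, and treating the threshold as inclusive is an assumption.

```python
from typing import Dict, Optional

# lineage -> {mutation -> prevalence within that lineage}
Prevalence = Dict[str, Dict[str, float]]


def heatmap_cells(
    data: Prevalence, threshold: float = 0.75
) -> Dict[str, Dict[str, Optional[float]]]:
    """Build heatmap cells for every mutation that reaches `threshold`
    in at least one lineage. A cell is None when `data` has no value for it."""
    # Rows: mutations at or above the threshold in any lineage
    # (whether the real comparison is inclusive is an assumption).
    rows = sorted(
        {m for muts in data.values() for m, p in muts.items() if p >= threshold}
    )
    return {lineage: {m: muts.get(m) for m in rows} for lineage, muts in data.items()}


# Made-up prevalence values, for illustration only (not real outbreak.info data).
full_response = {
    "BA.3": {"s:a67v": 0.95, "s:d614g": 0.99},
    "B.1.427": {"s:a67v": 0.01, "s:d614g": 0.99, "s:l452r": 0.97},
}

# What a frequency = 0.75 query would return: low-prevalence entries are gone.
thresholded_response = {
    lineage: {m: p for m, p in muts.items() if p >= 0.75}
    for lineage, muts in full_response.items()
}

# Full data keeps the low-prevalence value for the cell.
print(heatmap_cells(full_response)["B.1.427"]["s:a67v"])         # 0.01
# Thresholded data leaves the same cell empty.
print(heatmap_cells(thresholded_response)["B.1.427"]["s:a67v"])  # None
```

With the full data the B.1.427 x A67V cell carries a low value; with the pre-thresholded data it is simply absent, which is the "Incorrect" case.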

To improve this, we could first get only the mutations which occur above that threshold in at least one of the selected lineages, then calculate the prevalence of each of those mutations in each lineage.

For instance, for the BA.3 and B.1.427 comparison page:

  1. The initial API call should identify the mutations which occur in either of those lineages (BA.3 or B.1.427) at a prevalence of 75% or greater. Just looking at gene == "S", this should identify the following set of mutations for each:
     BA.3: ['s:g142d', 's:n211i', 's:d614g', 's:h655y', 's:n679k', 's:a67v', 's:del69/70', 's:n969k', 's:q954h', 's:d796y', 's:p681h', 's:del143/145', 's:del212/212', 's:t95i', 's:n764k']
     B.1.427: ['s:d614g', 's:l452r', 's:s13i', 's:w152c']
  2. Then, you can call https://api.outbreak.info/genomics/mutations-by-lineage with mutations set to each of the mutations and pangolin_lineage set to each of the lineages (e.g. https://api.outbreak.info/genomics/mutations-by-lineage?mutations=S:A67V&pangolin_lineage=BA.3). You can combine mutations with AND to cover several of them in a single call. However, mutations that don't exist within the lineage (like S:S13I in BA.3) will cause the entire API call to fail with a status code of 500.
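
The two steps above could be planned roughly as follows. This is a sketch in Python rather than the front-end's own code, and it only builds the request URLs without sending anything. The helper names (`union_of_mutations`, `build_urls`) are hypothetical, the query parameters are taken from the example URL above, and upper-casing the mutation names to match that URL is an assumption. It issues one URL per (lineage, mutation) pair instead of combining mutations with AND, so a mutation that is absent from a lineage cannot take down the other lookups.

```python
from itertools import product
from typing import Dict, List, Tuple
from urllib.parse import urlencode

ENDPOINT = "https://api.outbreak.info/genomics/mutations-by-lineage"

# Step 1 result, copied from the lists above (gene S, prevalence >= 75%).
CHARACTERISTIC: Dict[str, List[str]] = {
    "BA.3": ["s:g142d", "s:n211i", "s:d614g", "s:h655y", "s:n679k", "s:a67v",
             "s:del69/70", "s:n969k", "s:q954h", "s:d796y", "s:p681h",
             "s:del143/145", "s:del212/212", "s:t95i", "s:n764k"],
    "B.1.427": ["s:d614g", "s:l452r", "s:s13i", "s:w152c"],
}


def union_of_mutations(characteristic: Dict[str, List[str]]) -> List[str]:
    """All mutations characteristic of at least one lineage, de-duplicated,
    in first-seen order. Upper-cased to match the example URL (assumption)."""
    seen: Dict[str, None] = {}
    for mutations in characteristic.values():
        for mutation in mutations:
            seen.setdefault(mutation.upper(), None)
    return list(seen)


def build_urls(characteristic: Dict[str, List[str]]) -> Dict[Tuple[str, str], str]:
    """Step 2: one URL per (lineage, mutation) pair."""
    urls = {}
    for lineage, mutation in product(characteristic, union_of_mutations(characteristic)):
        # Keep ':' and '/' unescaped so the URL looks like the example above.
        query = urlencode({"mutations": mutation, "pangolin_lineage": lineage}, safe=":/")
        urls[(lineage, mutation)] = f"{ENDPOINT}?{query}"
    return urls


urls = build_urls(CHARACTERISTIC)
print(len(urls))                 # 2 lineages x 18 distinct mutations = 36
print(urls[("BA.3", "S:A67V")])
```

The trade-off is many small requests instead of one very large one, which is exactly what would need to be measured before committing to this approach.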

First steps:

  • Profile whether this approach would actually improve speed for a realistic set of lineages (for instance, the default set of lineages on outbreak.info/compare-lineages)
  • If so, implement it in the front-end.
  • Alternative approaches are welcome too.
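
For the profiling step, a small timing harness along these lines might be enough. It is only a sketch: `time_strategy` and `compare` are hypothetical helpers, and the callables passed in are placeholders for whatever actually performs the current frequency = 0 query and the proposed per-mutation queries.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_strategy(fetch: Callable[[], object], repeats: int = 5) -> List[float]:
    """Wall-clock duration in seconds of each of `repeats` calls to `fetch`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def compare(strategies: Dict[str, Callable[[], object]], repeats: int = 5) -> Dict[str, float]:
    """Median duration per strategy; the median damps one-off network spikes."""
    return {name: median(time_strategy(fetch, repeats)) for name, fetch in strategies.items()}


# Placeholder callables; swap in functions that perform the real requests.
results = compare(
    {
        "current (frequency = 0)": lambda: None,
        "proposed (per mutation)": lambda: None,
    },
    repeats=3,
)
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

Repeated identical requests may be served from a cache somewhere along the way, so the first run of each strategy is probably worth looking at separately from the median.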

flaneuse • Nov 19 '22 00:11