[query] Hana / SEQR need support optimizing Hail Query code

Open • danking opened this issue 8 months ago • 38 comments

What happened?

Hana Snow is the engineer for SEQR.

Previously, SEQR used Elasticsearch as its datastore. Unfortunately, Elasticsearch was very expensive because, to get reasonable performance, SEQR indexed nearly every field. The ES index was huge, and the VM resources necessary to run an ES instance on that index were expensive (on the order of thousands of USD per month).

I've been supporting Hana as much as I can, but she needs someone who can be more dedicated and responsive than I can be.

She uses a k8s cluster. She has a SEQR frontend deployment. She also has a Hail deployment (possibly a StatefulSet). The Hail pod has an SSD mounted read-only. That SSD has all the SEQR data in Hail Table form. There are many tables with annotations (variant metadata, like "probability this variant is damaging" or "likely causes this to happen to the protein"). There are also "per-family" tables, each of which contains all the sequences within a single family. Many queries run directly against a particular family. Those tables are small and quick to read.

There's also one giant table containing all the sequences from all the families. That table is large and expensive to read. A lot of our engineering work has been around making sure queries against that table are fast.

Tim, at one point, had enough of her system running locally that he could experiment with queries on his laptop against his own SSD. He hacked on the queries themselves and on Hail itself until the bandwidth was high enough that the queries should complete in acceptable time on the full dataset. "Fast enough" varies, but generally a couple tens of seconds is OK.
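To make "a couple tens of seconds" checkable, here is a minimal timing sketch using only the Python standard library. The names `time_query`, `run_query`, and `budget_seconds` are hypothetical and not part of Hail or SEQR, and the 20 second default is only an assumed reading of that target.

```python
import time
from statistics import median


def time_query(run_query, repeats=3, budget_seconds=20.0):
    """Time a zero-argument callable and compare the median wall time to a budget.

    `run_query` is a hypothetical stand-in for whatever runs one SEQR query;
    `budget_seconds` defaults to 20 s as an assumed "couple tens of seconds".
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # the result is discarded; only wall time is measured
        durations.append(time.perf_counter() - start)
    med = median(durations)
    return {
        "durations": durations,
        "median": med,
        "within_budget": med <= budget_seconds,
    }


if __name__ == "__main__":
    # Placeholder workload. With Hail this might instead be something like
    # `lambda: ht.filter(...).collect()` on a table read from the SSD; the
    # actual SEQR query is not shown in this issue.
    report = time_query(lambda: sum(range(1_000_000)))
    print(report)
```

Repeating the run and taking the median is a design choice to dampen warm-up and page cache effects; the first run against a cold disk may still be the number that matters for users.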

The work here is to pair with Hana to diagnose performance issues and make changes until the queries are acceptably fast. The first thing I would do is update her to the latest Hail (with the array decoder improvement as well as the memory overhead stuff on which Daniel is working). Then, with Hana's help, test the timing of some queries. If the queries are still too slow, your options are:

  1. Check the log files and the IR. Are there unnecessary shuffles? Is the code really large? Can we do less work maybe?
  2. Have Hana help you replicate her setup locally. You just need a slice of the data and enough of SEQR to run a query. Now hook up a profiler. What's slow? Can we do something about that?
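For option 2, one rough starting point for "hook up a profiler" is Python's built-in `cProfile`. This is only a sketch: `profile_query` and `run_query` are hypothetical names, and `cProfile` only sees time spent in the Python process. Hail executes queries on the JVM, so backend time would show up as one opaque blocking call here and would need a JVM profiler to break down further.

```python
import cProfile
import io
import pstats


def profile_query(run_query, top=10):
    """Profile a zero-argument callable and return (result, text report).

    `run_query` is a hypothetical stand-in for the code that builds and
    executes one SEQR query against the local slice of data.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = run_query()
    finally:
        # Always stop profiling, even if the query raises.
        profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    # Placeholder workload standing in for a real query.
    _, report = profile_query(lambda: sorted(range(200_000), key=lambda x: -x))
    print(report)
```

If nearly all of the cumulative time lands in a single call that waits on the backend, that suggests the Python side is not the bottleneck and the IR and log files from option 1 are the better place to look.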

Version

0.2.124

Relevant log output

No response

danking, Oct 20 '23 22:10