pollen icon indicating copy to clipboard operation
pollen copied to clipboard

`calyx_depth.py`: use the eDSL, stop using odgi's Python bindings

Open anshumanmohan opened this issue 1 year ago • 4 comments

Currently, calyx_depth.py generates Calyx code

  1. Without the use of the Calyx eDSL (aka the builder library).
  2. With the use of the odgi Python bindings (it's not obvious to see, but it uses parse_data which in turn uses the Python bindings).

I propose to change both of these.

  1. Using the eDSL will lighten the code significantly.
  2. The odgi Python bindings have proved hard to use. Besides, the current code uses the bindings only to infer the dimensions of the graph, and we have in-house ways of doing that now.

These relatively minor gains aside, the true win is that, after this, we will truly be using odgi only as a source of inspiration and oracle-testing, not as part of our workflow.

We currently do something like:

  • graph.gfa --odgi build--> graph.og --parse_data.py--> graph.data
  • calyx_depth.py (with one call to parse_data.py to learn the graph's dimensions) --> calyx_depth.futil
  • calyx_depth.futil tangoes with graph.data in the usual Calyx way.

My proposal is something more like:

  • graph.gfa --pollen_data_gen--> graph.data
  • calyx_depth.py (with one call to mygfa to learn the graph's dimensions, and with calls to the Calyx eDSL for lightening) --> calyx_depth.futil
  • calyx_depth.futil tangoes with graph.data in the usual Calyx way.

The first piece of this is already there: I already generate the same graph.data using pollen_data_gen as one would have using the usual route. The third piece should work if the first two work.

The second piece decomposes into two unrelated parts:

  • [ ] #107
  • [ ] #106

anshumanmohan avatar Jun 19 '23 00:06 anshumanmohan

Wonderful; thank you for summarizing this so clearly! Indeed, both of these would be great advances that would "modernize" our hardware generation.

One small operational note: the eDSL adoption can happen incrementally, if it's a lot of work. That is, it's not necessary to flip everything over all at once; it suffices to convert individual pieces to use the builder library one at a time since it interoperates with the "plain" AST library pretty smoothly.

sampsyo avatar Jun 24 '23 01:06 sampsyo

Thanks Adrian! Dropping the "triage required" tag. This issue remains up for grabs.

anshumanmohan avatar Jun 29 '23 20:06 anshumanmohan

This looks good to me! I agree that switching over the builder library isn't a top priority, but it would certainly be nice to at some point.

I don't necessarily think it is a bad thing to use odgi in our hardware generation if it turns out that it's faster to transform a .gfa graph to an .og graph and then use odgi to get its dimensions, rather than just obtaining the dimensions from the .gfa file directly, but since .gfa files seem to be sort of a "standard representation" for pangenomic graphs, it would be good to at least transition exine to taking in .gfa files as input rather than .og files, which it currently expects. I also think it would be a win just to avoid making pollen users install odgi.

susan-garry avatar Jul 13 '23 05:07 susan-garry

Indeed. Sticking with .gfa files is the right thing to do here because (a) they are the "source of truth," (b) odgi is a difficult-to-install dependency (to put it lightly), and (c) performance at hardware-generation time is not very important.

sampsyo avatar Jul 13 '23 20:07 sampsyo