
The filesystem is not a database

Open ihodes opened this issue 9 years ago • 4 comments

Decided to make a rather general issue about a specific problem we talk about frequently. This spans Biokepi and Ketrew, but if resolved would likely be handled at the Ketrew level.

Ketrew/biokepi workflow nodes typically (given our file-centric workflows) rely on the filesystem for 1) checking that a product has been created as well as 2) checking if two products would be equivalent (and therefore merging those nodes/avoiding extra work).

  1. Checking that a product has been created has some issues:

    a. We don't use transactions, thereby ensuring we'll run into some order-of-operation problems. For example, I just killed a workflow node and then restarted it. This node has an on_failure_activates that is supposed to remove its intermediate product (a file); the problem was that my restarted node checked its condition ("does this product already exist?") and found it to be true (even though the product was actually incomplete, since I terminated the process writing to it), and then the product was removed by the on_failure_activates node.

  2. Checking whether two products are equivalent has some issues:

  • If the file name doesn't capture every aspect of a product's configuration that is substantive (i.e. could change the semantics of the file product), then two products which are not equivalent could be considered equivalent by Ketrew, and you'll have a "buggy" workflow.
  • One solution is to append a UUID to a product's filename and let Ketrew determine whether products are different using its internal representation of them/their specification. This could be (but shouldn't be) as simple as a hash of their JSON representation.
  • One problem with this that we've discussed is that some minor changes (e.g. changing the number of processes a tool runs with, or the amount of memory it runs with) to a workflow node's make process (the specification of how the product of that node is made) could show up in the representation of that node, making Ketrew falsely believe it has two different nodes and perform (potentially slow, expensive) computation it doesn't need to by starting a new node. This can be overcome by carefully splitting out configuration/options which do not actually change the product.
  • The advantage of this is that we prevent users from introducing bugs in their workflows by using information that Ketrew already has; the disadvantages:
    1. Workflow file names can become less explicit (they'll be less verbose, but with a UUID on the end of them)
    2. More disk space could be used (though I'd advocate for a production Ketrew server to rm old versions of files)
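The race in 1a can be made concrete with a small sketch. This is illustrative OCaml using only the standard library (the path and the `product_ready` helper are invented for the example, not Ketrew API): a bare "does the file exist?" condition cannot distinguish a complete product from a partial write, so a restarted node and a cleanup step can disagree about the product's state.

```ocaml
(* Hypothetical sketch of the 1a race: the filename is all the state we
   have, so existence-as-condition is a time-of-check/time-of-use bug. *)
let product_ready path =
  (* "Does this product already exist?" is the only check; it cannot tell
     a complete product from a partial write. *)
  Sys.file_exists path

let () =
  let path = Filename.concat (Filename.get_temp_dir_name ()) "intermediate.bam" in
  (* t0: the killed node leaves a partial file behind *)
  let oc = open_out path in
  output_string oc "partial";
  close_out oc;
  (* t1: the restarted node checks its condition -- true, so it skips work *)
  assert (product_ready path);
  (* t2: on_failure_activates cleanup runs and removes the "product" *)
  Sys.remove path;
  (* t3: the node that believed itself satisfied now points at nothing *)
  assert (not (product_ready path))
```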
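The "hash of their JSON representation" idea, together with the caveat about non-semantic options, could look something like the following sketch. All names here are invented for illustration (not Ketrew/Biokepi API); the point is that runtime options are excluded from the hash, so changing processors or memory does not produce a "new" product:

```ocaml
(* Hypothetical sketch: derive a product identity by hashing only the
   semantically relevant part of a node's specification. *)
type spec = {
  tool : string;
  reference : string;
  semantic_opts : (string * string) list;  (* options that change the product *)
  runtime_opts : (string * string) list;   (* processors, memory, ... *)
}

let product_id spec =
  (* runtime_opts are deliberately left out of the canonical form *)
  let canonical =
    String.concat ";"
      (spec.tool :: spec.reference
       :: List.map (fun (k, v) -> k ^ "=" ^ v)
            (List.sort compare spec.semantic_opts))
  in
  Digest.to_hex (Digest.string canonical)

let () =
  let base =
    { tool = "bwa"; reference = "b37";
      semantic_opts = ["algo", "mem"]; runtime_opts = ["procs", "8"] } in
  let more_procs = { base with runtime_opts = ["procs", "24"] } in
  (* same product despite a different runtime configuration *)
  assert (product_id base = product_id more_procs)
```

Hashing the raw JSON of the whole node would conflate these two specs; the careful split of semantic vs. runtime options is what makes the identity stable.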

Additional advantages of this change (towards using the Ketrew DB as the ultimate source of knowledge about workflow products) include

  1. Provenance tools would naturally fall out of this; we would need good (CLI, as well as integrated with the web GUI) tools to query the Ketrew DB
  2. We have this information already; we're just not using it
  3. Perhaps this would encourage us to find a more performant DB model of Ketrew nodes (this is a tough problem) https://github.com/hammerlab/ketrew/issues/423

Other disadvantages:

  1. More difficult to hack the system:
    • e.g. for b37decoy in Biokepi, I need to manually gunzip the file because of https://github.com/hammerlab/biokepi/issues/117 — this would be harder if Ketrew isn't just looking at files with a certain name. We would need to build in facilities (discussed above) to interact with the Ketrew DB.

ihodes avatar Apr 13 '16 15:04 ihodes

So I read this as: you want to use the DB managed by Ketrew to piggyback a filesystem overlay?

This seems like a cross-layer hack: it means that the tool making sure stuff is run properly is also doing higher-level data management.

Let's say Biokepi gets a second backend (partially) implemented. E.g. a subset of the Biokepi.EDSL is translated to a single Makefile or script to run many short steps together on a single node. How do you pass the info of what was already run by one backend to the other backend (Ketrew)?

Most other advantages there are not about the filesystem; they are about providing facilities to build more complex Condition.ts (check hashes, is-the-node-restarting/restartable, etc.). This can be in Ketrew.EDSL or, if not generic enough, in Biokepi or another library.


Also, about

More difficult to hack the system: e.g. for b37decoy in Biokepi, I need to manually gunzip the file because of hammerlab/biokepi#117

No, right now it is very easy to hack around the b37decoy problem: you just gunzip yourself, ignore gunzip's error, and leave the file there with the right filename. Next time it's needed, the node will succeed as `Already_done`. If you embed a mapping in Ketrew's DB between files and nodes, you will need a "register-my-hackish-workaround" command or something, that gives you back that filename to copy to. And if you want to share/reuse most of your Biokepi work-dir, all the users have to register the same hacks in their DB.

smondet avatar Apr 19 '16 21:04 smondet

Will respond to the "before the break" portion of this (some good stuff there), but wanted to clear up that we're on the same page wrt the difficulties of handling the b37 situation were we to use the Ketrew database in lieu of the filesystem; that's why I added that under the "Other disadvantages [of switching away from our current model]".

ihodes avatar Apr 19 '16 22:04 ihodes

I realized I never responded to the first part of this!

Most other advantages there are not about the filesystem, they are about providing facilities to build more complex Condition.ts (check hashes, is-the-node-restarting/restartable, etc).

I agree; this could be a good way to get part of the way there. Conditions which check a database instead of stating a filename would be a big step up already.

From there, it'd be easy to just name files with readable-prefix-UUID.extension, where that UUID would correspond to the product and its provenance, stored in the database.

It's not far from what we're doing already; the primary change for our existing file-centric workflow is basically a KEDSL.create_filename ~prefix product function (or related suite of functions) which registers a file and its provenance with Ketrew and gives us a safe filename to use. Then Ketrew, when running the node producing that product,

  1. Sees if the product has yet been registered (this would be false if e.g. the configuration changes, or the node hasn't run before)
  2. If it has, then checks to see if the condition is true (Condition.t now being the bridge between Ketrew-land and silicon-space/the cluster). If not, rerun; if so, we're done.
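A sketch of what such a `create_filename` could look like, with everything hypothetical: the function names, the provenance key, and the in-memory table standing in for Ketrew's DB (stdlib OCaml only, and a `Digest` of a random float standing in for a real UUID generator):

```ocaml
(* Hypothetical sketch of the proposed KEDSL.create_filename: register a
   product's provenance and get back a stable, UUID-suffixed filename. *)
module Registry = struct
  (* stand-in for Ketrew's DB; key: provenance spec, value: filename *)
  let table : (string, string) Hashtbl.t = Hashtbl.create 16
  let find provenance = Hashtbl.find_opt table provenance
  let register provenance filename = Hashtbl.replace table provenance filename
end

let fresh_uuid () =
  (* stand-in for a real UUID generator *)
  Digest.to_hex (Digest.string (string_of_float (Random.float 1e9)))

let create_filename ~prefix ~extension provenance =
  match Registry.find provenance with
  | Some filename -> filename  (* product already registered: reuse it *)
  | None ->
    let filename =
      Printf.sprintf "%s-%s.%s" prefix (fresh_uuid ()) extension in
    Registry.register provenance filename;
    filename

let () =
  let f1 = create_filename ~prefix:"b37-aligned" ~extension:"bam" "bwa-mem/b37" in
  let f2 = create_filename ~prefix:"b37-aligned" ~extension:"bam" "bwa-mem/b37" in
  (* same provenance -> same registered readable-prefix-UUID.extension file *)
  assert (f1 = f2)
```

The `Condition.t` for the node then checks the registered filename against the cluster, which is step 2 above: the registry answers "is this product known?", the condition answers "does it really exist out there?".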

Seems like a clean separation, and not a cross-layer hack if done like that?

ihodes avatar Sep 01 '16 19:09 ihodes

@ihodes yes, done like that it's well separated; that way it can/should be in Biokepi, not using Ketrew's DB.

smondet avatar Sep 01 '16 21:09 smondet