Refactor pipeline parsing
Right now bionode-watermill resolves a set of promises:

```js
const pipeline = join(task1, task2)
```

but it would be nice to have an object-like structure in the pipeline definition:

```js
const pipeline = { join: { task1, task2 } }
```
bionode-watermill would then parse this object to obtain the pipeline structure first, before actually executing it. This way it would be possible to predict the pipeline shape, inputs and outputs before running, and then, after running the pipeline, confirm that everything was properly set up and executed as expected. It could also greatly improve pipeline visualization, since we could render different colors for tasks that were run, are running and have ended.
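A minimal sketch of what that parsing step could look like, assuming a `{ join: [...] }` / `{ fork: [...] }` shape. The `buildDag` helper and the node/edge format are hypothetical, not existing watermill API, and it merges fork branches into shared downstream edges rather than multiplying downstream tasks, which simplifies watermill's actual fork semantics:

```js
// Hypothetical: walk a { join: [...] } / { fork: [...] } structure
// and collect a DAG description before anything executes.
function buildDag (node, dag = { nodes: [], edges: [] }, parent = null) {
  if (node.join) {
    // join runs tasks sequentially: chain each onto the previous tail
    let prev = parent
    for (const step of node.join) prev = buildDag(step, dag, prev).tail
    return { dag, tail: prev }
  }
  if (node.fork) {
    // fork runs branches in parallel: branches share the same parent
    const tails = node.fork.map(step => buildDag(step, dag, parent).tail)
    return { dag, tail: tails }
  }
  // leaf task: record a node plus edges from its parent(s)
  dag.nodes.push(node.name)
  const parents = [].concat(parent || [])
  for (const p of parents) dag.edges.push([p, node.name])
  return { dag, tail: node.name }
}
```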
Two points:
- We will need to be very precise about the use of objects and arrays; your example could easily be:

  ```js
  const pipeline = { join: [ task1, task2 ] }
  ```
- Last time I thought about this, I went and tried to convert an existing, somewhat complex pipeline into an object. It ended up being kind of verbose. So I'm partial to keeping the function style (it's nice and terse, almost like a DSL, and the user has to import specific functions which can each have their params validated, as opposed to writing manual keys in objects), but having it build an object representation internally: the "verbose" representation (a sketch of this follows below). That said, maybe the verbose object would be more appealing in YAML (though converting strings into JS functions is sketchy; scripts are fine).
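For what it's worth, the function style can already carry that internal object representation. Here is a sketch of a `join` that stays callable but also records its own shape; the `repr` property is hypothetical, and the execution part is heavily simplified compared to watermill's real `join`:

```js
// Hypothetical: an orchestrator that is called exactly like today's
// join(task1, task2), but also exposes an object representation.
const join = (...tasks) => {
  const pipeline = async state => {
    // sequential execution, roughly what join does today
    for (const task of tasks) state = await task(state)
    return state
  }
  // repr is built at composition time, before anything runs, so a DAG
  // could be derived without an object-literal pipeline definition
  pipeline.repr = { join: tasks.map(t => t.repr || { task: t.name }) }
  return pipeline
}
```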
PS: I think we also don't necessarily need an object representation to have a pre-computed DAG. Those two features do look like they go hand in hand, but it is probably also possible to construct a DAG from the orchestrator functions themselves (as the sketch above hints, the representation can be built at composition time). If that is possible, I would prefer to implement the new feature (pre-computed DAG) without overhauling other things.
Just one more thing: after this refactor, where the DAG can be obtained before running the pipeline, `fork` can also be refactored to work without that bunch of rules in `lib/orchestrators/join.js`, since we will have the tasks available before running the pipeline. The issue right now is that we can only fetch tasks when we run a given orchestrator, while everything outside the scope of that orchestrator is unavailable to handle, and `fork` needs to handle downstream tasks (tasks that run after `fork`) by multiplying them by the number of branches that `fork` has.
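If the full task list is known up front, that downstream multiplication could become a plain tree transform instead of runtime rules inside `join`. A hypothetical sketch: `expandFork` and the `parallel` wrapper are made-up names, and a real version would need to clone the duplicated tasks with fresh uids rather than reuse references:

```js
// Hypothetical: given a flat list of steps containing { fork: [a, b] },
// duplicate everything downstream of the fork once per branch,
// before execution starts.
function expandFork (steps) {
  const i = steps.findIndex(step => step.fork)
  if (i === -1) return steps
  const downstream = expandFork(steps.slice(i + 1))
  // each branch gets its own copy of the downstream tasks
  const branches = steps[i].fork.map(branch => [branch, ...downstream])
  return [...steps.slice(0, i), { parallel: branches }]
}

// e.g. expandFork([t1, { fork: [a, b] }, t2])
//   -> [t1, { parallel: [[a, t2], [b, t2]] }]
```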