Improved SnakeMake integration

Open sjdv1982 opened this issue 3 years ago • 0 comments

Seamless supports SnakeMake by letting SnakeMake build its DAG and then converting this DAG to a Seamless context. This is currently done with the snakemake2seamless command-line tool, partly because high-level macros (contexts with constructors) are not yet working.

SnakeMake always pulls, having as its target either a rule output or a set of files. If the target is a rule, it may not contain wildcards. Therefore, SnakeMake always has well-defined, statically known output files. This is not always so for inputs and intermediate results.

SnakeMake has two mechanisms to dynamically determine the input files of a rule. The "dynamic" flag delays the evaluation of a wildcard file pattern until runtime. It must be declared as the output of one rule, and, identically, as the input of one or more other rules. This mechanism is being deprecated in SnakeMake 6, in favor of checkpoint rules. Checkpoint rules are to be used together with input functions. If an input function tries to access a checkpoint rule, the input function is halted until the checkpoint rule has been evaluated, and then re-triggered. (Note that in all other cases, input functions are evaluated while the DAG is being built, so no special Seamless-side support for input functions is necessary.)

Seamless will never, ever support either of these dynamic mechanisms. If you need dynamic DAGs, you need to do the dynamic part in Seamless, letting it generate a (static-DAG) Snakefile if needed.

Example:

Snakefile 1 takes a static number of input files to create a single clustering file. Snakefile 1 can simply be wrapped in a Seamless macro that does the same as snakemake2seamless. The macro requires the target rule or file list, a Snakefile, and optionally an OUTPUTS list (see below).

Snakefile 2 splits the clustering file into a clusterX.list for each cluster X. It may be a single rule that generates all the outputs; in that case, it must depend on a list OUTPUTS, e.g. ["1", "2", "3"]. OUTPUTS must be generated dynamically by a custom Seamless transformer that reads the clustering file and counts the clusters.
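
The cluster-counting step of such a transformer could look roughly like the sketch below. It is plain Python only: the function name `count_clusters` is hypothetical, and it assumes a clustering file format with one non-empty line per cluster, which is not specified in this issue. Wiring the function into a Seamless transformer is left out.

```python
from typing import List


def count_clusters(clustering_text: str) -> List[str]:
    """Return the labels "1".."N" for a clustering file with N clusters.

    Assumption (not from the issue): each non-empty line describes one cluster.
    """
    n_clusters = sum(1 for line in clustering_text.splitlines() if line.strip())
    return [str(i) for i in range(1, n_clusters + 1)]


# Hypothetical clustering file with three clusters
clustering = "1 4 7\n2 5\n3 6 8 9\n"
OUTPUTS = count_clusters(clustering)
print(OUTPUTS)  # ['1', '2', '3']
```
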
Snakefile 2 can then be generated by a general-purpose transformer that takes in a rump Snakefile and the OUTPUTS list and adds 'OUTPUTS = ["1", "2", "3"]' at the top of the Snakefile. (This can be done using the same Seamless macro, which may take OUTPUTS as an optional input.)

Alternatively, the rule may selectively extract specific clusters. In that case, Snakefile 2 itself is static, but must be invoked with a list of target files rather than a target rule. This list of target files is what must be generated by a custom Seamless transformer. (Again, the same macro can execute it.)

Snakefile 3 generates clusterX.stat and clusterX.log for every cluster X. Snakefile 3 is static, but has a dynamic number of inputs and outputs. Again, you have the choice between generating OUTPUTS or generating the target files.

In all cases, the macro offers the option to pass either individual "files" in separate input pins, or to pass in a whole filesystem-like JSON, creating a binding for each input "file". The output is always a filesystem-like JSON.

Long-term improvements:

  • Support SnakeMake run-functions (Python code using the SnakeMake API) within rules, inside a static DAG.
  • Support SnakeMake inputs/outputs that are a file list, rather than a single file.
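
Returning to the Snakefile 2 example, the two alternatives (prepending OUTPUTS to a rump Snakefile, or generating a list of target files) could be sketched as follows. The function names, the rump Snakefile contents, the `clustering.txt` input and the `split_clusters.py` script are all hypothetical and serve only as illustration; the sketch does not show how the result would be handed to Seamless or SnakeMake.

```python
import json
from typing import List


def add_outputs_header(rump_snakefile: str, outputs: List[str]) -> str:
    """Prepend an 'OUTPUTS = [...]' line to a rump Snakefile.

    json.dumps of a list of strings is also a valid Python list literal.
    """
    header = "OUTPUTS = " + json.dumps(outputs)
    return header + "\n" + rump_snakefile


def target_files(outputs: List[str], pattern: str = "cluster{}.list") -> List[str]:
    """Build the explicit target file list for the selected clusters."""
    return [pattern.format(x) for x in outputs]


# Hypothetical rump Snakefile: a single rule whose outputs depend on OUTPUTS
RUMP = '''rule split:
    input: "clustering.txt"
    output: expand("cluster{x}.list", x=OUTPUTS)
    shell: "python split_clusters.py {input}"
'''

snakefile2 = add_outputs_header(RUMP, ["1", "2", "3"])
print(snakefile2.splitlines()[0])  # OUTPUTS = ["1", "2", "3"]
print(target_files(["1", "3"]))    # ['cluster1.list', 'cluster3.list']
```

The first function corresponds to the variant where Snakefile 2 is generated; the second corresponds to the variant where Snakefile 2 stays static and only the target list is dynamic.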

sjdv1982, Mar 23 '21 17:03