Textual input wildcard.

Open MatthewJM96 opened this issue 2 years ago • 0 comments

New Feature & Justifying Scenario

Scenarios arise where a process generates an arbitrary number of same-named files that are then passed as input to a subsequent process. The existing wildcards * and ? solve the problem of uniquely naming the links to, or copies of, these files in the downstream process. However, I am not aware of any way to leave the textual part of the file name pattern unspecified with a text wildcard.

This is technically possible with * when it appears alone as the file string in an input declaration. I would love for this to be made better defined, either through a separate character or by extending that property of * to inputs of the form seq* (for backwards-compatibility purposes, I imagine the former is preferable).

The motivation for this, rather than fully specifying the filename and using a numeric wildcard only to distinguish multiple instances of that filename, is two-fold. Firstly, it can be strongly preferable to retain some unique string in the filename that is only discovered at run-time (e.g. a sample ID), for the purposes of analysis outside of Nextflow or by subsequent workflows. Secondly, it makes the wildcard-based input method more robust when using modules you are perhaps not in control of, in a context where you might want to swap between modules that use slightly different naming conventions. Right now this has to be resolved with extra boilerplate for each naming convention.

Implementation

Let's say the feature were to use a new character and let's choose to represent this capability with @. Consider the following patterns used in an input declaration:

  1. "@.report"
  2. "*.report"
  3. "@?.report"
  4. "@*.report"
  5. "@???.report"
  6. "@*"
  7. "@?"
  8. "@???"

The first and second would behave the same for a single file. For multiple files, the first could either throw the "input file name collision" error or continue to behave the same as * (I prefer the former, as it gives more consistent behaviour).

When all inputs share a single filename, the third and fourth would behave identically: for some number of files named "X.report" they would give as input "X.report" in the case of one file, or "X1.report", "X2.report", "X3.report", etc. in the case of multiple files. The fifth would do the same but with the usual zero padding of '?': "X001.report", "X002.report", "X003.report", etc.

The final three are there for consistency in the limiting case: it shouldn't be necessary to specify anything about the filename at all. The input would simply take all filenames passed to it and append the integer as needed. For files named "X.report" this would yield "X.report1", "X.report2", "X.report3", and so on.

This behaviour of @ would also allow multiple filenames to be passed to the input, say some number of files named either "X.report" or "Y.report". The third and fourth would give as input "X.report" and "Y.report" in the case of one file of each, or "X1.report", "X2.report", "X3.report", "Y1.report", "Y2.report", etc. in the case of multiple files of each. The fifth would do the same but with the usual zero padding of '?': "X001.report", "X002.report", "X003.report", "Y001.report", "Y002.report", etc. The final three would look similar but with the incrementing integer placed at the end.
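
To make the proposed naming rules concrete, here is a minimal Python sketch of how the patterns listed above might resolve to staged file names. It is not Nextflow code and does not reflect any existing Nextflow API. The function name `stage_names` is hypothetical, and the sketch assumes my reading of the proposal: @ captures whatever text precedes the literal suffix, numbering is per captured name, a name that occurs only once is staged unchanged, and a bare @ with no counter raises the collision error for repeated names. Patterns with further wildcards after the counter are not handled.

```python
import re
from collections import Counter


def stage_names(pattern, source_names):
    """Resolve staged names under the proposed '@' semantics (hypothetical sketch)."""
    # Split the pattern into '@', an optional counter ('*' or a run of '?'),
    # and a literal suffix such as '.report'.
    match = re.fullmatch(r"@(\*|\?+)?(.*)", pattern)
    if match is None:
        raise ValueError(f"pattern must start with '@': {pattern!r}")
    counter, suffix = match.group(1), match.group(2)
    # A run of '?' zero-pads the index to its own length; '*' does not pad.
    pad = len(counter) if counter and counter[0] == "?" else 0

    # The text captured by '@' is the source name with the suffix removed.
    stems = []
    for name in source_names:
        if not name.endswith(suffix):
            raise ValueError(f"{name!r} does not match {pattern!r}")
        stems.append(name[: len(name) - len(suffix)])
    totals = Counter(stems)

    seen = Counter()
    staged = []
    for stem in stems:
        if totals[stem] == 1:
            # A unique name is staged unchanged (assumed reading of the proposal).
            staged.append(stem + suffix)
            continue
        if counter is None:
            # Bare '@' with repeated names: the preferred option in the proposal.
            raise ValueError(f"input file name collision: {stem + suffix}")
        seen[stem] += 1
        staged.append(f"{stem}{str(seen[stem]).zfill(pad)}{suffix}")
    return staged


if __name__ == "__main__":
    print(stage_names("@*.report", ["X.report"] * 3))
    print(stage_names("@???.report", ["X.report", "X.report", "Y.report", "Y.report"]))
    print(stage_names("@*", ["X.report", "X.report"]))
```

With these assumptions, "@*.report" applied to three files named "X.report" gives "X1.report", "X2.report", "X3.report", and "@*" applied to two such files gives "X.report1", "X.report2", matching the examples above.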

MatthewJM96, Jun 15 '22 14:06