dud Multiple stages per file

One of the features in DVC that I did like was foreach stages. These were useful because sometimes there are stages that run basically the same command with different parameters. It is better to keep these together because they will all be changed at the same time.

You may not want to support foreach constructs and the like, and that's fair. I can generate YAML files from Jinja or Jsonnet, but it's more of a pain to generate multiple separate files. So if a single file could hold multiple stages, this would become more powerful.

Describe the solution you'd like


stages:
   - name:
     ... all current stage schema ...

Allow a stage to be referenced like file.yaml:name
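
To make the proposed reference format concrete, here is a minimal sketch of how a file.yaml:name reference could be split into its two parts. The function name parse_stage_ref and the error handling are hypothetical; this syntax is only the proposal in this issue, not something Dud supports today.

```python
from typing import Optional, Tuple


def parse_stage_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split 'file.yaml:name' into (path, stage name).

    Hypothetical syntax from this proposal; not an existing Dud feature.
    A reference without ':name' returns None for the stage name.
    """
    # Split on the last colon so that the stage name is the final component.
    path, sep, name = ref.rpartition(':')
    if not sep:
        return ref, None
    if not path or not name:
        raise ValueError(f'malformed stage reference: {ref!r}')
    return path, name
```

A plain path such as file.yaml would still refer to a whole file, which keeps single-stage files working unchanged under this scheme.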

Describe alternatives you've considered

Writing each stage in multiple files
Programmatically generating multiple stage files

Aug 29 '23 06:08 indigoviolet

Actually, I realized that because the checksums are in the stage file, I cannot generate stage files easily. What do you think of this choice?

Aug 29 '23 17:08 indigoviolet

Addressing just the multiple stages per file request (not foreach) for now:

I wrestled with supporting this early on in development. Ultimately I decided against it to keep Dud simple--both in UI and implementation. In Dud's implementation, a stage's identifier is its file path. The Index object itself is represented as a map of file paths to stage contents. To change that would be both a large initial lift and a considerable maintenance burden. I'm not opposed to multi-stage files in principle, though. Pull requests are always welcome! 😜

That said, I think you're approaching the problem similarly to how I would recommend handling it in Dud today: Generate your stages programmatically. Yes, it's slightly more painful to create a file tree of stages, but not by much:

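import json

# One list of stage dicts, all written to a single (hypothetical) multi-stage file
stages = [
    {
        'name': 'train',
        # ... rest of the stage definition ...
    },
    # ... more stages ...
]
with open('multistage.yaml', 'w') as f:
    json.dump(stages, f)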

versus:

import json
from pathlib import Path

# Maps each stage's path (without extension) to its definition
stages = {
    'train': {},  # ... stage definition ...
    # ... more stages ...
}

for name, stage in stages.items():
    path = Path(name + '.yaml')
    # mkdir only necessary for file trees; can omit if flat dir
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w') as f:
        json.dump(stage, f)

As for handling checksums, you could either A) read existing files and update them with new values or B) skip existing files altogether.
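
Option B can be sketched roughly as below. This is a minimal illustration, not part of Dud: the helper name write_new_stages and its root parameter are made up for the example. It only writes stage files that do not exist yet, so any checksums already recorded in existing files are never overwritten.

```python
import json
from pathlib import Path


def write_new_stages(stages, root='.'):
    """Write each stage to <root>/<name>.yaml unless that file already exists.

    Hypothetical helper for illustration. Returns the list of paths written.
    Existing files are left untouched, so checksums stored in them survive.
    """
    written = []
    for name, stage in stages.items():
        path = Path(root) / (name + '.yaml')
        if path.exists():
            continue  # option B: skip files that may already hold checksums
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open('w') as f:
            json.dump(stage, f, indent=2)
        written.append(path)
    return written
```

The trade-off is that edits to the definition of an already-written stage are not picked up. Handling that is option A, which means reading the existing file and merging in the new values while keeping its checksums.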

FWIW: JSON is valid YAML, so Dud will happily read stages written like above (using json in the standard library). When Dud commits these stages, they'll be rewritten to YAML, though.

Aug 29 '23 22:08 kevin-hanselman

I think (for now, for me) that handling checksums in my own generation/merge code is a deal breaker. At that point I'm spending more time building an orchestrator tool than the pipeline :)

I would encourage you to reconsider the choice to put checksums into the stage file - that path seems to make things harder. For instance, a small inconvenience is that it reformats my yaml when it updates it (which is surprising).

Aug 29 '23 22:08 indigoviolet

> I would encourage you to reconsider the choice to put checksums into the stage file - that path seems to make things harder. For instance, a small inconvenience is that it reformats my yaml when it updates it (which is surprising).

FWIW: Here's my recorded rationale for going with a combined stage file over separate lock files. TL;DR: All sorts of edge cases arise in the lock file model that simply disappear in the single file model. Which file do you trust when either is out of date? How do you even know which is out of date? etc.

Aug 29 '23 22:08 kevin-hanselman

You have thought about this a lot longer than I have so I expect you are right, but I don't think I understood the rationale from that commit message. Naively, it seems to me that the current stage file is a kind of combination of the (abstract) stage definition file and the abstract lock file (the checksums). If you can manually edit the stage file after the checksums have been inserted, don't you have the same problems, except now you only have one modified timestamp?

Aug 29 '23 23:08 indigoviolet

I am closing this issue for now, as it's a big change to Dud's UX and also a big change to implement. As discussed above, I will continue to ponder if/how a separate lock file could work, as I do see a number of benefits in that model--multiple stages per file being one of them, and also #163. Thanks for the thoughtful discussion, @indigoviolet!

Jun 29 '24 14:06 kevin-hanselman
