Multiple stages per file
One of the features in DVC that I did like was foreach stages. These were useful because some stages are basically the same command run with different parameters. It is better to keep these together because they will all change at the same time.
You may not want to support foreach constructs and the like, and that's fair. I can generate YAML files with Jinja or Jsonnet, but it's more of a pain to generate multiple separate files. If a single file could hold multiple stages, that approach would become more powerful.
Describe the solution you'd like
stages:
   - name:
     ... all current stage schema ...
 
Allow a stage to be referenced as file.yaml:name.
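To make the proposal concrete, here is a minimal sketch of how a `file.yaml:name` reference could be resolved against a parsed multi-stage document. This is not something Dud supports; `split_stage_ref` and `select_stage` are hypothetical names, and the top-level `stages` list with a `name` key per entry is taken from the schema proposed above.

```python
def split_stage_ref(ref):
    """Split 'file.yaml:name' into ('file.yaml', 'name').

    Returns (ref, None) when no ':name' suffix is present.
    Assumption: the file path itself contains no ':' (true on POSIX-style
    relative paths, not for Windows drive letters).
    """
    path, sep, name = ref.rpartition(":")
    if not sep:
        return ref, None
    return path, name


def select_stage(document, name):
    """Return the stage called `name` from a parsed multi-stage document.

    `document` is assumed to look like {"stages": [{"name": ..., ...}, ...]}.
    """
    for stage in document["stages"]:
        if stage.get("name") == name:
            return stage
    raise KeyError(f"no stage named {name!r}")
```

With this shape, `split_stage_ref("pipelines/train.yaml:fit")` yields the file to load and the stage to pick from it, while a bare `train.yaml` would still refer to a single-stage file.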
Describe alternatives you've considered
- Writing each stage in its own file
- Programmatically generating multiple stage files
 
Actually, I realized that because the checksums are in the stage file, I cannot generate stage files easily. What do you think of this choice?
Addressing just the multiple stages per file request (not foreach) for now:
I wrestled with supporting this early on in development. Ultimately I decided against it to keep Dud simple--both in UI and implementation. In Dud's implementation, a stage's identifier is its file path. The Index object itself is represented as a map of file paths to stage contents. To change that would be both a large initial lift and a considerable maintenance burden. I'm not opposed to multi-stage files in principle, though. Pull requests are always welcome! 😜
That said, I think you're approaching the problem similarly to how I would recommend handling it in Dud today: Generate your stages programmatically. Yes, it's slightly more painful to create a file tree of stages, but not by much:
import json
stages = [
    {
        'name': 'train',
        ...
    },
    ...
]
with open('multistage.yaml', 'w') as f:
    json.dump(stages, f)
versus:
import json
from pathlib import Path
stages = {
    'train': {...},
    ...
}
for name, stage in stages.items():
    path = Path(name + '.yaml')
    # mkdir is only needed when stage names contain subdirectories; omit for a flat directory
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w') as f:
        json.dump(stage, f)
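As a runnable variant of the per-file approach, here is a rough sketch of a foreach-style expansion: one command template, one stage file per parameter. The `command` key, the `train/{param}.yaml` layout, and the helper names `foreach_stages` and `write_stages` are assumptions for illustration, not Dud's documented schema or API.

```python
import json
from pathlib import Path


def foreach_stages(command_template, params, path_template="train/{param}.yaml"):
    """Expand one command template into a {relative_path: stage} mapping."""
    return {
        path_template.format(param=p): {"command": command_template.format(param=p)}
        for p in params
    }


def write_stages(stages, root="."):
    """Write each stage to its own file under `root`; return the paths written."""
    written = []
    for rel_path, stage in stages.items():
        path = Path(root) / rel_path
        # Create intermediate directories for nested stage paths.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w") as f:
            json.dump(stage, f, indent=2)
        written.append(path)
    return written
```

For example, `write_stages(foreach_stages("python train.py --lr {param}", [0.1, 0.01]))` would produce `train/0.1.yaml` and `train/0.01.yaml`. Note that this version overwrites existing files unconditionally.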
As for handling checksums, you could either A) read existing files and update them with new values or B) skip existing files altogether.
FWIW: JSON is valid YAML, so Dud will happily read stages written like above (using json in the standard library). When Dud commits these stages, they'll be rewritten to YAML, though.
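Options A and B for checksums could look roughly like the sketch below. It assumes checksums live under a key literally named `checksum` at any nesting level of the stage file; that key name, along with `carry_over_checksums` and `write_stage`, is hypothetical and should be checked against a stage file Dud has actually committed. Because Dud rewrites committed stages as YAML, the default `json.loads` loader only works on files that have not been committed yet; afterwards a YAML parser such as PyYAML's `yaml.safe_load` would need to be passed as `load`.

```python
import json
from pathlib import Path

CHECKSUM_KEY = "checksum"  # assumption: hypothetical field name, verify before use


def carry_over_checksums(generated, existing):
    """Return a copy of `generated` with checksum values copied from `existing`."""
    if not isinstance(generated, dict) or not isinstance(existing, dict):
        return generated
    merged = {
        key: carry_over_checksums(value, existing.get(key))
        for key, value in generated.items()
    }
    # Keep the previously recorded checksum unless the generator set one itself.
    if CHECKSUM_KEY in existing and CHECKSUM_KEY not in merged:
        merged[CHECKSUM_KEY] = existing[CHECKSUM_KEY]
    return merged


def write_stage(path, generated, skip_existing=False, load=json.loads):
    """Write one stage file; return False if an existing file was skipped."""
    path = Path(path)
    if path.exists():
        if skip_existing:  # option B: leave existing files alone
            return False
        # Option A: merge recorded checksums into the new definition.
        generated = carry_over_checksums(generated, load(path.read_text()))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(generated, indent=2))
    return True
```

Whether a carried-over checksum is still treated as valid after the stage definition changes is up to Dud; this sketch does not attempt to detect stale values.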
I think (for now, for me), that handling checksums in my own generation/merge code is a deal breaker - at that point I'm spending more time building an orchestrator tool than the pipeline :)
I would encourage you to reconsider the choice to put checksums into the stage file - that path seems to make things harder. For instance, a small inconvenience is that it reformats my yaml when it updates it (which is surprising).
FWIW: Here's my recorded rationale for going with a combined stage file over separate lock files. TL;DR: All sorts of edge cases arise in the lock file model that simply disappear in the single file model. Which file do you trust when either is out of date? How do you even know which is out of date? etc.
You have thought about this a lot longer than I have so I expect you are right, but I don't think I understood the rationale from that commit message. Naively, it seems to me that the current stage file is a kind of combination of the (abstract) stage definition file and the abstract lock file (the checksums). If you can manually edit the stage file after the checksums have been inserted, don't you have the same problems, except now you only have one modified timestamp?
I am closing this issue for now, as it's a big change to Dud's UX and also a big change to implement. As discussed above, I will continue to ponder if/how a separate lock file could work, as I do see a number of benefits in that model--multiple stages per file being one of them, and also #163. Thanks for the thoughtful discussion, @indigoviolet!