dvc
commit: support granularity
Update: see https://github.com/iterative/dvc/issues/4297#issuecomment-739848041
~~version : 1.2.2~~
~~committing an output instead of a stage, when there is no dvc.yaml, results in:~~
~~ERROR: failed to commit data - 'dvc.yaml' does not exist.~~
~~in 0.94.0 it used to be:
ERROR: failed to commit data - bad DVC-file name 'data'. DVC-files should be named 'Dvcfile' or have a '.dvc' suffix (e.g. 'data.dvc').~~
~~Reproduction script:~~
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet
dvc init --quiet
echo data >> data
dvc add data
git add -A
git commit -am "init"
dvc commit data
NOTE:
When dvc.yaml is present, the error is more readable:
failed to commit data - "Stage 'data' not found inside 'dvc.yaml' file"
The problem here is that we are using collect instead of collect_granular when resolving the target. I don't see us using it pre-1.0, so this is likely just missing functionality. We probably just need to make the switch, but we need to keep in mind that one might supply a file inside a tracked directory, which needs to be handled appropriately (e.g. committing the whole stage is probably not the desired behaviour). CC @skshetry or am I missing something and we've done that for a reason?
@efiop, I don't think it was ever supported. Switching from collect to collect_granular would help, though it won't make the outputs themselves granular: it will still commit all of the stage's outputs.
@skshetry Thanks! So looks like collect_granular + filter_info support for commit and we should be set :slightly_smiling_face:
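To make the idea above concrete, here is a minimal sketch of granular target resolution. It is not DVC's actual code: `Stage`, `resolve_target`, and the returned `(stage, filter_info)` pairs are hypothetical names used only to illustrate how a target could be matched against stage names first and output paths second.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List, Optional, Tuple


@dataclass
class Stage:
    """Hypothetical stand-in for a stage: a name plus its output paths."""
    name: str
    outs: List[str] = field(default_factory=list)


def resolve_target(
    target: str, stages: List[Stage]
) -> List[Tuple[Stage, Optional[PurePosixPath]]]:
    """Resolve a target to (stage, filter_info) pairs.

    filter_info is None when the target names a whole stage, and a path
    when the target is an output or a file inside a tracked directory.
    """
    # First, try to match the target as a stage name.
    by_name = [(stage, None) for stage in stages if stage.name == target]
    if by_name:
        return by_name

    # Otherwise, treat the target as a path and look for an owning output.
    path = PurePosixPath(target)
    pairs = []
    for stage in stages:
        for out in stage.outs:
            out_path = PurePosixPath(out)
            if path == out_path or out_path in path.parents:
                pairs.append((stage, path))
    if not pairs:
        raise LookupError(f"no stage or output matches {target!r}")
    return pairs


stages = [Stage("data.dvc", ["data"]), Stage("train", ["model.pkl", "metrics"])]
by_output = resolve_target("data", stages)
by_stage = resolve_target("train", stages)
by_file = resolve_target("metrics/a.json", stages)
```

In this sketch, a caller that ignores `filter_info` would still commit every output of the matched stage, which is the limitation described above.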
Discussed with @skshetry that we could start with just output-level granularity, e.g.
dvc add datafile
...
dvc commit datafile
dvc add datadir
...
dvc commit datadir
More granular commits like
dvc add datadir
....
dvc commit datadir/subdir/file
would require a bit more work on the cache.save side, for it to accept filter_info and handle it properly, so we could do it as the next step after output-level granularity.
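The difference between output-level and file-level granularity can be sketched as a filter over the files of a directory output. This is an illustration only: `files_to_commit` and its parameters are hypothetical and do not reflect the real cache.save signature.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def files_to_commit(
    out: str, files: List[str], filter_info: Optional[str] = None
) -> List[str]:
    """Select which files of a directory output should be saved.

    out:         path of the tracked output, e.g. "datadir"
    files:       file paths relative to the output
    filter_info: optional path narrowing the commit; None means everything
    """
    out_path = PurePosixPath(out)
    wanted = PurePosixPath(filter_info) if filter_info is not None else None
    selected = []
    for rel in files:
        full = out_path / rel
        # Keep the file if no filter is given, if it is the filter itself,
        # or if it lives somewhere under the filtered directory.
        if wanted is None or full == wanted or wanted in full.parents:
            selected.append(str(full))
    return selected


files = ["a.txt", "subdir/file", "subdir/other"]
whole_output = files_to_commit("datadir", files)
same_as_whole = files_to_commit("datadir", files, "datadir")
one_file = files_to_commit("datadir", files, "datadir/subdir/file")
one_subdir = files_to_commit("datadir", files, "datadir/subdir")
```

Output-level granularity corresponds to the cases where the filter is absent or equals the output path; the file-level case is the one that would need filter_info to be threaded through the save step.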
@efiop I would like to start working on this.
@mbiesek Great! Let us know if you'll have any questions :slightly_smiling_face:
#6195 fixed the cache poisoning when trying to use granular commit. We do support granular commit, but the content that we write to dvc.yaml/.dvc still describes the complete set of files.
@skshetry Can we close this one as completed or maybe open a new issue/edit the title to reflect the remaining work you would like to address? I think the original issue is solved, right?