dvc
cli: add allow-missing flag to commit command
Fixes #10524
- [x] ❗ I have followed the Contributing to DVC checklist.
- [ ] 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
@skshetry Could you review this please, especially the --allow-missing description? If it looks fine, I will update the docs repo as well and raise a PR there.
@anunayasri It would be great to have a test for this. Also, could you confirm its semantics - that it keeps md5s of the missing dirs / files / deps as-is?
Looks like we don't have tests for the allow_missing flag for commit. We don't write tests for the CLI, as the commands are a thin wrapper over the Repo API. But since we are now going to depend on it, we need a test for it.
Do you mind adding a test for this API?
PTAL https://github.com/iterative/dvc/blob/main/tests/func/test_commit.py, which might help you see how to write tests. If you have any questions, please feel free to ask here or in discord.
@shcheklein There were no tests for cli commands, hence I had not added tests. It seems tests need to be added for the allow_missing logic in Repo.commit(). Will do so.
@skshetry Thanks. I will look into it.
@anunayasri hey, are there any updates on this? do you need some help on this? (thanks for the PR btw!)
Hi @shcheklein. Sorry. I was busy with office work and missed your comment. I have been chatting with @skshetry on discord DM. He helped me with my doubts. I am planning to work on the issue this week.
Hi. I need help with writing the test case for the `dvc commit` use case for a pipeline. I tried checking existing test cases like `dvc repro`, but I am not clear about a few things. Could you help or point me to a code piece/doc?
- In the test case, generate a `dvc.yaml` file. It should have a dep on a data file that is missing from the repo. How to generate this setup? I was trying out `allow-missing` through the example project `example-get-started`. `dvc commit --allow-missing dataset` should do nothing. I believe `allow-missing` is relevant to commit of pipelines.
- Could you explain the diff between `stage.save()` and `stage.commit()`?
cc: @skshetry @shcheklein
Hey @anunayasri. Sorry for the late response.
I took a look at the implementation, and it looks like --allow-missing is only implemented for the things that we need for experiments, where we run it in --force mode and only use it to commit data sources (see data_only=True).
Without --force, we show a prompt about changes to the users before committing. This is done through Stage.changed_entries(). It uses Output.workspace_status() to find out the changes, but how it has changed is getting lost inside it. So, we need to re-work that API and teach it about allow_missing.
Additionally, it does not work with missing dependencies at this time. It's due to "run cache," which fails on missing dependencies, but it should be easy to work around.
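To make the prompt problem above concrete, here is a rough sketch of what "teaching the status about allow_missing" could look like. It is a toy model only: `Change`, `classify`, and `entries_to_prompt_about` are hypothetical names invented for illustration and are not DVC's `Stage.changed_entries()` or `Output.workspace_status()`. The point is that once the status says how an entry changed (missing vs. modified), the caller can skip missing entries when `allow_missing` is set.

```python
import hashlib
import tempfile
from enum import Enum
from pathlib import Path


class Change(Enum):
    # Hypothetical categories; DVC's real status values may differ.
    UNCHANGED = "unchanged"
    MODIFIED = "modified"
    MISSING = "missing"


def classify(path: Path, recorded_md5: str) -> Change:
    """Report how a path differs from its recorded hash, not just whether."""
    if not path.exists():
        return Change.MISSING
    if hashlib.md5(path.read_bytes()).hexdigest() != recorded_md5:
        return Change.MODIFIED
    return Change.UNCHANGED


def entries_to_prompt_about(recorded: dict, root: Path, allow_missing: bool) -> list:
    """Return the entry names a confirmation prompt would list."""
    to_prompt = []
    for name, md5 in recorded.items():
        change = classify(root / name, md5)
        if change is Change.UNCHANGED:
            continue
        if change is Change.MISSING and allow_missing:
            continue  # missing entries are tolerated, so nothing to confirm
        to_prompt.append(name)
    return to_prompt


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    foo_md5 = hashlib.md5(b"foo").hexdigest()
    recorded = {"foo_copy1": foo_md5, "foo_copy2": foo_md5}
    # foo_copy1 is absent from the workspace; foo_copy2 has been modified.
    (root / "foo_copy2").write_text("foobar", encoding="utf-8")
    strict = entries_to_prompt_about(recorded, root, allow_missing=False)
    lenient = entries_to_prompt_about(recorded, root, allow_missing=True)
```

With a flat "changed / not changed" answer, the two calls could not be told apart; with the extra category, only the modified file is left to confirm in the lenient case.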
> Hi. I need help with writing the test case for the `dvc commit` use case for a pipeline. I tried checking existing test cases like `dvc repro`, but I am not clear about a few things. Could you help or point me to a code piece/doc?
> - In the test case, generate a `dvc.yaml` file. It should have a dep on a data file that is missing from the repo. How to generate this setup? I was trying out `allow-missing` through the example project `example-get-started`.
Here's an example test case for pipelines:
```python
def test_allow_missing(tmp_dir, dvc):
    tmp_dir.dvc_gen("foo", "foo")
    dvc.run(
        name="copy",
        cmd=["cp foo foo_copy1", "cp foo foo_copy2"],
        deps=["foo"],
        outs=["foo_copy1", "foo_copy2"],
    )

    # Remove one output: its recorded hash should be kept as-is.
    (tmp_dir / "foo_copy1").unlink()
    (stage,) = dvc.commit("dvc.yaml", allow_missing=True)
    outs = {out.def_path: out.hash_info.value for out in stage.outs}
    assert outs == {
        "foo_copy1": "acbd18db4cc2f85cedef654fccc4a4d8",
        "foo_copy2": "acbd18db4cc2f85cedef654fccc4a4d8",
    }

    # Modify the other output: only its hash should be updated.
    (tmp_dir / "foo_copy2").write_text("foobar", encoding="utf-8")
    (stage,) = dvc.commit("dvc.yaml", allow_missing=True)
    outs = {out.def_path: out.hash_info.value for out in stage.outs}
    assert outs == {
        "foo_copy1": "acbd18db4cc2f85cedef654fccc4a4d8",
        "foo_copy2": "3858f62230ac3c915f300c664312c63f",
    }
```
[!TIP] If you pass `force=True` to `dvc.commit` above, the test passes.
> `dvc commit --allow-missing dataset` should do nothing. I believe `allow-missing` is relevant to commit of pipelines.
--allow-missing dataset should do nothing if dataset is missing. That is correct. If it does exist, it should work the same way as without --allow-missing.
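As a rough illustration of those semantics, here is a self-contained toy model that runs without DVC. `commit_hashes` is a hypothetical helper, not a DVC function, and raising `FileNotFoundError` in the strict case is only an assumed stand-in for whatever error DVC actually reports. Existing paths are re-hashed exactly as they would be without the flag; missing paths keep their recorded md5.

```python
import hashlib
import tempfile
from pathlib import Path


def commit_hashes(recorded: dict, root: Path, allow_missing: bool = False) -> dict:
    """Toy model: re-hash paths that exist; keep recorded hashes for missing ones."""
    updated = {}
    for name, old_md5 in recorded.items():
        path = root / name
        if path.exists():
            # Same behaviour with or without allow_missing.
            updated[name] = hashlib.md5(path.read_bytes()).hexdigest()
        elif allow_missing:
            updated[name] = old_md5  # missing: leave the recorded hash as-is
        else:
            raise FileNotFoundError(name)  # assumed stand-in for DVC's error
    return updated


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    foo_md5 = hashlib.md5(b"foo").hexdigest()
    recorded = {"foo_copy1": foo_md5, "foo_copy2": foo_md5}
    # foo_copy1 is missing; foo_copy2 exists with new content.
    (root / "foo_copy2").write_text("foobar", encoding="utf-8")
    result = commit_hashes(recorded, root, allow_missing=True)
    try:
        commit_hashes(recorded, root, allow_missing=False)
        strict_failed = False
    except FileNotFoundError:
        strict_failed = True
```

This mirrors the second half of the example test: `foo_copy1` keeps the md5 of "foo", while `foo_copy2` picks up the md5 of "foobar".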
> - Could you explain the diff between `stage.save()` and `stage.commit()`?
save() hashes the output/dependency and updates its internal model (well, saves it), whereas commit() hashes the output/dependency, copies it to the cache, and checks it out back to the workspace (changing its internal model as little as possible).
See Output.save() and Output.commit() for more information.
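If it helps, the distinction can be sketched with a toy class. `ToyOutput` is a hypothetical stand-in, not DVC's `Output`, and it deliberately leaves out the checkout step (linking the cached file back into the workspace). It only shows that `save()` updates the recorded hash without touching the cache, while `commit()` fills the cache without touching the recorded hash.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class ToyOutput:
    """Hypothetical stand-in for an output; not DVC's Output class."""

    def __init__(self, path: Path, cache_dir: Path):
        self.path = path
        self.cache_dir = cache_dir
        self.recorded_md5 = None  # the "internal model"

    def _hash(self) -> str:
        return hashlib.md5(self.path.read_bytes()).hexdigest()

    def save(self) -> None:
        # Hash the file and record the result; the cache is not touched.
        self.recorded_md5 = self._hash()

    def commit(self) -> None:
        # Hash the file and copy it into the cache under its hash;
        # the recorded hash is left alone.
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(self.path, self.cache_dir / self._hash())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "foo").write_text("foo", encoding="utf-8")
    out = ToyOutput(root / "foo", root / "cache")

    out.save()
    cached_after_save = (root / "cache").exists()

    out.commit()
    cached_after_commit = (root / "cache" / out.recorded_md5).exists()
```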