dvc icon indicating copy to clipboard operation
dvc copied to clipboard

cli: add allow-missing flag to commit command

Open anunayasri opened this issue 1 year ago • 7 comments

Fixes #10524

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

anunayasri avatar Sep 13 '24 15:09 anunayasri

@skshetry Could you review this please, especially --allow-missing description. If it looks fine, I will update the docs repo as well and raise the PR.

anunayasri avatar Sep 13 '24 15:09 anunayasri

@anunayasri It would be great to have a test for this. Also, could you confirm its semantics - that it keeps md5s of the missing dirs / files / deps as-is?

shcheklein avatar Sep 15 '24 00:09 shcheklein

Looks like we don't have tests for allow_missing flag for commit. We don't write tests for CLI, as they are thin wrapper over Repo API. But since we are now going to depend on it, we need a test for it.

Do you mind adding a test for this API?

PTAL https://github.com/iterative/dvc/blob/main/tests/func/test_commit.py, which might help you how to write tests. If you have any questions, please feel free to ask here or in discord.

skshetry avatar Sep 16 '24 08:09 skshetry

@shcheklein There were no tests for cli commands, hence I had not added tests. It seems tests need to be added for the allow_missing logic in Repo.commit(). Will do so.

@skshetry Thanks. I will look into it.

anunayasri avatar Sep 16 '24 09:09 anunayasri

@anunayasri hey, are there any updates on this? do you need some help on this? (thanks for the PR btw!)

shcheklein avatar Sep 21 '24 21:09 shcheklein

Hi @shcheklein. Sorry. I was busy with office work and missed your comment. I have been chatting with @skshetry on discord DM. He helped me with my doubts. I am planning to work on the issue this week.

anunayasri avatar Sep 29 '24 18:09 anunayasri

Hi. I need help with writing the test case for dvc commit usecase for a pipeline. I tried checking the existing test case like dvc repro but I am not clear about few things. Could you help or point me to a code piece/doc.

  1. In the test case, generate a dvc.yaml file. It should have a dep on data file that is missing from the repo. How to generate this setup? I was trying out allow-missing through the example project example-get-started.
  2. dvc commit --allow-missing dataset should do nothing. I believe allow-missing is relevant to commit of pipelines.
  3. Could you explain the diff between stage.save() and stage.commit().

cc: @skshetry @shcheklein

anunayasri avatar Oct 16 '24 09:10 anunayasri

Hey @anunayasri. Sorry for the late response.

I took a look at the implementation, and it looks like --allow-missing is only implemented for the things that we need for experiments, where we run it in --force mode and only use it to commit data sources (see data_only=True).

Without --force, we show a prompt about changes to the users before committing. This is done through Stage.changed_entries(). It uses Output.workspace_status() to find out the changes, but how it has changed is getting lost inside it. So, we need to re-work that API and teach it about allow_missing.

Additionally, it does not work with missing dependencies at this time. It's due to "run cache," which fails on missing dependencies, but it should be easy to work around.


Hi. I need help with writing the test case for dvc commit usecase for a pipeline. I tried checking the existing test case like dvc repro but I am not clear about few things. Could you help or point me to a code piece/doc.

  1. In the test case, generate a dvc.yaml file. It should have a dep on data file that is missing from the repo. How to generate this setup? I was trying out allow-missing through the example project example-get-started.

Here's an example test case for pipelines:

def test_allow_missing(tmp_dir, dvc):
    tmp_dir.dvc_gen("foo", "foo")
    dvc.run(
        name="copy",
        cmd=["cp foo foo_copy1", "cp foo foo_copy2"],
        deps=["foo"],
        outs=["foo_copy1", "foo_copy2"],
    )

    (tmp_dir / "foo_copy1").unlink()

    (stage,) = dvc.commit("dvc.yaml", allow_missing=True)
    outs = {out.def_path: out.hash_info.value for out in stage.outs}
    assert outs == {
        "foo_copy1": "acbd18db4cc2f85cedef654fccc4a4d8",
        "foo_copy2": "acbd18db4cc2f85cedef654fccc4a4d8",
    }

    (tmp_dir / "foo_copy2").write_text("foobar", encoding="utf-8")
    (stage,) = dvc.commit("dvc.yaml", allow_missing=True)
    outs = {out.def_path: out.hash_info.value for out in stage.outs}
    assert outs == {
        "foo_copy1": "acbd18db4cc2f85cedef654fccc4a4d8",
        "foo_copy2": "3858f62230ac3c915f300c664312c63f",
    }

[!TIP] If you pass force=True to dvc.commit above, the test passes.

  1. dvc commit --allow-missing dataset should do nothing. I believe allow-missing is relevant to commit of pipelines.

--allow-missing dataset should do nothing if dataset is missing. That is correct. If it does exist, it should work the same way as without --allow-missing.

  1. Could you explain the diff between stage.save() and stage.commit().

save() hashes the output/dependency and updates its internal model (well, saves it), whereas commit hashes output/dependency, copies it to the cache, and checkout them back to the workspace (without changing its internal model as possible). See Output.save() and Output.commit() for more information.

skshetry avatar Oct 22 '24 10:10 skshetry