dvc data status: committed in index, not in cache entry shows up as modified

For the example in https://github.com/iterative/dvc/pull/8170#issuecomment-1227509867, output now looks like:

{
  "not_in_cache": [
    "dir/"
  ],
  "committed": {
    "modified": [
      "dir/"
    ]
  },
  "uncommitted": {
    "unknown": [
      "dir/file3",
      "dir/file4",
      "dir/file2"
    ]
  }
}

It still seems odd to me that the dir/ changes are considered committed when all the granular changes here are uncommitted. I understand that the md5 in the index dir.dvc doesn't match the Git HEAD, but not sure it makes sense to call the index committed when there is no cache entry for it.

Originally posted by @dberenbaum in https://github.com/iterative/dvc/issues/8189#issuecomment-1228672101

Aug 30 '22 06:08 skshetry

Thanks @skshetry! Any concerns with this or is it just a matter of figuring out the implementation?

Aug 30 '22 11:08 dberenbaum

I have been considering not-in-cache to be all edge-cases, we want to be correct first, that's why we tried to fix it in the best possible way. This with --add-commit and/or index not being in cache, is a niche-usecase or is rarer.

This all comes to what our definition of committed is. Looking at --no-commit, it might seem like it should be considered as uncommitted but it's working with a different definition of commit.

So, data status is working with index/dvcfiles, one version of a commit, whereas dvc add --no-commit is talking about cache-commit. There's another version of the commit that this issue is proposing: to consider both index-commit and cache-commit as one single commit, but only for the index, and that too only when the HEAD object exists.

Is index special? Yes, it is. It is the fulcrum of the data status that affects both directions in DVC committed changes and DVC uncommitted changes.

But do we want to apply a different meaning of commit to it? I am not so sure. Let's take an example of a freshly cloned repo, where you have no cache for the object in the HEAD and the index (well they are the same for a fresh repo). If we consider commit to be index-commit + cache-commit, then by definition, there are some uncommitted changes that in fact, are not true. My point is, to get the state of the working directory, we don't need objects in the cache, because well we are talking about "working directory". The cache is not strictly required.

I know, that if we considered Not in cache as an exclusive state, then a lot of this could have been simpler. But I think that way, it won't show the correct state of the working directory. With or without cache, I think the status should reflect the state of working directory which it does (with unknowns as a workaround but they are always there whether we report it or not).

I see the issue here, don't get me wrong. But it feels like we are optimizing for local optimum instead of a global one. I don't find this issue to be of much importance, the data status is not wrong or incorrect.

Yes, it just does not fit the workflow with --no-commit, but I see it as being a problem with it rather than data status. I see the following in docs:

--no-commit - do not store targets in the cache (the [.dvc](https://dvc.org/doc/user-guide/project-structure/dvc-files) file is still created). Use [dvc commit](https://dvc.org/doc/command-reference/commit) to finish the operation (similar to git commit after git add).

I don't see why it cannot generate a .dvc file without md5 hash. That way, they are not committed, which satisfies both commits, it's not index-committed yet and it's not cache-committed yet. This issue is the same as dvc.yaml file without dvc.lock or hashes recorded. See https://github.com/iterative/dvc/pull/8170#discussion_r953924440.

Again, I am not for or against the issue. I see the point. But I am not convinced that data status is the problem here. I need more strong opinion/push for this to be convinced that we have to apply different definiton of commit here to the index.

I was having trouble to have an organized thought here, because it feels like I am not able to weigh either of these options, but feel free to ask, I can elaborate.

Aug 30 '22 15:08 skshetry

Another thing is, in a distributed context, dvc data status will always show the same results/state regardless of the cache for a particular repo. Does it matter for the index? Probably not. But I think it corresponds well to the "state of the working directory", same state of the working directory -> same data status output.

Aug 30 '22 15:08 skshetry

There's another version of the commit that this issue is proposing: to consider both index-commit and cache-commit as one single commit, but only for the index, and that too only when the HEAD object exists.

I don't think it's specific to these cases. If the HEAD object doesn't exist, I would still expect the changes to be uncommitted. Can you think of an example where considering the commit to be both is problematic?

In general, I think changes depend on comparing to the Git HEAD, but whether those changes are committed depends on both the index and the cache since it's about whether DVC can recover those changes.

Yes, it just does not fit the workflow with --no-commit, but I see it as being a problem with it rather than data status.

I think we need separate flags for saving the md5 and not saving it, but both should be uncommitted.

The only example I can think of to get to a similar state without --no-commit is:

Make some change.
Commit it.
Drop the cache.

Example repro script:

echo file1 > file1
dvc add file1
git add .
git commit -m "add file"
echo file1 >> file1
dvc commit -f
dvc data status --show-json --granular --untracked --unchanged
rm -rf .dvc/cache
dvc data status --show-json --granular --untracked --unchanged

The changes would be considered committed before dropping the cache. I think they should be considered uncommitted after dropping the cache because DVC is no longer tracking them.

Aug 31 '22 20:08 dberenbaum

I think we need separate flags for saving the md5 and not saving it, but both should be uncommitted.

What's the usecase where you want to save the hash but not commit it?

Sep 01 '22 04:09 skshetry

dvc dvc copied to clipboard

data status: committed in index, not in cache entry shows up as modified

dvc
dvc copied to clipboard