dvc icon indicating copy to clipboard operation
dvc copied to clipboard

`data:status`: errors when data not in cache

Open mattseddon opened this issue 3 years ago • 10 comments

Bug Report

Description

On initially cloning a repository all tracked paths are shown in three categories:

  1. Not in cache
  2. Committed modified
  3. Uncommitted deleted

There can also be missing data, for example in the vscode-dvc demo project: training_metrics is missing. This is produced by DVCLive and is listed under the plots key in the dvc.yaml.

This breaks one of the workflows in the VS Code extension.

Reproduce

  1. git clone https://github.com/iterative/vscode-dvc
  2. cd vscode-dvc/demo
  3. python3 -m virtualenv .env
  4. source .env/bin/activate
  5. pip install -r requirements.txt
  6. dvc data status --show-json --with-dirs --granular --untracked --unchanged
{
  "not_in_cache": [
    "model.pt",
    "misclassified.jpg",
    "predictions.json"
  ],
  "committed": {
    "modified": [
      "model.pt",
      "misclassified.jpg",
      "predictions.json"
    ]
  },
  "uncommitted": {
    "deleted": [
      "model.pt",
      "misclassified.jpg",
      "predictions.json"
    ]
  }
}

Note: DVC may need to be updated to 2.15.0 in the requirements.txt file.

Expected

Paths are returned in not in cache key only. All paths are returned.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.15.0 (pip)
---------------------------------
Platform: Python 3.10.5 on macOS-12.2.1-arm64-arm-64bit
Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Additional Information (if any):

Also verified on dvc-2.15.1.dev13+g9ff18502.

mattseddon avatar Jul 28 '22 00:07 mattseddon

This bug is actually worse than I had thought. The entirety of data/MNIST/raw is also missing which means there is no data to experiment against.

mattseddon avatar Jul 29 '22 22:07 mattseddon

As discussed, the not in cache and deleted states are both expected here, so it's more of a product question of whether there is a better way to represent those paths in either DVC or VS Code.

However, modified is not expected and appears to show only when using --granular.

This bug is actually worse than I had thought. The entirety of data/MNIST/raw is also missing which means there is no data to experiment against.

Yeah, this should show in the same states as the other paths.

dberenbaum avatar Aug 02 '22 21:08 dberenbaum

If I understand correctly the two main situations where a user can get into this situation are:

  1. Repository has just been cloned and the user has never pulled.
  2. The user has dropped the cache AND deleted the file(s) from the workspace.

My reasoning for giving not in cache a higher priority is that IMO 1 is far more likely than 2. The reason that I see 2 as unlikely is that I may drop the cache but it is unlikely that I would drop the cache and then start deleting files that I actually want to remove permanently/commit.

Another reason is that in VS Code we have certain actions associated with each status. For uncommitted deleted the actions are checkout/commit and for not in cache the only option is pull. Checkout is not actually possible in the situation that the file is not in cache which means that we would like to encourage the user to populate the cache first before trying to checkout. I also realise that pull will fix both problems.

@dberenbaum LMK what you think of that. We also should discuss/try to solve hiding the concept of the cache completely from the user in VS Code, not sure if this is the right place to do that.

mattseddon avatar Aug 03 '22 16:08 mattseddon

Makes sense @mattseddon.

To be clear, DVC is prioritizing not in cache and suggesting pull:

$ dvc data status
Not in cache:
  (use "dvc pull <file>..." to update your local storage)
        model.pt
        training_metrics
        misclassified.jpg
        predictions.json
        data/MNIST/raw

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        deleted: model.pt
        deleted: misclassified.jpg
        deleted: predictions.json

We could consider not returning the uncommitted changes section in that case, but I'm not sure. I remember not in cache causing some headaches before, and I don't know what to do in other cases with not in cache:

  • If someone adds some content to a not-in-cache path, should we also show that the files have changed, or only show the status as not in cache?
  • Let's say someone has pulled or otherwise obtained the raw data but aren't familiar with dvc pipelines and run python train.py. They would end up with unchanged outputs that are still not in cache. So this would have the same status as a fresh clone even though the state of the repo is different. Is that okay?

dberenbaum avatar Aug 03 '22 20:08 dberenbaum

  • If someone adds some content to a not-in-cache path, should we also show that the files have changed, or only show the status as not in cache?

What happens if I then commit this change? Would DVC handle this case?

I did some testing of this situation with the vscode-dvc demo project and I can see that DVC does handle this case. However, dropping the cache after pulling all of the data gives these results:

❯ dvc data status             
Not in cache:
  (use "dvc pull <file>..." to update your local storage)
        model.pt
        training_metrics/
        misclassified.jpg
        predictions.json
        data/MNIST/raw/

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        added: training_metrics/
        added: data/MNIST/raw/

❯ dvc data status --with-dirs --granular
Not in cache:
  (use "dvc pull <file>..." to update your local storage)
        model.pt
        misclassified.jpg
        predictions.json

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: model.pt
        modified: misclassified.jpg
        modified: predictions.json

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        modified: model.pt
        modified: misclassified.jpg
        modified: predictions.json
        added: training_metrics/
        added: training_metrics/scalars/acc.tsv
        added: training_metrics/scalars/loss.tsv
        added: data/MNIST/raw/train-labels-idx1-ubyte.gz
        added: data/MNIST/raw/train-images-idx3-ubyte
        added: data/MNIST/raw/t10k-labels-idx1-ubyte.gz
        added: data/MNIST/raw/train-images-idx3-ubyte.gz
        added: data/MNIST/raw/t10k-images-idx3-ubyte
        added: data/MNIST/raw/t10k-images-idx3-ubyte.gz
        added: data/MNIST/raw/
        added: data/MNIST/raw/train-labels-idx1-ubyte
        added: data/MNIST/raw/t10k-labels-idx1-ubyte

IMO in order for us to cater for the use case that you mentioned, we would need to suppress all of the "added" information for files that are already present in the workspace. I.e if the workspace is up to date with what is expected by DVC and I added data/MNIST/raw/something-new then I would expect to only see that underneath the added key. Not the entirety of the folder as it is already tracked and present.

  • Let's say someone has pulled or otherwise obtained the raw data but aren't familiar with dvc pipelines and run python train.py. They would end up with unchanged outputs that are still not in cache. So this would have the same status as a fresh clone even though the state of the repo is different. Is that okay?

Starting with the workspace in the state that you described above and manually running python train.py gets us back to the exact same state that I showed above (not in cache + uncommitted added for data status and not in cache + uncommitted added + committed modified for data status --granular --with-dirs). I would again be ok to show these files as only not in cache. With any additions showing up.

On a side note. I am now trying to think of how we can hide this terminology on the VS Code side. If we change the terminology to something like "Remote Only" or "Available From Remote" maybe that would make it slightly more intuitive for new users and they would be more likely to know how to "fix the issue". The reason that I am suggesting this is that this "second layer" of the cache seems to be causing a lot of edge cases and if we direct the user flow in SCM towards always pulling first before working with the repository (under any circumstances) we might make things easier for ourselves and ultimately users. I.e the easiest thing to teach would be "if you see "Remote Only", your first action should be to pull".

mattseddon avatar Aug 05 '22 15:08 mattseddon

IMO in order for us to cater for the use case that you mentioned, we would need to suppress all of the "added" information for files that are already present in the workspace. I.e if the workspace is up to date with what is expected by DVC and I added data/MNIST/raw/something-new then I would expect to only see that underneath the added key. Not the entirety of the folder as it is already tracked and present.

Agreed, this looks like it's a bug (cc @skshetry).

I would again be ok to show these files as only not in cache.

My point here was that it seems we are suggesting the same status (not in cache) here regardless of whether the workspace files are missing or exist. Could that be confusing? And do we want to suggest to do a pull when something does exists in the workspace already?

If we change the terminology to something like "Remote Only" or "Available From Remote" maybe that would make it slightly more intuitive for new users and they would be more likely to know how to "fix the issue".

Yeah, I agree it overall makes sense to rephrase "not in cache" to something like this or "needs download/pull" or something (in fact, this category was originally proposed to be called "Not in DVC local storage" but it was more confusing during development since people are used to the "not in cache" terminology). Right now, we aren't checking if it actually exists in the remote, but there's not much point in the "not in cache" status except to suggest to pull.

dberenbaum avatar Aug 05 '22 16:08 dberenbaum

Would it make the most sense then to only show files as not in cache (or whatever terminology that we end up using) if they are not present in either the cache or the workspace? I.e uncommitted deleted + not in cache = remote only. Would that make sense?

mattseddon avatar Aug 05 '22 16:08 mattseddon

@dberenbaum

I've made updates on our end to make it possible to show resources across multiple categories in the SCM view.

In terms of decorations, we can only have 1 per resource so I will apply decorations to each resource using the following prioritization:

  1. Not In Cache
  2. Uncommitted
  3. Committed

The only other thing to note is that I have removed the inline checkout option for resources that are both not in cache and uncommitted.

mattseddon avatar Aug 08 '22 03:08 mattseddon

Not in cache is tricky, as there are too many scenarios to think of.

skshetry avatar Aug 08 '22 16:08 skshetry

For committed changes where there's no cache, we probably can just look at the hashes and tell they are unchanged, and only report not in cache, so we can avoid modified here.

I'll try to look into more scenarios where we can avoid the modified/deleted stuff. I have been using this defintion of not in cache for now:

not in cache: An output exists in the workspace, and the corresponding file hash in the dvc.lock or .dvc file is up to date, but there is no corresponding cache file or directory.

skshetry avatar Aug 08 '22 16:08 skshetry

However, modified is not expected and appears to show only when using --granular.

This bug is actually worse than I had thought. The entirety of data/MNIST/raw is also missing which means there is no data to experiment against.

Yeah, this should show in the same states as the other paths.

@skshetry Can you confirm that these issues are bugs that we can address immediately? How to show deleted paths that need to be pulled is a larger product question that is less urgent IMO (although also important).

dberenbaum avatar Aug 10 '22 13:08 dberenbaum

Any update on this? I've been delaying a release waiting for it.

mattseddon avatar Aug 23 '22 03:08 mattseddon

@mattseddon, I am working on it, hopefully by the end of this week. 🙂

skshetry avatar Aug 23 '22 03:08 skshetry