dvc icon indicating copy to clipboard operation
dvc copied to clipboard

`data status`: throws unexpected error if any dvc.yaml in the workspace is invalid

Open mattseddon opened this issue 3 years ago • 8 comments

Bug Report

Description

If DVC is unable to parse any of the dvc.yamls in the workspace dvc data status throws an unexpected error. E.g:

❯ dvc data status --with-dirs --granular --show-json
'./dvc.yaml' validation failed.

expected bool, in stages -> evaluate -> plots -> 1 -> evaluation/plots/confusion_matrix.json -> cache, line 45, column 9
  44 │   - evaluation/plots/confusion_matrix.json:                                                                                                                                        
  45 │   │   cache: false``                                                                                                                                                               
  46 │   │   template: confusion      

Reproduce

  1. open project
  2. modify any dvc.yaml so that it is unparsable
  3. run dvc data status --json

Expected

The extension needs a reliable way to identify this error so that:

  1. We know not to retry until the file has been edited
  2. We can convey this information to the user.

I think it would be best if data status --json returns an object of the form { error: { type: string, msg: string } } under these circumstances. We could agree on a contract for the type and the extension can adjust its approach accordingly.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.21.1 (pip)
---------------------------------
Platform: Python 3.8.9 on macOS-12.5.1-arm64-arm-64bit
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Original discussion is in https://github.com/iterative/vscode-dvc/issues/2254

dvc list . --dvc-only --recursive shows the same behaviour

verbose output from data status
~/p/example-get-started random-experiments !8 ?2 ❯ dvc data status --json --verbose                                                           ✘ 255
'./dvc.yaml' is invalid.

While scanning a simple key, in line 12, column 3
  12   featurize:''                                                                                                                                                                       

Could not find expected ':', in line 13, column 8
  13 │   cmd: python src/featurization.py data/prepared data/features                                                                                                                     
2022-08-26 12:32:37,777 ERROR: unable to read: 'dvc.yaml', YAML file structure is corrupted: while scanning a simple key
  in "<unicode string>", line 12, column 3
could not find expected ':'
  in "<unicode string>", line 13, column 8
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/funcy/flow.py", line 112, in reraise
    yield
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/utils/serialize/_yaml.py", line 29, in parse_yaml
    return yaml.load(text) or {}
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/ruamel/yaml/main.py", line 434, in load
    return constructor.get_single_data()
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 119, in get_single_data
    node = self.composer.get_single_node()
  File "_ruamel_yaml.pyx", line 706, in _ruamel_yaml.CParser.get_single_node
  File "_ruamel_yaml.pyx", line 724, in _ruamel_yaml.CParser._compose_document
  File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 773, in _ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 852, in _ruamel_yaml.CParser._compose_sequence_node
  File "_ruamel_yaml.pyx", line 904, in _ruamel_yaml.CParser._parse_next_event
ruamel.yaml.scanner.ScannerError: while scanning a simple key
  in "<unicode string>", line 12, column 3
could not find expected ':'
  in "<unicode string>", line 13, column 8

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/commands/data.py", line 102, in run
    status = self.repo.data_status(
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/data.py", line 287, in status
    uncommitted_diff = _diff_index_to_wtree(repo, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/data.py", line 181, in _diff_index_to_wtree
    for out in repo.index.outs:
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/index.py", line 118, in outs
    for stage in self:
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/index.py", line 92, in __iter__
    yield from self.stages
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/index.py", line 74, in stages
    return self.stage_collector.collect_repo(onerror=onerror)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 516, in collect_repo
    return list(self._collect_repo(onerror))
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 499, in _collect_repo
    new_stages = self.load_file(file_path)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 319, in load_file
    return self.load_all(path)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 291, in load_all
    stages = dvcfile.stages  # type: ignore
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/dvcfile.py", line 281, in stages
    data, _ = self._load()
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/dvcfile.py", line 150, in _load
    return self._load_yaml(**kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/dvcfile.py", line 161, in _load_yaml
    return strictyaml.load(
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/utils/strictyaml.py", line 296, in load
    raise YAMLSyntaxError(path, text, exc, rev=rev) from cause
dvc.utils.strictyaml.YAMLSyntaxError: unable to read: 'dvc.yaml', YAML file structure is corrupted
------------------------------------------------------------
2022-08-26 12:32:37,836 DEBUG: Analytics is enabled.
2022-08-26 12:32:37,860 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpthfj_0q0']'
2022-08-26 12:32:37,861 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpthfj_0q0']'

mattseddon avatar Aug 26 '22 02:08 mattseddon

I am fine with introducing the concept of partial results in dvc data status --json and exit with 2, but I am not sure if we need to do this in the CLI as well. At least I am not sure, how we can tell users that dvc.yaml is broken and how it is broken, which the above error provides in a nice way.

Failing on a broken file in the index and skipping from HEAD is an option for CLI.

For partial results and/or --json, we can skip broken files/stages from the index for uncommitted changes, and skip either of the broken files in the HEAD and the index. I'd love to make data status infallible, but we have to do that in a way that the results are not misleading and the error is obvious to the users.

skshetry avatar Aug 26 '22 07:08 skshetry

We can start with a simple way here, skip the broken file in the index, and assume that they are not broken in the HEAD. Since we are comparing between the workspace and the index using those same files, the results won't be misleading.

skshetry avatar Aug 26 '22 07:08 skshetry

If we want to start simple, I agree with https://github.com/iterative/vscode-dvc/issues/2254#issuecomment-1227678984 that we should start with no changes except passing at least some error info via JSON or some format that will work for VS Code.

For JSON, do we need to have a general guideline for all commands that errors should be output in JSON?

dberenbaum avatar Aug 27 '22 19:08 dberenbaum

For JSON, do we need to have a general guideline for all commands that errors should be output in JSON?

It would be good to head in that direction.

mattseddon avatar Aug 29 '22 00:08 mattseddon

@mattseddon Given https://github.com/iterative/vscode-dvc/pull/2301, is this still an issue? Do you still need JSON errors? What info do you need and how important is it?

dberenbaum avatar Aug 31 '22 12:08 dberenbaum

@mattseddon Given iterative/vscode-dvc#2301, is this still an issue? Do you still need JSON errors? What info do you need and how important is it?

For this particular issue, it would be really good to get a list of failures (unparsable files) back from the CLI, I could then show exactly what has failed in one of our views and add an on-click command to open any of those particular file(s) from the same view.

If I am being completely honest this is probably just nice to have, not a pressing priority. Others might feel differently but IMO the main reason the file would be corrupted is that you're currently editing it. In the case of VS Code it should be very obvious that it is broken due to schema validation.

mattseddon avatar Aug 31 '22 22:08 mattseddon

Thanks @mattseddon! I'm a bit confused because it looks to me like https://github.com/iterative/vscode-dvc/pull/2301 already shows the broken file and even the source of the error within the file. What am I missing?

dberenbaum avatar Sep 01 '22 00:09 dberenbaum

The missing part is having the path to the broken file(s) as a string (or list of strings) instead of having to parse the text to work it out.

In that example I display the entire stderr from the error message, VS Code truncates the string into the tree (probably at the first \n). When I hover over it the full string is shown. In order to get ./dvc.yaml from the error message, I would have to rely on the format of the string and/or use regex to parse it. I am not comfortable doing that.

mattseddon avatar Sep 01 '22 01:09 mattseddon

Patched on the extension side

mattseddon avatar Apr 07 '23 03:04 mattseddon