dvc
dvc copied to clipboard
`data status`: throws unexpected error if any dvc.yaml in the workspace is invalid
Bug Report
Description
If DVC is unable to parse any of the dvc.yamls in the workspace dvc data status throws an unexpected error. E.g:
❯ dvc data status --with-dirs --granular --show-json
'./dvc.yaml' validation failed.
expected bool, in stages -> evaluate -> plots -> 1 -> evaluation/plots/confusion_matrix.json -> cache, line 45, column 9
44 │ - evaluation/plots/confusion_matrix.json:
45 │ │ cache: false``
46 │ │ template: confusion
Reproduce
- open project
- modify any
dvc.yamlso that it is unparsable - run
dvc data status --json
Expected
The extension needs a reliable way to identify this error so that:
- We know not to retry until the file has been edited
- We can convey this information to the user.
I think it would be best if data status --json returns an object of the form { error: { type: string, msg: string } } under these circumstances. We could agree on a contract for the type and the extension can adjust its approach accordingly.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.21.1 (pip)
---------------------------------
Platform: Python 3.8.9 on macOS-12.5.1-arm64-arm-64bit
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Additional Information (if any):
Original discussion is in https://github.com/iterative/vscode-dvc/issues/2254
dvc list . --dvc-only --recursive shows the same behaviour
verbose output from data status
~/p/example-get-started random-experiments !8 ?2 ❯ dvc data status --json --verbose ✘ 255
'./dvc.yaml' is invalid.
While scanning a simple key, in line 12, column 3
12 featurize:''
Could not find expected ':', in line 13, column 8
13 │ cmd: python src/featurization.py data/prepared data/features
2022-08-26 12:32:37,777 ERROR: unable to read: 'dvc.yaml', YAML file structure is corrupted: while scanning a simple key
in "<unicode string>", line 12, column 3
could not find expected ':'
in "<unicode string>", line 13, column 8
------------------------------------------------------------
Traceback (most recent call last):
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/funcy/flow.py", line 112, in reraise
yield
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/utils/serialize/_yaml.py", line 29, in parse_yaml
return yaml.load(text) or {}
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/ruamel/yaml/main.py", line 434, in load
return constructor.get_single_data()
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 119, in get_single_data
node = self.composer.get_single_node()
File "_ruamel_yaml.pyx", line 706, in _ruamel_yaml.CParser.get_single_node
File "_ruamel_yaml.pyx", line 724, in _ruamel_yaml.CParser._compose_document
File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
File "_ruamel_yaml.pyx", line 773, in _ruamel_yaml.CParser._compose_node
File "_ruamel_yaml.pyx", line 852, in _ruamel_yaml.CParser._compose_sequence_node
File "_ruamel_yaml.pyx", line 904, in _ruamel_yaml.CParser._parse_next_event
ruamel.yaml.scanner.ScannerError: while scanning a simple key
in "<unicode string>", line 12, column 3
could not find expected ':'
in "<unicode string>", line 13, column 8
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/cli/command.py", line 22, in do_run
return self.run()
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/commands/data.py", line 102, in run
status = self.repo.data_status(
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/data.py", line 287, in status
uncommitted_diff = _diff_index_to_wtree(repo, **kwargs)
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/data.py", line 181, in _diff_index_to_wtree
for out in repo.index.outs:
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/index.py", line 118, in outs
for stage in self:
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/index.py", line 92, in __iter__
yield from self.stages
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/funcy/objects.py", line 28, in __get__
res = instance.__dict__[self.fget.__name__] = self.fget(instance)
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/index.py", line 74, in stages
return self.stage_collector.collect_repo(onerror=onerror)
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 516, in collect_repo
return list(self._collect_repo(onerror))
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 499, in _collect_repo
new_stages = self.load_file(file_path)
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 319, in load_file
return self.load_all(path)
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/repo/stage.py", line 291, in load_all
stages = dvcfile.stages # type: ignore
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/dvcfile.py", line 281, in stages
data, _ = self._load()
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/dvcfile.py", line 150, in _load
return self._load_yaml(**kwargs)
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/dvcfile.py", line 161, in _load_yaml
return strictyaml.load(
File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.8/site-packages/dvc/utils/strictyaml.py", line 296, in load
raise YAMLSyntaxError(path, text, exc, rev=rev) from cause
dvc.utils.strictyaml.YAMLSyntaxError: unable to read: 'dvc.yaml', YAML file structure is corrupted
------------------------------------------------------------
2022-08-26 12:32:37,836 DEBUG: Analytics is enabled.
2022-08-26 12:32:37,860 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpthfj_0q0']'
2022-08-26 12:32:37,861 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpthfj_0q0']'
I am fine with introducing the concept of partial results in dvc data status --json and exit with 2, but I am not sure if we need to do this in the CLI as well. At least I am not sure, how we can tell users that dvc.yaml is broken and how it is broken, which the above error provides in a nice way.
Failing on a broken file in the index and skipping from HEAD is an option for CLI.
For partial results and/or --json, we can skip broken files/stages from the index for uncommitted changes, and skip either of the broken files in the HEAD and the index.
I'd love to make data status infallible, but we have to do that in a way that the results are not misleading and the error is obvious to the users.
We can start with a simple way here, skip the broken file in the index, and assume that they are not broken in the HEAD. Since we are comparing between the workspace and the index using those same files, the results won't be misleading.
If we want to start simple, I agree with https://github.com/iterative/vscode-dvc/issues/2254#issuecomment-1227678984 that we should start with no changes except passing at least some error info via JSON or some format that will work for VS Code.
For JSON, do we need to have a general guideline for all commands that errors should be output in JSON?
For JSON, do we need to have a general guideline for all commands that errors should be output in JSON?
It would be good to head in that direction.
@mattseddon Given https://github.com/iterative/vscode-dvc/pull/2301, is this still an issue? Do you still need JSON errors? What info do you need and how important is it?
@mattseddon Given iterative/vscode-dvc#2301, is this still an issue? Do you still need JSON errors? What info do you need and how important is it?
For this particular issue, it would be really good to get a list of failures (unparsable files) back from the CLI, I could then show exactly what has failed in one of our views and add an on-click command to open any of those particular file(s) from the same view.
If I am being completely honest this is probably just nice to have, not a pressing priority. Others might feel differently but IMO the main reason the file would be corrupted is that you're currently editing it. In the case of VS Code it should be very obvious that it is broken due to schema validation.
Thanks @mattseddon! I'm a bit confused because it looks to me like https://github.com/iterative/vscode-dvc/pull/2301 already shows the broken file and even the source of the error within the file. What am I missing?
The missing part is having the path to the broken file(s) as a string (or list of strings) instead of having to parse the text to work it out.
In that example I display the entire stderr from the error message, VS Code truncates the string into the tree (probably at the first \n). When I hover over it the full string is shown. In order to get ./dvc.yaml from the error message, I would have to rely on the format of the string and/or use regex to parse it. I am not comfortable doing that.
Patched on the extension side