pull: Pulling a `.dvc` file with the option `-d` produces an AssertionError
Bug Report
Description
When I run:
dvc pull -d data/KPI/tracking-metrics.dvc
I get an AssertionError:
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/graph.py", line 31, in get_pipeline
assert len(found) == 1
Expected
I think using `-d` with a dvc file does not make sense, but I was expecting a clearer error message.
Also, I can't use the `-d` option to pull a stage and a dvc file in the same command (i.e. `dvc pull -d data/KPI/tracking-metrics.dvc a_stage_in_dvc_yaml` produces the AssertionError).
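To illustrate what a clearer message might look like: the traceback shows the failure is a bare `assert len(found) == 1` in `get_pipeline`. The sketch below is not DVC's code. It models each pipeline as a plain set of stage names, and `PipelineLookupError` and the message wording are hypothetical.

```python
class PipelineLookupError(Exception):
    """Hypothetical error type for illustration; not part of DVC."""


def get_pipeline(pipelines, node):
    # Assumption: each pipeline is modelled as a set of stage names.
    found = [p for p in pipelines if node in p]
    if len(found) != 1:
        # Report what was looked up and how many matches there were,
        # instead of failing on a bare assert.
        raise PipelineLookupError(
            f"expected exactly one pipeline containing {node!r}, "
            f"found {len(found)}"
        )
    return found[0]


pipelines = [{"prepare", "train"}, {"evaluate"}]
try:
    get_pipeline(pipelines, "data/KPI/tracking-metrics.dvc")
except PipelineLookupError as exc:
    print(exc)
```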
Additional Information (if any):
Full error message
2021-04-08 11:14:19,615 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/main.py", line 55, in main
ret = cmd.run()
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/command/data_sync.py", line 40, in run
glob=self.args.glob,
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/pull.py", line 38, in pull
run_cache=run_cache,
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/fetch.py", line 53, in fetch
revs=revs,
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 392, in used_cache
for stage, filter_info in pairs:
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 388, in <genexpr>
for target in targets
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/stage.py", line 421, in collect_granular
accept_group=accept_group,
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/stage.py", line 366, in collect
return _collect_with_deps(stages, graph or self.graph)
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/stage.py", line 55, in _collect_with_deps
res.update(collect_pipeline(stage, graph=graph))
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/graph.py", line 44, in collect_pipeline
pipeline = get_pipeline(get_pipelines(graph), stage)
File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/graph.py", line 31, in get_pipeline
assert len(found) == 1
AssertionError
------------------------------------------------------------
2021-04-08 11:14:19,720 DEBUG: Version info for developers:
DVC version: 2.0.14 (pip)
---------------------------------
Platform: Python 3.7.5 on Linux-4.19.121-linuxkit-x86_64-with-Ubuntu-18.04-bionic
Supports: http, https, s3
Cache types: hardlink, symlink
Cache directory: overlay on overlay
Caches: local
Remotes: s3
Workspace directory: overlay on overlay
Repo: dvc, git
@courentin, it looks like you have outputs duplication or some issue with graph correctness. Are you able to run `dvc dag` successfully?
Yes, `dvc dag` works. Do you want the output? (It is a bit messy as we have a lot of stages.)
@courentin, yeah it's wrong that it expects only one pipeline.
https://github.com/iterative/dvc/blob/353e4cf3746345ee94a2d907d16c688e97fb1bfd/dvc/repo/graph.py#L41-L45
We should just do something similar to the following instead:
return chain.from_iterable(nx.dfs_postorder_nodes(graph, source=stage) for stage in stages)
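A rough, dependency-free sketch of that suggestion, for readers who want to see the idea in isolation. `dfs_postorder` is a small stand-in for `nx.dfs_postorder_nodes`, the graph and stage names are made up, and it assumes edges point from a stage to the stages it depends on.

```python
from itertools import chain


def dfs_postorder(graph, source):
    """Yield every node reachable from source, dependencies first."""
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph.get(node, ()):
            yield from visit(dep)
        yield node

    yield from visit(source)


def collect_with_deps(graph, stages):
    # Walk each target separately, so targets living in different
    # (or single-node) pipelines are handled the same way.
    return set(chain.from_iterable(dfs_postorder(graph, s) for s in stages))


# Hypothetical graph: stage -> stages it depends on.
graph = {
    "evaluate": ["train"],
    "train": ["prepare"],
    "prepare": ["data.dvc"],
    "data.dvc": [],
    "other.dvc": [],
}
collected = collect_with_deps(graph, ["train", "other.dvc"])
```

Here a pipeline stage and a standalone `.dvc` file are collected in one call, without requiring that each target belong to exactly one pipeline.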
@skshetry Do you plan to look into this? Could we lower the priority?
@skshetry Lowering the priority of this one for now. Let me know if you think we need to bump it.
It looks like with version 2.20.0 this happens even without the `-d` flag.
DVC version: 2.20.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-glibc2.17
Supports:
azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: symlink
Cache directory: beegfs on beegfs_scratch
Caches: local
Remotes: s3
Workspace directory: beegfs on beegfs_scratch
Repo: dvc, git
As a consequence, it is impossible to download the input data. Is there any chance to fix this bug soon?
@dberenbaum could you please increase the priority? Due to this bug, using `dvc import` practically doesn't make sense.
@macio232 Do you have a reproducible example or can you describe in more detail the issue you are having?
@dberenbaum I have a `.dvc` file created with `dvc import` using an older version of DVC. Now, with 2.20.0, when I want to pull the data with `dvc pull`, I get the following error:
2022-09-13 10:54:47,645 DEBUG: Creating external repo (...)
2022-09-13 10:54:47,646 DEBUG: erepo: git clone '(...)' to a temporary dir
2022-09-13 10:54:51,901 DEBUG: Checking if stage '(...)' is in 'dvc.yaml'
2022-09-13 10:54:52,423 DEBUG: built tree 'object 5c679943aa5162b343d6c689e7db33c7.dir'
2022-09-13 10:54:52,425 DEBUG: Preparing to transfer data from '(...)' to '(...)/.dvc/cache'
2022-09-13 10:54:52,425 DEBUG: Preparing to collect status from '(...)/.dvc/cache'
2022-09-13 10:54:52,425 DEBUG: Collecting status from '(...)/.dvc/cache'
2022-09-13 10:54:52,428 DEBUG: Preparing to transfer data from 'memory://dvc-staging/6c3c0693a025d1b1c3e0768f64dec626da1fda3c0ef100217c4f7cacfb0895ef' to '(...)/.dvc/cache'
2022-09-13 10:54:52,428 DEBUG: Preparing to collect status from '(...)/.dvc/cache'
2022-09-13 10:54:52,428 DEBUG: Collecting status from '(...)/.dvc/cache'
2022-09-13 10:54:52,536 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
File "(...)/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "(...)/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
return self.run()
File "(...)/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
stats = self.repo.pull(
File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "(...)/lib/python3.9/site-packages/dvc/repo/pull.py", line 34, in pull
processed_files_count = self.fetch(
File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "(...)/lib/python3.9/site-packages/dvc/repo/fetch.py", line 96, in fetch
d, f = _fetch_partial_imports(
File "(...)/lib/python3.9/site-packages/dvc/repo/fetch.py", line 129, in _fetch_partial_imports
for stage in repo.partial_imports(targets, **kwargs):
File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 484, in partial_imports
return list(partial_imports)
File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 472, in <genexpr>
self.index.partial_imports(targets, recursive=recursive)
File "(...)/lib/python3.9/site-packages/dvc/repo/index.py", line 269, in partial_imports
return [stage for stage, _ in pairs if stage.is_partial_import]
File "(...)/lib/python3.9/site-packages/dvc/repo/index.py", line 269, in <listcomp>
return [stage for stage, _ in pairs if stage.is_partial_import]
File "(...)/lib/python3.9/site-packages/dvc/repo/index.py", line 263, in <genexpr>
self.stage_collector.collect_granular(
File "(...)/lib/python3.9/site-packages/dvc/repo/stage.py", line 438, in collect_granular
stages = self.collect(
File "(...)/lib/python3.9/site-packages/dvc/repo/stage.py", line 387, in collect
return _collect_with_deps(stages, graph or self.graph)
File "(...)/lib/python3.9/site-packages/dvc/repo/stage.py", line 64, in _collect_with_deps
res.update(collect_pipeline(stage, graph=graph))
File "(...)/lib/python3.9/site-packages/dvc/repo/graph.py", line 45, in collect_pipeline
pipeline = get_pipeline(get_pipelines(graph), stage)
File "(...)/lib/python3.9/site-packages/dvc/repo/graph.py", line 32, in get_pipeline
assert len(found) == 1
AssertionError
------------------------------------------------------------
2022-09-13 10:54:52,668 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2022-09-13 10:54:52,668 DEBUG: Removing '(...)/.7wqnWLCubyeWzriBzVS6C2.tmp'
2022-09-13 10:54:52,668 DEBUG: Removing '(...)/.7wqnWLCubyeWzriBzVS6C2.tmp'
2022-09-13 10:54:52,669 DEBUG: Removing '(...)/.7wqnWLCubyeWzriBzVS6C2.tmp'
2022-09-13 10:54:52,669 DEBUG: Removing '(...)/.dvc/cache/.fq2mwKNNqNycCdF5sQnZRP.tmp'
2022-09-13 10:54:52,672 DEBUG: Version info for developers:
DVC version: 2.20.0 (pip)
---------------------------------
Platform: Python 3.9.13 on Linux-5.15.0-47-generic-x86_64-with-glibc2.35
Supports:
azure (adlfs = 2022.7.0, knack = 0.10.0, azure-identity = 1.10.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-13 10:54:52,673 DEBUG: Analytics is enabled.
2022-09-13 10:54:52,750 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp4g__3xg7']'
2022-09-13 10:54:52,752 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp4g__3xg7']
The data is downloaded, but it isn't checked out. This isn't a problem when you just want to download data in a new location, because you can just use `dvc checkout`. Unfortunately, when you update `rev_lock` in the `.dvc` file, you want the `md5`s to be updated, which doesn't happen. Consequently, `dvc import` becomes unusable and I think this bug deserves a high priority.
Thanks @macio232! Could you provide the `dvc.yaml` or a simplified/redacted version of it? Trying to figure out how I could reproduce the issue.
@skshetry Any thoughts?
I think the problem is related to this dummy stage `data@*`, which I added to avoid problems with experiment execution in a temporary workspace (`dvc exp run --temp`): it fails at copying/symlinking data added with a `.dvc` file. Below is a shortened version of my `dvc.yaml` file, which is located in the `./projects/project_name/` directory:
vars:
  - seed: 2021
  - experiments:
      - 1
      - 2
      - 3
stages:
  data:
    foreach: ${experiments}
    do:
      wdir: .
      cmd: mkdir -p ./data/raw && cp -r ../../data/raw/${item} ./data/raw/${item}
      deps:
        - ../../data/raw/${item}
      outs:
        - ./data/raw/${item}
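For readers unfamiliar with `foreach`, here is a simplified sketch of how a definition like the one above turns into the `data@*` stages. This is not DVC's templating code: `expand_foreach` is a hypothetical helper that handles only a plain list and the `${item}` placeholder.

```python
def expand_foreach(name, items, template):
    """Return {generated stage name: stage definition} for a list foreach."""

    def substitute(value, item):
        # Recursively replace ${item} in strings, lists, and dicts.
        if isinstance(value, str):
            return value.replace("${item}", str(item))
        if isinstance(value, list):
            return [substitute(v, item) for v in value]
        if isinstance(value, dict):
            return {k: substitute(v, item) for k, v in value.items()}
        return value

    # Generated stages are named "<stage>@<item>".
    return {f"{name}@{item}": substitute(template, item) for item in items}


do = {
    "wdir": ".",
    "cmd": "mkdir -p ./data/raw && cp -r ../../data/raw/${item} ./data/raw/${item}",
    "deps": ["../../data/raw/${item}"],
    "outs": ["./data/raw/${item}"],
}
stages = expand_foreach("data", [1, 2, 3], do)
for stage_name, definition in stages.items():
    print(stage_name, definition["deps"], "->", definition["outs"])
```

Each generated stage depends on a path under `../../data/raw/`, that is, on data tracked outside the project directory.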
@macio232 Thanks for that. And what data is being imported? What does that `.dvc` file look like?
@dberenbaum a folder that is an output of a stage in a different repository
frozen: true
deps:
- path: data_path_in_source_repository
  repo:
    url: git@(...).git
    rev: master
    rev_lock: 3b2e5fc82db56e09d41add7ab230bf1f00a89927
outs:
- path: data_path_in_current_repository
  md5: 5c679943aa5162b343d6c689e7db33c7.dir
  size: 117544946
  nfiles: 358
md5: aec46d8c5dc43f767b6c68185f5030be
@macio232 Would you mind opening a separate issue since it seems like it's not clear if the issue is the same? Only the error message is the same as far as I understand, correct?
> I think the problem is related to this dummy stage `data@*`, which I added to avoid problems with experiments execution in a temporary workspace (`dvc exp run --temp`) - it fails at copying/symlinking data added with a `.dvc` file.
Can you dig into this a bit more? In that example above, what is the path of the `.dvc` file?
Closing as stale