pull: Pulling a `.dvc` file with the option `-d` produces an AssertionError

Open courentin opened this issue 3 years ago • 14 comments

Bug Report

Description

When I run:

dvc pull -d data/KPI/tracking-metrics.dvc

I got an AssertionError:

  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/graph.py", line 31, in get_pipeline
    assert len(found) == 1

Expected

I think using -d with a .dvc file does not make sense, but I was expecting a clearer error message. Also, I can't use the -d option to pull a stage and a .dvc file in the same command (i.e. dvc pull -d data/KPI/tracking-metrics.dvc a_stage_in_dvc_yaml produces the AssertionError).

Additional Information (if any):

Full error message
2021-04-08 11:14:19,615 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/main.py", line 55, in main
    ret = cmd.run()
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/command/data_sync.py", line 40, in run
    glob=self.args.glob,
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/pull.py", line 38, in pull
    run_cache=run_cache,
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/fetch.py", line 53, in fetch
    revs=revs,
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 392, in used_cache
    for stage, filter_info in pairs:
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/__init__.py", line 388, in <genexpr>
    for target in targets
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/stage.py", line 421, in collect_granular
    accept_group=accept_group,
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/stage.py", line 366, in collect
    return _collect_with_deps(stages, graph or self.graph)
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/stage.py", line 55, in _collect_with_deps
    res.update(collect_pipeline(stage, graph=graph))
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/graph.py", line 44, in collect_pipeline
    pipeline = get_pipeline(get_pipelines(graph), stage)
  File "/app/speech/.venv/lib/python3.7/site-packages/dvc/repo/graph.py", line 31, in get_pipeline
    assert len(found) == 1
AssertionError
------------------------------------------------------------
2021-04-08 11:14:19,720 DEBUG: Version info for developers:
DVC version: 2.0.14 (pip)
---------------------------------
Platform: Python 3.7.5 on Linux-4.19.121-linuxkit-x86_64-with-Ubuntu-18.04-bionic
Supports: http, https, s3
Cache types: hardlink, symlink
Cache directory: overlay on overlay
Caches: local
Remotes: s3
Workspace directory: overlay on overlay
Repo: dvc, git

courentin avatar Apr 08 '21 11:04 courentin

@courentin, it looks like you have outputs duplication or some issue with graph correctness. Are you able to do dvc dag successfully?

skshetry avatar Apr 09 '21 09:04 skshetry

Yes dvc dag works, do you want the output? (it is a bit messy as we have a lot of stages)

courentin avatar Apr 09 '21 10:04 courentin

@courentin, yeah it's wrong that it expects only one pipeline.

https://github.com/iterative/dvc/blob/353e4cf3746345ee94a2d907d16c688e97fb1bfd/dvc/repo/graph.py#L41-L45

We should just do something similar to the following instead:

return chain.from_iterable(nx.dfs_postorder_nodes(graph, source=stage) for stage in stages)
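A minimal sketch of the idea behind the suggestion above, assuming the stage graph is a plain mapping from each stage to the stages it depends on. The `dfs_postorder` helper is a hypothetical stdlib-only stand-in for `nx.dfs_postorder_nodes`, and the graph and stage names are made up for illustration; this is not DVC's actual code.

```python
from itertools import chain


def dfs_postorder(graph, source):
    """Yield nodes reachable from `source`, dependencies first (post-order)."""
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph.get(node, ()):
            yield from visit(dep)
        yield node

    yield from visit(source)


def collect_with_deps(graph, stages):
    """Collect each target stage plus everything it depends on.

    Nothing here assumes how many pipelines (connected components) a
    stage belongs to, so there is no `assert len(found) == 1` to trip.
    """
    ordered = chain.from_iterable(dfs_postorder(graph, s) for s in stages)
    return list(dict.fromkeys(ordered))  # de-duplicate, keep first-seen order


# Hypothetical graph: edges point from a stage to its dependencies.
graph = {
    "train": ["prepare"],
    "prepare": ["tracking-metrics.dvc"],
    "tracking-metrics.dvc": [],
    "unrelated": [],
}

collected = collect_with_deps(graph, ["tracking-metrics.dvc", "train"])
print(collected)
```

Walking the graph directly from each target means a `.dvc` file and a `dvc.yaml` stage can be passed together, and stages that are not reachable from any target (like `unrelated` here) are left out.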

skshetry avatar Apr 12 '21 10:04 skshetry

@skshetry Do you plan to look into this? Could we lower the priority?

dberenbaum avatar Feb 17 '22 20:02 dberenbaum

@skshetry Lowering the priority of this one for now. Let me know if you think we need to bump it.

dberenbaum avatar Apr 05 '22 17:04 dberenbaum

It looks like with version 2.20.0 this happens even without the -d flag.

DVC version: 2.20.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-glibc2.17
Supports:
	azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: symlink
Cache directory: beegfs on beegfs_scratch
Caches: local
Remotes: s3
Workspace directory: beegfs on beegfs_scratch
Repo: dvc, git

As a consequence, it is impossible to download the input data. Is there any chance to fix this bug soon?

macio232 avatar Aug 26 '22 14:08 macio232

@dberenbaum could you please increase the priority? Due to this bug dvc import practically doesn't make sense.

macio232 avatar Sep 08 '22 09:09 macio232

@macio232 Do you have a reproducible example or can you describe in more detail the issue you are having?

dberenbaum avatar Sep 08 '22 12:09 dberenbaum

@dberenbaum I have a .dvc file created with dvc import with an older version of DVC. Now, with 2.20.0, when I want to pull the data with dvc pull I get the following error:

2022-09-13 10:54:47,645 DEBUG: Creating external repo (...)
2022-09-13 10:54:47,646 DEBUG: erepo: git clone '(...)' to a temporary dir
2022-09-13 10:54:51,901 DEBUG: Checking if stage '(...)' is in 'dvc.yaml'
2022-09-13 10:54:52,423 DEBUG: built tree 'object 5c679943aa5162b343d6c689e7db33c7.dir'
2022-09-13 10:54:52,425 DEBUG: Preparing to transfer data from '(...)' to '(...)/.dvc/cache'
2022-09-13 10:54:52,425 DEBUG: Preparing to collect status from '(...)/.dvc/cache'
2022-09-13 10:54:52,425 DEBUG: Collecting status from '(...)/.dvc/cache'
2022-09-13 10:54:52,428 DEBUG: Preparing to transfer data from 'memory://dvc-staging/6c3c0693a025d1b1c3e0768f64dec626da1fda3c0ef100217c4f7cacfb0895ef' to '(...)/.dvc/cache'
2022-09-13 10:54:52,428 DEBUG: Preparing to collect status from '(...)/.dvc/cache'
2022-09-13 10:54:52,428 DEBUG: Collecting status from '(...)/.dvc/cache'
2022-09-13 10:54:52,536 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
  File "(...)/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "(...)/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "(...)/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "(...)/lib/python3.9/site-packages/dvc/repo/pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "(...)/lib/python3.9/site-packages/dvc/repo/fetch.py", line 96, in fetch
    d, f = _fetch_partial_imports(
  File "(...)/lib/python3.9/site-packages/dvc/repo/fetch.py", line 129, in _fetch_partial_imports
    for stage in repo.partial_imports(targets, **kwargs):
  File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 484, in partial_imports
    return list(partial_imports)
  File "(...)/lib/python3.9/site-packages/dvc/repo/__init__.py", line 472, in <genexpr>
    self.index.partial_imports(targets, recursive=recursive)
  File "(...)/lib/python3.9/site-packages/dvc/repo/index.py", line 269, in partial_imports
    return [stage for stage, _ in pairs if stage.is_partial_import]
  File "(...)/lib/python3.9/site-packages/dvc/repo/index.py", line 269, in <listcomp>
    return [stage for stage, _ in pairs if stage.is_partial_import]
  File "(...)/lib/python3.9/site-packages/dvc/repo/index.py", line 263, in <genexpr>
    self.stage_collector.collect_granular(
  File "(...)/lib/python3.9/site-packages/dvc/repo/stage.py", line 438, in collect_granular
    stages = self.collect(
  File "(...)/lib/python3.9/site-packages/dvc/repo/stage.py", line 387, in collect
    return _collect_with_deps(stages, graph or self.graph)
  File "(...)/lib/python3.9/site-packages/dvc/repo/stage.py", line 64, in _collect_with_deps
    res.update(collect_pipeline(stage, graph=graph))
  File "(...)/lib/python3.9/site-packages/dvc/repo/graph.py", line 45, in collect_pipeline
    pipeline = get_pipeline(get_pipelines(graph), stage)
  File "(...)/lib/python3.9/site-packages/dvc/repo/graph.py", line 32, in get_pipeline
    assert len(found) == 1
AssertionError
------------------------------------------------------------
2022-09-13 10:54:52,668 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2022-09-13 10:54:52,668 DEBUG: Removing '(...)/.7wqnWLCubyeWzriBzVS6C2.tmp'
2022-09-13 10:54:52,668 DEBUG: Removing '(...)/.7wqnWLCubyeWzriBzVS6C2.tmp'
2022-09-13 10:54:52,669 DEBUG: Removing '(...)/.7wqnWLCubyeWzriBzVS6C2.tmp'
2022-09-13 10:54:52,669 DEBUG: Removing '(...)/.dvc/cache/.fq2mwKNNqNycCdF5sQnZRP.tmp'
2022-09-13 10:54:52,672 DEBUG: Version info for developers:
DVC version: 2.20.0 (pip)
---------------------------------
Platform: Python 3.9.13 on Linux-5.15.0-47-generic-x86_64-with-glibc2.35
Supports:
	azure (adlfs = 2022.7.0, knack = 0.10.0, azure-identity = 1.10.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-13 10:54:52,673 DEBUG: Analytics is enabled.
2022-09-13 10:54:52,750 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp4g__3xg7']'
2022-09-13 10:54:52,752 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp4g__3xg7']

The data is downloaded, but it isn't checked out. This isn't a problem when you just want to download data in a new location because you can just use dvc checkout. Unfortunately, when you update rev_lock in the .dvc file, you want the md5s to be updated, which doesn't happen. Consequently, dvc import becomes unusable and I think this bug deserves a high priority.

macio232 avatar Sep 13 '22 09:09 macio232

Thanks @macio232! Could you provide the dvc.yaml or a simplified/redacted version of it? Trying to figure out how I could reproduce the issue.

@skshetry Any thoughts?

dberenbaum avatar Sep 13 '22 19:09 dberenbaum

I think the problem is related to this dummy stage data@*, which I added to avoid problems with experiments execution in a temporary workspace (dvc exp run --temp) - it fails at copying/symlinking data added with a .dvc file. Below is a shortened version of my dvc.yaml file which is located in ./projects/project_name/ directory:

vars:
  - seed: 2021
  - experiments:
    - 1
    - 2
    - 3

stages:
  data:
    foreach: ${experiments}
    do:
      wdir: .
      cmd: mkdir -p ./data/raw && cp -r ../../data/raw/${item} ./data/raw/${item}
      deps:
        - ../../data/raw/${item}
      outs:
        - ./data/raw/${item}
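
For readers unfamiliar with `foreach`, here is a rough sketch of how the template above turns into the `data@1`, `data@2`, `data@3` stages that `data@*` refers to. The `substitute` and `expand_foreach` helpers are hypothetical and only mimic the `${item}` substitution for illustration; they are not DVC's implementation.

```python
def substitute(value, item):
    """Recursively replace ${item} in strings, lists, and dicts."""
    if isinstance(value, str):
        return value.replace("${item}", str(item))
    if isinstance(value, list):
        return [substitute(v, item) for v in value]
    if isinstance(value, dict):
        return {k: substitute(v, item) for k, v in value.items()}
    return value


def expand_foreach(name, items, do):
    """Build one concrete stage per item, named `<name>@<item>`."""
    return {f"{name}@{item}": substitute(do, item) for item in items}


experiments = [1, 2, 3]
do = {
    "wdir": ".",
    "cmd": "mkdir -p ./data/raw && cp -r ../../data/raw/${item} ./data/raw/${item}",
    "deps": ["../../data/raw/${item}"],
    "outs": ["./data/raw/${item}"],
}

stages = expand_foreach("data", experiments, do)
for stage_name, stage in stages.items():
    print(stage_name, "deps:", stage["deps"], "outs:", stage["outs"])
```

Each generated stage depends on a path under `../../data/raw/`, outside the project directory that holds this `dvc.yaml`.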

macio232 avatar Sep 14 '22 07:09 macio232

@macio232 Thanks for that. And what data is being imported? What does that .dvc file look like?

dberenbaum avatar Sep 14 '22 20:09 dberenbaum

@dberenbaum a folder that is an output of a stage in a different repository

frozen: true
deps:
- path: data_path_in_source_repository
  repo:
    url: git@(...).git
    rev: master
    rev_lock: 3b2e5fc82db56e09d41add7ab230bf1f00a89927
outs:
- path: data_path_in_current_repository
  md5: 5c679943aa5162b343d6c689e7db33c7.dir
  size: 117544946
  nfiles: 358
md5: aec46d8c5dc43f767b6c68185f5030be

macio232 avatar Sep 15 '22 10:09 macio232

@macio232 Would you mind opening a separate issue since it seems like it's not clear if the issue is the same? Only the error message is the same as far as I understand, correct?

> I think the problem is related to this dummy stage data@*, which I added to avoid problems with experiments execution in a temporary workspace (dvc exp run --temp) - it fails at copying/symlinking data added with a .dvc file.

Can you dig into this a bit more? In that example above, what is the path of the .dvc file?

dberenbaum avatar Sep 16 '22 13:09 dberenbaum

Closing as stale

efiop avatar Oct 26 '23 18:10 efiop