`dvc list -R`: listing contents of data registry fails, when using recursive flag

Open hfrechen opened this issue 3 years ago • 5 comments

Bug Report

dvc list -R: listing contents of data registry fails, when using recursive flag

I set up a sample data registry that contains the data-generating code as a dvc.yaml pipeline. When listing the registry's content, dvc list works as intended and shows the top-level files and dirs of the repo. dvc list -R, however, fails with a TreeError. This seems similar to issue #7871, and a comment regarding TreeError can be found in the Discord channel as well.

Description

dvc list works as intended

$ dvc list -vv https://github.com/hfrechen/data-registry-test
2022-08-09 13:29:13,399 TRACE: Namespace(cprofile=False, yappi=False, viztracer=False, viztracer_depth=None, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=2, version=None, cd='.', cmd='list', url='https://github.com/hfrechen/data-registry-test', recursive=False, dvc_only=False, json=False, rev=None, path=None, func=<class 'dvc.commands.ls.CmdList'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2022-08-09 13:29:13,651 DEBUG: Creating external repo https://github.com/hfrechen/data-registry-test@None
2022-08-09 13:29:13,651 DEBUG: erepo: git clone 'https://github.com/hfrechen/data-registry-test' to a temporary dir
2022-08-09 13:29:14,895 TRACE: Context during resolution of stage data_import:               
{'data': {'path': {'raw': './data/raw', 'interim': './data/interim'}}}
2022-08-09 13:29:14,898 TRACE:    50.41 ms in collecting stages from /
2022-08-09 13:29:14,898 TRACE:     2.26 mks in collecting stages from /.dvc
2022-08-09 13:29:14,899 TRACE:     6.07 mks in collecting stages from /data
2022-08-09 13:29:14,899 TRACE:     4.54 mks in collecting stages from /data/interim
2022-08-09 13:29:14,899 TRACE:     3.82 mks in collecting stages from /data/raw
2022-08-09 13:29:14,899 TRACE:     3.96 mks in collecting stages from /src
.dvcignore
.gitignore
README.md
data
dvc.lock
dvc.yaml
params.yaml
src
2022-08-09 13:29:14,905 DEBUG: Analytics is enabled.
2022-08-09 13:29:14,958 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpai6phy17']'
2022-08-09 13:29:14,960 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpai6phy17']'

dvc list -R fails with TreeError

$ dvc list -R -vv https://github.com/hfrechen/data-registry-test
2022-08-09 13:29:37,067 TRACE: Namespace(cprofile=False, yappi=False, viztracer=False, viztracer_depth=None, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=2, version=None, cd='.', cmd='list', url='https://github.com/hfrechen/data-registry-test', recursive=True, dvc_only=False, json=False, rev=None, path=None, func=<class 'dvc.commands.ls.CmdList'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2022-08-09 13:29:37,180 DEBUG: Creating external repo https://github.com/hfrechen/data-registry-test@None
2022-08-09 13:29:37,181 DEBUG: erepo: git clone 'https://github.com/hfrechen/data-registry-test' to a temporary dir
2022-08-09 13:29:38,134 TRACE: Context during resolution of stage data_import:               
{'data': {'path': {'raw': './data/raw', 'interim': './data/interim'}}}
2022-08-09 13:29:38,137 TRACE:    32.35 ms in collecting stages from /
2022-08-09 13:29:38,137 TRACE:     2.09 mks in collecting stages from /.dvc
2022-08-09 13:29:38,137 TRACE:     6.68 mks in collecting stages from /data
2022-08-09 13:29:38,138 TRACE:     4.53 mks in collecting stages from /data/interim
2022-08-09 13:29:38,138 TRACE:     3.66 mks in collecting stages from /data/raw
2022-08-09 13:29:38,138 TRACE:     3.51 mks in collecting stages from /src
2022-08-09 13:29:38,156 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/cli/command.py", line 36, in do_run
    return self.run()
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/commands/ls/__init__.py", line 31, in run
    entries = Repo.ls(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/ls.py", line 46, in ls
    ret = _ls(repo, path, recursive, dvc_only)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/ls.py", line 68, in _ls
    for root, dirs, files in fs.walk(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 389, in walk
    listing = self.ls(path, detail=True, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/fs/dvc.py", line 335, in ls
    for entry in dvc_fs.ls(dvc_path, detail=False):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 318, in ls
    return self.fs.ls(path, detail=detail)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/fs.py", line 82, in ls
    for name in self.index.ls(prefix=root_key)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/index.py", line 130, in ls
    raise TreeError
dvc_data.objects.tree.TreeError
------------------------------------------------------------
2022-08-09 13:29:38,528 DEBUG: Version info for developers:
DVC version: 2.17.0 (conda)
---------------------------------
Platform: Python 3.9.13 on Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Supports:
        hdfs (fsspec = 2022.7.1, pyarrow = 8.0.1),
        webhdfs (fsspec = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.7.0),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.7.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: xfs on /dev/mapper/centos-root
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-08-09 13:29:38,529 DEBUG: Analytics is enabled.
2022-08-09 13:29:38,591 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp59k200wt']'
2022-08-09 13:29:38,595 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp59k200wt']'

This seems to affect dvc import as well:

$ dvc import -vv https://github.com/hfrechen/data-registry-test -o data/interim data/interim    
2022-08-09 13:30:18,116 TRACE: Namespace(cprofile=False, yappi=False, viztracer=False, viztracer_depth=None, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=2, version=None, cd='.', cmd='import', url='https://github.com/hfrechen/data-registry-test', path='data/interim', out='data/interim', rev=None, file=None, no_exec=False, desc=None, jobs=None, func=<class 'dvc.commands.imp.CmdImport'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2022-08-09 13:30:18,402 TRACE:    66.82 mks in collecting stages from /home/dev/projects/data-consumer
2022-08-09 13:30:18,402 TRACE:    73.89 mks in collecting stages from /home/dev/projects/data-consumer/data
2022-08-09 13:30:18,402 TRACE:     1.97 mks in collecting stages from /home/dev/projects/data-consumer/data/interim
2022-08-09 13:30:18,677 DEBUG: Removing output 'data/interim/interim' of stage: 'data/interim/interim.dvc'.
2022-08-09 13:30:18,677 DEBUG: Removing '/home/dev/projects/data-consumer/data/interim/interim'
Importing 'data/interim (https://github.com/hfrechen/data-registry-test)' -> 'data/interim/interim'
2022-08-09 13:30:18,678 DEBUG: Computed stage: 'data/interim/interim.dvc' md5: '1e3c54ddb027f31952ea5d2c65f3ed8e'
2022-08-09 13:30:18,678 DEBUG: 'md5' of stage: 'data/interim/interim.dvc' changed.
2022-08-09 13:30:18,679 DEBUG: Creating external repo https://github.com/hfrechen/data-registry-test@None
2022-08-09 13:30:18,679 DEBUG: erepo: git clone 'https://github.com/hfrechen/data-registry-test' to a temporary dir
2022-08-09 13:30:19,651 DEBUG: Checking if stage '/data/interim' is in 'dvc.yaml'            
2022-08-09 13:30:19,687 TRACE: Context during resolution of stage data_import:
{'data': {'path': {'raw': './data/raw', 'interim': './data/interim'}}}
2022-08-09 13:30:19,690 TRACE:    37.28 ms in collecting stages from /
2022-08-09 13:30:19,690 TRACE:     2.45 mks in collecting stages from /.dvc
2022-08-09 13:30:19,690 TRACE:     7.00 mks in collecting stages from /data
2022-08-09 13:30:19,690 TRACE:     5.64 mks in collecting stages from /data/interim
2022-08-09 13:30:19,690 TRACE:     3.54 mks in collecting stages from /data/raw
2022-08-09 13:30:19,691 TRACE:     3.58 mks in collecting stages from /src
2022-08-09 13:30:19,697 ERROR: failed to import 'data/interim' from 'https://github.com/hfrechen/data-registry-test'. - The path 'data/interim' does not exist in the target repository 'https://github.com/hfrechen/data-registry-test' neither as a DVC output nor as a Git-tracked file.: 
------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 134, in _get_used_and_obj
    object_store, _, obj = build(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/build.py", line 245, in build
    meta, obj = _build_tree(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/build.py", line 123, in _build_tree
    for root, _, fnames in walk_iter:
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 389, in walk
    listing = self.ls(path, detail=True, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/fs/dvc.py", line 335, in ls
    for entry in dvc_fs.ls(dvc_path, detail=False):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 318, in ls
    return self.fs.ls(path, detail=detail)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/fs.py", line 82, in ls
    for name in self.index.ls(prefix=root_key)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/index.py", line 130, in ls
    raise TreeError
dvc_data.objects.tree.TreeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/commands/imp.py", line 15, in run
    self.repo.imp(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/imp.py", line 6, in imp
    return self.imp_url(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 156, in run
    return method(repo, *args, **kw)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/imp_url.py", line 83, in imp_url
    stage.run(jobs=jobs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/decorators.py", line 38, in rwlocked
    return call()
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/__init__.py", line 535, in run
    self._sync_import(dry, force, jobs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/decorators.py", line 38, in rwlocked
    return call()
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/__init__.py", line 559, in _sync_import
    sync_import(self, dry, force, jobs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/stage/imports.py", line 43, in sync_import
    stage.deps[0].download(stage.outs[0], jobs=jobs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 68, in download
    for odb, objs in self.get_used_objs().items():
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 97, in get_used_objs
    used, _ = self._get_used_and_obj(**kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/dependency/repo.py", line 141, in _get_used_and_obj
    raise PathMissingError(
dvc.exceptions.PathMissingError: The path 'data/interim' does not exist in the target repository 'https://github.com/hfrechen/data-registry-test' neither as a DVC output nor as a Git-tracked file.
------------------------------------------------------------
2022-08-09 13:30:19,706 DEBUG: Analytics is enabled.
2022-08-09 13:30:19,914 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp3p8liv51']'
2022-08-09 13:30:19,917 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp3p8liv51']'

Reproduce

I created a sample repo https://github.com/hfrechen/data-registry-test to try it out. Commands 2 and 3 below fail for me on different combinations of Ubuntu and Windows clients with DVC 2.15, 2.16 and 2.17.

  1. dvc list -vv https://github.com/hfrechen/data-registry-test
  2. dvc list -R -vv https://github.com/hfrechen/data-registry-test
  3. dvc import https://github.com/hfrechen/data-registry-test -o data/interim data/interim -vv

Expected

dvc list -R should list all subdirectories and files contained in the data registry.

Environment information

I could reproduce this in a clean conda environment created with just conda create -n dvc -c conda-forge dvc

Output of dvc doctor:

$ dvc doctor
DVC version: 2.17.0 (conda)
---------------------------------
Platform: Python 3.9.13 on Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Supports:
        hdfs (fsspec = 2022.7.1, pyarrow = 8.0.1),
        webhdfs (fsspec = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.7.0),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.7.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: xfs on /dev/mapper/centos-root
Repo: dvc, git

Additional Information (if any):

I tried to debug this and tracked it down to this line: https://github.com/iterative/dvc-data/blob/main/src/dvc_data/index.py#L138. Looking at the state of the local objects, entry.obj seems to be None:

(Pdb) prefix
('data', 'interim')
(Pdb) self._trie
Trie([(('data', 'interim'), DataIndexEntry(meta=<dvc_data.hashfile.meta.Meta object at 0x7f7e7b30a840>, obj=None, hash_info=<dvc_data.hashfile.hash_info.HashInfo object at 0x7f7e7b23c0c0>, odb=<dvc_data.db.local.LocalHashFileDB object at 0x7f7e7b2cf3a0>, remote=None, loaded=None))])
(Pdb) entry
DataIndexEntry(meta=<dvc_data.hashfile.meta.Meta object at 0x7f7e7b30a840>, obj=None, hash_info=<dvc_data.hashfile.hash_info.HashInfo object at 0x7f7e7b23c0c0>, odb=<dvc_data.db.local.LocalHashFileDB object at 0x7f7e7b2cf3a0>, remote=None, loaded=None)
(Pdb) entry.obj
(Pdb) entry.obj is None
True
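
To illustrate why an entry with obj=None ends in a bare TreeError, here is a minimal sketch. It is not the dvc-data implementation: ToyIndex, Entry and the local TreeError class are hypothetical stand-ins, and the only assumption taken from the debugger session above is that a directory entry whose obj was never loaded cannot be listed.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple

Key = Tuple[str, ...]


class TreeError(Exception):
    """Hypothetical stand-in for dvc_data.objects.tree.TreeError."""


@dataclass
class Entry:
    # Simplified stand-in for DataIndexEntry: only the field that matters here.
    # obj maps child names to hashes, or is None if the listing was never loaded.
    obj: Optional[Dict[str, str]] = None


class ToyIndex:
    def __init__(self) -> None:
        self._entries: Dict[Key, Entry] = {}

    def add(self, key: Key, entry: Entry) -> None:
        self._entries[key] = entry

    def ls(self, prefix: Key) -> Iterator[Key]:
        entry = self._entries[prefix]
        if entry.obj is None:
            # The directory's file list is unknown, so its children
            # cannot be enumerated.
            raise TreeError
        for name in entry.obj:
            yield prefix + (name,)


index = ToyIndex()
index.add(("data", "interim"), Entry(obj=None))
index.add(("data", "raw"), Entry(obj={"a.csv": "placeholder-hash"}))
```

In this sketch, list(index.ls(("data", "raw"))) yields the child key, while list(index.ls(("data", "interim"))) raises TreeError with no message, which matches the shape of the traceback above.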

hfrechen avatar Aug 09 '22 13:08 hfrechen

It should also be noted that the repo uses templating, and the stage output that triggers the TreeError is a directory whose path comes from templated params.

pmrowla avatar Aug 09 '22 23:08 pmrowla

Correct. In addition, DVC gitignores the stage output folder by default. At first I also tried to "ungitignore" it and push the files to the Git repo, so that the output folder and files actually exist in the registry. You can check that dvc list -R also fails at the rev of the initial commit: dvc list -R -vv https://github.com/hfrechen/data-registry-test --rev 32b2e3970c7b0e0d5054ca3977b9e41d6720255a

hfrechen avatar Aug 10 '22 07:08 hfrechen

It doesn't look like there is any default remote set in https://github.com/hfrechen/data-registry-test, without which DVC can't granularly list the contents of directories it's tracking or import data. In DVC<=2.9.4, I get the error dvc.exceptions.NoRemoteInExternalRepoError: No DVC remote is specified in target repository 'https://github.com/hfrechen/data-registry-test'.. It's probably best to keep showing an error like this.
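
A quick way to check for a default remote before digging into tracebacks could look like the sketch below. It assumes the repo-level .dvc/config is INI-like and stores the default remote as the remote option in the [core] section (which is what dvc remote add -d writes). The helper name default_remote is hypothetical, and the sketch reads a single config text only; DVC itself merges several config levels with its own loader.

```python
import configparser
from typing import Optional


def default_remote(config_text: str) -> Optional[str]:
    """Return the value of [core] remote, or None if no default remote is set."""
    # Strip indentation so configparser never mistakes an indented option
    # for a continuation of the previous value.
    normalized = "\n".join(line.strip() for line in config_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(normalized)
    return parser.get("core", "remote", fallback=None)


# Illustrative config texts; the remote name and URL are placeholders.
with_default = """[core]
    remote = storage
['remote "storage"']
    url = s3://my-bucket/dvc
"""
without_default = """[core]
    analytics = false
"""

print(default_remote(with_default))
print(default_remote(without_default))
```

For a cloned registry, the text would come from reading its .dvc/config file; a None result corresponds to the "no default remote" situation described above.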

dberenbaum avatar Aug 12 '22 20:08 dberenbaum

This error message was really helpful. I resolved my issues in two steps, where step 1 might not be necessary:

  1. I was using a local remote in the beginning. I changed this to a self-hosted MinIO S3 remote.
  2. I configured the new S3 remote in the data registry repo and in all consuming repos.

To move my data from the old remote to the new one, this Stack Overflow answer was helpful.

# Before setup (download all old remote cache to local machine):
dvc pull -r <old_remote_name> --all-commits --all-tags --all-branches

# After setup (upload all cache to a new remote):
dvc push -r <new_remote_name> --all-commits --all-tags --all-branches

hfrechen avatar Aug 17 '22 12:08 hfrechen

Just for your information: the TreeError appeared another time for me, again with the not very meaningful error message:

2022-08-23 16:06:53,279 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/cli/command.py", line 36, in do_run
    return self.run()
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/commands/ls/__init__.py", line 31, in run
    entries = Repo.ls(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/ls.py", line 46, in ls
    ret = _ls(repo, path, recursive, dvc_only)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/repo/ls.py", line 68, in _ls
    for root, dirs, files in fs.walk(
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 421, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/fsspec/spec.py", line 389, in walk
    listing = self.ls(path, detail=True, **kwargs)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc/fs/dvc.py", line 335, in ls
    for entry in dvc_fs.ls(dvc_path, detail=False):
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 318, in ls
    return self.fs.ls(path, detail=detail)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/fs.py", line 82, in ls
    for name in self.index.ls(prefix=root_key)
  File "/opt/conda/envs/feature-store/lib/python3.9/site-packages/dvc_data/index.py", line 138, in ls
    raise TreeError
dvc_data.objects.tree.TreeError

It took me a while to figure out the reason, because the default remote was set.

  1. I created another stage in my pipeline with new dependencies and new outputs.
  2. I ran dvc repro, which executed successfully.
  3. I committed and pushed the changes to the Git repo.
  4. I forgot to run dvc push and tried to import the new stage of the data registry in another repo. dvc list -R failed with a TreeError because the new stage's outputs did not exist in the DVC remote.
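
The situation in the steps above (Git knows about the outputs, the DVC remote does not) can be sketched as a simple comparison. This is an illustration only: outputs_missing_from_remote is a hypothetical helper, the lock dictionary mimics a parsed dvc.lock with made-up placeholder hashes, and remote_hashes stands in for whatever the remote actually holds. Real directory outputs also reference per-file objects, which this ignores.

```python
from typing import Dict, List, Set


def outputs_missing_from_remote(lock: Dict, remote_hashes: Set[str]) -> List[str]:
    """List output paths from a parsed dvc.lock whose md5 is not on the remote."""
    missing = []
    for stage in lock.get("stages", {}).values():
        for out in stage.get("outs", []):
            if out["md5"] not in remote_hashes:
                missing.append(out["path"])
    return missing


# Placeholder data: stage and path names follow the sample repo,
# the hashes are invented for illustration.
lock = {
    "schema": "2.0",
    "stages": {
        "data_import": {
            "outs": [
                {"path": "data/raw", "md5": "hash-raw"},
                {"path": "data/interim", "md5": "hash-interim"},
            ]
        }
    },
}
remote_hashes = {"hash-raw"}  # data/interim was committed to Git but never pushed

print(outputs_missing_from_remote(lock, remote_hashes))
```

A non-empty result is the case where a message like "Files tracked in your registry could not be found on remote" would be more helpful than a bare TreeError.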

Maybe you could improve the error messages here as well. Something like this?

  1. Issue 1: "Default remote not set. Please configure in .dvc/config"
  2. Issue 2: "Files tracked in your registry could not be found on remote"

hfrechen avatar Aug 23 '22 16:08 hfrechen