obj/s3: directory checksum for s3 fails with NotImplementedError

Open pmrowla opened this issue 4 years ago • 7 comments

Multiple users have reported this when adding or importing a directory on S3. The issue occurs with both import-url and add --external. Importing an individual file and get-url both work as expected.

(Reported against both 2.8.2 - s3 (s3fs = 2021.10.1, boto3 = 1.19.7) and 2.8.3 - s3 (s3fs = 2021.11.0, boto3 = 1.17.106).)

dvc import-url --file data/raw.dvc s3://test/sample data/raw -v
2021-11-21 22:10:16,748 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2021-11-21 22:10:16,834 DEBUG: Removing output 'data/raw/sample' of stage: 'data/raw.dvc'.
2021-11-21 22:10:16,834 DEBUG: Removing 'data/raw/sample'
Importing 's3://test/sample' -> 'data/raw/sample'
2021-11-21 22:10:16,839 DEBUG: Computed stage: 'data/raw.dvc' md5: '782d7c58160f763093fa1761aaea4bc5'
2021-11-21 22:10:16,839 DEBUG: 'md5' of stage: 'data/raw.dvc' changed.
2021-11-21 22:10:19,085 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
...
    _, self.meta, obj = ostage(
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 296, in stage
    meta, obj = _stage_tree(
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 170, in _stage_tree
    meta, tree = _build_tree(path_info, fs, name, odb=odb, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 138, in _build_tree
    for file_info, meta, obj in _iter_objects(path_info, fs, name, **kwargs):
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 130, in _iter_objects
    yield from _build_objects(path_info, fs, name, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 126, in _build_objects
    yield from executor.map(worker, walk_iterator)
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/progress.py", line 133, in wrapped
    res = fn(*args, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 83, in _stage_file
    meta, hash_info = get_file_hash(path_info, fs, name, state=state)
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 72, in get_file_hash
    meta, hash_info = _get_file_hash(path_info, fs, name)
  File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 57, in _get_file_hash
    raise NotImplementedError
NotImplementedError

add --external:

------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/command/add.py", line 21, in run
    self.repo.add(
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/utils/collections.py", line 163, in inner
    result = func(*ba.args, **ba.kwargs)
...
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/objects/stage.py", line 102, in _stage_file
    meta, hash_info = get_file_hash(path_info, fs, name, state=state)
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/objects/stage.py", line 90, in get_file_hash
    meta, hash_info = _get_file_hash(path_info, fs, name)
  File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/objects/stage.py", line 71, in _get_file_hash
    raise NotImplementedError
NotImplementedError

Discord context (with full tracebacks):

  • add --external: https://discord.com/channels/485586884165107732/485596304961962003/911253104694018058
  • import-url: https://discord.com/channels/485586884165107732/485596304961962003/912074110790680646

pmrowla avatar Nov 23 '21 03:11 pmrowla

I was unable to reproduce the issue with a simple test case and the dvc-temp bucket. More investigation is needed to determine what causes the problem.

pmrowla avatar Nov 23 '21 03:11 pmrowla

For the record: from previous reports, it looked like this happens when operating on the whole bucket instead of a particular directory within it.

efiop avatar Nov 23 '21 03:11 efiop

I am facing a similar error, shown below, after running dvc add --external s3://bucket_name/my_folder

Note: I am trying to add existing data in my_folder on S3. The folder contains the sub-directories /train, /val, and /test. my_folder does NOT contain any cache sub-folders.

Adding...
2021-12-01 14:54:09,699 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/command/add.py", line 21, in run
    self.repo.add(
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/utils/collections.py", line 163, in inner
    result = func(*ba.args, **ba.kwargs)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 50, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 14, in run
    return method(repo, *args, **kw)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/repo/add.py", line 190, in add
    stage.save()
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 457, in save
    self.save_outs(allow_missing=allow_missing)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 477, in save_outs
    out.save()
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/output.py", line 558, in save
    _, self.meta, self.obj = ostage(
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 296, in stage
    meta, obj = _stage_tree(
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 170, in _stage_tree
    meta, tree = _build_tree(path_info, fs, name, odb=odb, **kwargs)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 138, in _build_tree
    for file_info, meta, obj in _iter_objects(path_info, fs, name, **kwargs):
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 130, in _iter_objects
    yield from _build_objects(path_info, fs, name, **kwargs)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 126, in _build_objects
    yield from executor.map(worker, walk_iterator)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/progress.py", line 133, in wrapped
    res = fn(*args, **kwargs)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 83, in _stage_file
    meta, hash_info = get_file_hash(path_info, fs, name, state=state)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 72, in get_file_hash
    meta, hash_info = _get_file_hash(path_info, fs, name)
  File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 57, in _get_file_hash
    raise NotImplementedError

arq-divjyotsingh avatar Dec 01 '21 04:12 arq-divjyotsingh

The same bug occurs when using gcsfs.

Debugging shows that in the _get_file_hash call, fs_path is a folder, the filesystem is GSFileSystem, and name is etag. The info returned for a folder has no etag, so execution falls through to the NotImplementedError branch:

def _get_file_hash(fs_path, fs, name):
    info = _adapt_info(fs.info(fs_path), fs.scheme)

    if name in info:
        assert not info[name].endswith(".dir")
        hash_value = info[name]
    elif hasattr(fs, name):
        func = getattr(fs, name)
        hash_value = func(fs_path)
    elif name == "md5":
        hash_value = file_md5(fs_path, fs)
    else:
        raise NotImplementedError

    meta = Meta(size=info["size"])
    hash_info = HashInfo(name, hash_value)
    return meta, hash_info

What info contains for the folder:

{'bucket': 'our-bucket-here', 'name': 'our-bucket-here/folder/subfolder', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
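The fall-through can be reproduced without any cloud access. The following is a minimal sketch, not DVC's actual code: `pick_hash` is a hypothetical stand-in that keeps only the `name in info` branch of the function quoted above, `is_directory` is a hypothetical guard, and the file entry is invented for contrast. It assumes fsspec-style info dicts that carry a `type` key, as the directory entry above does.

```python
# Simplified stand-in for the lookup quoted above; NOT DVC's implementation.
def pick_hash(info: dict, name: str) -> str:
    """Return the hash stored under `name`, or raise like the quoted function."""
    if name in info:
        return info[name]
    # The quoted function also tries getattr(fs, name) and an md5 fallback.
    # Neither applies when name == "etag" and the fs has no `etag` method.
    raise NotImplementedError(f"no {name!r} for {info['name']!r}")


def is_directory(info: dict) -> bool:
    """Hypothetical guard: treat entries with type 'directory' as non-hashable."""
    return info.get("type") == "directory"


# Directory entry exactly as reported above: note there is no 'etag' key.
dir_info = {
    "bucket": "our-bucket-here",
    "name": "our-bucket-here/folder/subfolder",
    "size": 0,
    "storageClass": "DIRECTORY",
    "type": "directory",
}
# Invented file entry, for contrast only.
file_info = {
    "name": "our-bucket-here/folder/subfolder/a.csv",
    "size": 12,
    "type": "file",
    "etag": "abc123",
}

# With the guard, only real files reach the hash lookup.
hashes = {}
for info in (dir_info, file_info):
    if is_directory(info):
        continue
    hashes[info["name"]] = pick_hash(info, "etag")

# Without the guard, the directory entry hits the final branch.
try:
    pick_hash(dir_info, "etag")
    dir_failed = False
except NotImplementedError:
    dir_failed = True
```

If this reading is right, the error would come from a directory entry reaching the per-file hashing step. Whether skipping such entries is the appropriate fix inside DVC is a separate question that this sketch does not answer.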

I strongly suspect the same behavior occurs with S3 folders, but I haven't verified it yet.

Maybe this information gives you some ideas?

Update: I have found this comment: https://github.com/iterative/dvc/issues/5527#issuecomment-788221729

Does this mean that support for external outputs on bucket objects has been dropped permanently?

VOvchinnikov avatar Feb 10 '22 16:02 VOvchinnikov

@VOvchinnikov Support for external outputs in GCS was unfortunately dropped in DVC 2.0. See https://dvc.org/blog/dvc-2-0-release#breaking-changes. We would like to support it again in the future, but it is not being worked on at the moment, and would likely only happen after a refactor of external outputs.

cc @efiop

dberenbaum avatar Feb 10 '22 21:02 dberenbaum

@pmrowla @efiop Do you want to keep this as p1? Is there a plan to look into it further soon?

dberenbaum avatar Feb 14 '22 21:02 dberenbaum

I think this can be p2, since the problem mostly occurs in the --external use case. It is still a regression for import-url, though, so it should be looked into further at some point.

pmrowla avatar Feb 16 '22 07:02 pmrowla