dvc
obj/s3: directory checksum for s3 fails with NotImplementedError
Reported by multiple users when adding or importing a directory on S3. The issue occurs with both import-url and add --external. Importing an individual file and get-url work as expected.
(Reported against both 2.8.2 - s3 (s3fs = 2021.10.1, boto3 = 1.19.7) and 2.8.3 - s3 (s3fs = 2021.11.0, boto3 = 1.17.106).)
dvc import-url --file data/raw.dvc s3://test/sample data/raw -v
2021-11-21 22:10:16,748 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2021-11-21 22:10:16,834 DEBUG: Removing output 'data/raw/sample' of stage: 'data/raw.dvc'.
2021-11-21 22:10:16,834 DEBUG: Removing 'data/raw/sample'
Importing 's3://test/sample' -> 'data/raw/sample'
2021-11-21 22:10:16,839 DEBUG: Computed stage: 'data/raw.dvc' md5: '782d7c58160f763093fa1761aaea4bc5'
2021-11-21 22:10:16,839 DEBUG: 'md5' of stage: 'data/raw.dvc' changed.
2021-11-21 22:10:19,085 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
...
_, self.meta, obj = ostage(
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 296, in stage
meta, obj = _stage_tree(
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 170, in _stage_tree
meta, tree = _build_tree(path_info, fs, name, odb=odb, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 138, in _build_tree
for file_info, meta, obj in _iter_objects(path_info, fs, name, **kwargs):
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 130, in _iter_objects
yield from _build_objects(path_info, fs, name, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 126, in _build_objects
yield from executor.map(worker, walk_iterator)
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/progress.py", line 133, in wrapped
res = fn(*args, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 83, in _stage_file
meta, hash_info = get_file_hash(path_info, fs, name, state=state)
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 72, in get_file_hash
meta, hash_info = _get_file_hash(path_info, fs, name)
File "/root/.cache/pypoetry/virtualenvs/qt-expressions-classification-wj407T3L-py3.8/lib/python3.8/site-packages/dvc/objects/stage.py", line 57, in _get_file_hash
raise NotImplementedError
NotImplementedError
add --external:
------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/command/base.py", line 45, in do_run
return self.run()
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/command/add.py", line 21, in run
self.repo.add(
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/utils/collections.py", line 163, in inner
result = func(*ba.args, **ba.kwargs)
...
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/objects/stage.py", line 102, in _stage_file
meta, hash_info = get_file_hash(path_info, fs, name, state=state)
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/objects/stage.py", line 90, in get_file_hash
meta, hash_info = _get_file_hash(path_info, fs, name)
File "/opt/conda/envs/dvc/lib/python3.9/site-packages/dvc/objects/stage.py", line 71, in _get_file_hash
raise NotImplementedError
NotImplementedError
Discord context (with full tracebacks):
add --external: https://discord.com/channels/485586884165107732/485596304961962003/911253104694018058
import-url: https://discord.com/channels/485586884165107732/485596304961962003/912074110790680646
I was unable to reproduce the issue with a simple test case and the dvc-temp bucket; more investigation is needed to determine what causes the problem.
For the record: from previous reports, it looked like this happens when operating on the whole bucket rather than on a particular directory within it.
I am facing a similar error, shown below, after running dvc add --external s3://bucket_name/my_folder
Note: I am trying to add existing data in my_folder on S3. The folder contains the sub-directories /train, /val and /test. my_folder does NOT contain any cache sub-folders.
Adding...
2021-12-01 14:54:09,699 ERROR: unexpected error
------------------------------------------------------------
Traceback (most recent call last):
File "/Users/........./venv/lib/python3.9/site-packages/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/Users/........./venv/lib/python3.9/site-packages/dvc/command/base.py", line 45, in do_run
return self.run()
File "/Users/........./venv/lib/python3.9/site-packages/dvc/command/add.py", line 21, in run
self.repo.add(
File "/Users/........./venv/lib/python3.9/site-packages/dvc/utils/collections.py", line 163, in inner
result = func(*ba.args, **ba.kwargs)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 50, in wrapper
return f(repo, *args, **kwargs)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 14, in run
return method(repo, *args, **kw)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/repo/add.py", line 190, in add
stage.save()
File "/Users/........./venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 457, in save
self.save_outs(allow_missing=allow_missing)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 477, in save_outs
out.save()
File "/Users/........./venv/lib/python3.9/site-packages/dvc/output.py", line 558, in save
_, self.meta, self.obj = ostage(
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 296, in stage
meta, obj = _stage_tree(
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 170, in _stage_tree
meta, tree = _build_tree(path_info, fs, name, odb=odb, **kwargs)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 138, in _build_tree
for file_info, meta, obj in _iter_objects(path_info, fs, name, **kwargs):
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 130, in _iter_objects
yield from _build_objects(path_info, fs, name, **kwargs)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 126, in _build_objects
yield from executor.map(worker, walk_iterator)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
yield fs.pop().result()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result
return self.__get_result()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/progress.py", line 133, in wrapped
res = fn(*args, **kwargs)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 83, in _stage_file
meta, hash_info = get_file_hash(path_info, fs, name, state=state)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 72, in get_file_hash
meta, hash_info = _get_file_hash(path_info, fs, name)
File "/Users/........./venv/lib/python3.9/site-packages/dvc/objects/stage.py", line 57, in _get_file_hash
raise NotImplementedError
The same bug was observed when using gcsfs.
Debugging it: in the _get_file_hash call, fs_path is a folder, the filesystem is GSFileSystem, and name is etag. However, info has no etag for folders, so execution falls through to the NotImplementedError branch:
def _get_file_hash(fs_path, fs, name):
    info = _adapt_info(fs.info(fs_path), fs.scheme)
    if name in info:
        assert not info[name].endswith(".dir")
        hash_value = info[name]
    elif hasattr(fs, name):
        func = getattr(fs, name)
        hash_value = func(fs_path)
    elif name == "md5":
        hash_value = file_md5(fs_path, fs)
    else:
        raise NotImplementedError
    meta = Meta(size=info["size"])
    hash_info = HashInfo(name, hash_value)
    return meta, hash_info
What info contains for the folder:
{'bucket': 'our-bucket-here', 'name': 'our-bucket-here/folder/subfolder', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
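To make the fall-through concrete, here is a minimal standalone sketch that mirrors only the branch order of the _get_file_hash snippet quoted above. It does not import DVC. The helper name get_file_hash_sketch, the stand-in fake_fs object, and the file_info entry (including its etag value) are hypothetical and exist only for illustration; dir_info is the directory entry reported above.

```python
from types import SimpleNamespace


def get_file_hash_sketch(info, fs, name):
    """Mirror the branch order of the quoted _get_file_hash (illustration only)."""
    if name in info:
        # Object stores usually report the hash (e.g. an etag) in the info dict.
        return info[name]
    if hasattr(fs, name):
        # Otherwise the filesystem may expose a method named after the hash.
        return getattr(fs, name)(info["name"])
    if name == "md5":
        # The real code hashes the file contents here; not modelled in this sketch.
        raise RuntimeError("md5 fallback is not modelled in this sketch")
    raise NotImplementedError


# Hypothetical stand-in filesystem with no `etag` method.
fake_fs = SimpleNamespace(scheme="gs")

# Hypothetical file entry: carries an etag, so the first branch succeeds.
file_info = {
    "name": "our-bucket-here/folder/file.csv",
    "size": 12,
    "type": "file",
    "etag": "abc123",
}

# Directory entry as reported above: no etag key at all.
dir_info = {
    "bucket": "our-bucket-here",
    "name": "our-bucket-here/folder/subfolder",
    "size": 0,
    "storageClass": "DIRECTORY",
    "type": "directory",
}

file_hash = get_file_hash_sketch(file_info, fake_fs, "etag")

try:
    get_file_hash_sketch(dir_info, fake_fs, "etag")
    dir_outcome = "hashed"
except NotImplementedError:
    dir_outcome = "NotImplementedError"

print(file_hash, dir_outcome)
```

Under these assumptions, any directory-typed entry that reaches the per-file hashing step with name set to etag ends in the final branch, which matches the last frames of the tracebacks above.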
I strongly suspect the same behavior occurs with S3 folders, but I haven't verified it yet.
Maybe this information gives you some ideas?
Update: I have found this particular comment: https://github.com/iterative/dvc/issues/5527#issuecomment-788221729
Does this mean that support for external outputs on bucket objects has been dropped permanently?
@VOvchinnikov Support for external outputs in GCS was unfortunately dropped in DVC 2.0. See https://dvc.org/blog/dvc-2-0-release#breaking-changes. We would like to support it again in the future, but it is not being worked on at the moment, and would likely only happen after a refactor of external outputs.
cc @efiop
@pmrowla @efiop Do you want to keep this as p1? Is there a plan to look into it further soon?
I think this can be p2, since the problem mostly occurs in the --external use case. However, this is still a regression for import-url, so it should be looked into further at some point.