push: Google bucket with hyphen in name results in "ERROR: unexpected error - b/my-repository/o"
Bug Report
Description
Followed the instructions to use a GCP remote on this iterative.ai blog post.
When I got to the dvc push step, it failed with ERROR: unexpected error - b/my-repository/o. The last few lines of the traceback:
File "/..masked path.../.venv/lib/python3.10/site-packages/gcsfs/retry.py", line 84, in validate_response
raise FileNotFoundError(path)
FileNotFoundError: b/my-repository/o
See end of this report for the full traceback.
Reproduce
Follow the steps on the GCP remote blog post, naming the GCP bucket with a hyphen, e.g. my-repository.
When running the dvc push part of the guide, it will show the error above.
Expected
Expected the local objects to be uploaded to the GCP bucket.
After I tried with a bucket that doesn't have a hyphen, e.g. myrepository, the dvc push step worked:
> dvc push -v
2022-09-14 16:28:47,718 DEBUG: Preparing to transfer data from '...' to 'myrepository'
...
2022-09-14 16:28:50,281 DEBUG: Querying '775' oids via traverse
...
775 files pushed
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.24.0 (pip)
---------------------------------
Platform: Python 3.10.2 on macOS-12.4-x86_64-i386-64bit
Subprojects:
dvc_data = 0.4.0
dvc_objects = 0.2.0
dvc_render = 0.0.9
dvc_task = 0.1.2
dvclive = 0.10.0
scmrepo = 0.0.25
Supports:
gs (gcsfs = 2022.8.2),
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: gs
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Additional Information (if any):
Full error traceback:
> dvc push -v
2022-09-14 16:11:25,599 DEBUG: Preparing to transfer data from '/..masked path.../.dvc/cache' to 'my-repository'
2022-09-14 16:11:25,599 DEBUG: Preparing to collect status from 'my-repository'
2022-09-14 16:11:25,600 DEBUG: Collecting status from 'my-repository'
2022-09-14 16:11:25,600 DEBUG: Querying 1 oids via object_exists
2022-09-14 16:11:28,427 ERROR: unexpected error - b/my-repository/o
------------------------------------------------------------
Traceback (most recent call last):
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/cli/command.py", line 22, in do_run
return self.run()
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 59, in run
processed_files_count = self.repo.push(
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/repo/push.py", line 68, in push
pushed += self.cloud.push(
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/data_cloud.py", line 109, in push
return self.transfer(
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc/data_cloud.py", line 88, in transfer
return transfer(src_odb, dest_odb, objs, **kwargs)
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_data/transfer.py", line 158, in transfer
status = compare_status(
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_data/status.py", line 179, in compare_status
dest_exists, dest_missing = status(
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_data/status.py", line 151, in status
odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback)
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/db.py", line 337, in oids_exist
remote_size, remote_oids = self._estimate_remote_size(
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/db.py", line 214, in _estimate_remote_size
remote_oids = set(iter_with_pbar(oids))
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/db.py", line 204, in iter_with_pbar
for oid in oids:
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/db.py", line 170, in _oids_with_limit
for oid in self._list_oids(prefix):
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/db.py", line 160, in _list_oids
for path in self._list_paths(prefix):
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/db.py", line 144, in _list_paths
yield from self.fs.find(self.fs.path.join(*parts), prefix=bool(prefix))
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 535, in find
files = self.fs.find(with_prefix, prefix=self.path.parts(path)[-1])
File "/..masked path.../.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 111, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/..masked path.../.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 96, in sync
raise return_result
File "/..masked path.../.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 53, in _runner
result[0] = await coro
File "/..masked path.../.venv/lib/python3.10/site-packages/dvc_gs/gcsfs.py", line 24, in _find
objects, _ = await self._do_list_objects(
File "/..masked path.../.venv/lib/python3.10/site-packages/gcsfs/core.py", line 521, in _do_list_objects
page = await self._call(
File "/..masked path.../.venv/lib/python3.10/site-packages/gcsfs/core.py", line 392, in _call
status, headers, info, contents = await self._request(
File "/..masked path.../.venv/lib/python3.10/site-packages/decorator.py", line 221, in fun
return await caller(func, *(extras + args), **kw)
File "/..masked path.../.venv/lib/python3.10/site-packages/gcsfs/retry.py", line 115, in retry_request
return await func(*args, **kwargs)
File "/..masked path.../.venv/lib/python3.10/site-packages/gcsfs/core.py", line 384, in _request
validate_response(status, contents, path, args)
File "/..masked path.../.venv/lib/python3.10/site-packages/gcsfs/retry.py", line 84, in validate_response
raise FileNotFoundError(path)
FileNotFoundError: b/my-repository/o
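A note on reading the message: `b/my-repository/o` appears to be the resource path of the GCS JSON API "list objects" endpoint (`b/<bucket>/o`), so the `FileNotFoundError` most likely means the list request for that bucket returned a 404, not that a local file is missing. A minimal sketch for pulling the bucket name out of such a message; `bucket_from_list_path` is a hypothetical helper, not part of dvc or gcsfs:

```python
import re

# Assumption: the error text is the GCS JSON API resource path "b/<bucket>/o"
# (the "list objects" endpoint), as it appears in the traceback above.
_LIST_OBJECTS_PATH = re.compile(r"^b/(?P<bucket>[^/]+)/o$")


def bucket_from_list_path(path):
    """Return the bucket name from a 'b/<bucket>/o' path, or None if it does not match."""
    match = _LIST_OBJECTS_PATH.match(path)
    return match.group("bucket") if match else None


print(bucket_from_list_path("b/my-repository/o"))  # my-repository
```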
Hi. I think this is related to the fact that the remote is set to the bucket's root. If you're following the guide, could you modify the remote url so that it points to gs://updatedbikedata/cache?
dvc remote modify bikes url gs://updatedbikedata/cache
(or just delete/re-add it: dvc remote add -d bikes gs://updatedbikedata/cache)
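To make the difference between the two URLs concrete: one points at the bucket root, the other at a `cache` prefix inside the same bucket. A small sketch using only the standard library; `split_gs_url` is a hypothetical helper for illustration and is not how dvc itself parses remote URLs:

```python
from urllib.parse import urlsplit


def split_gs_url(url):
    """Split a gs:// URL into (bucket, prefix); the prefix is '' for the bucket root."""
    parts = urlsplit(url)
    if parts.scheme != "gs":
        raise ValueError(f"not a gs:// URL: {url!r}")
    # netloc holds the bucket name; the path (if any) is the prefix inside it.
    return parts.netloc, parts.path.lstrip("/")


print(split_gs_url("gs://updatedbikedata"))        # ('updatedbikedata', '')
print(split_gs_url("gs://updatedbikedata/cache"))  # ('updatedbikedata', 'cache')
```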
> If you're following the guide, could you modify the remote url so that it points to gs://updatedbikedata/cache?
Thanks for the quick reply. Note from the description that:
- I'm following this blog post. In the section that creates the remote, the remote URL doesn't include /cache. The command there is
$ dvc remote add -d bikes gs://updatedbikedata
- It worked as soon as I changed from my-repository to myrepository (removed the hyphen).
To be more precise:
- This fails:
dvc remote add -d my-repository gs://my-repository
- This works:
dvc remote add -d myrepository gs://myrepository
Looks like a gcsfs limitation; not much we can do on the dvc side. Closing for now.
Hello @efiop, can you please point to a reference that documents the gcsfs limitation? Asking because I can access buckets with hyphens in other applications.
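For context, the documented GCS bucket naming rules allow dashes, so the name itself should be valid. A simplified check, covering only a subset of those rules (allowed characters, first and last character, and the 3 to 63 character length); `looks_like_valid_bucket_name` is a hypothetical helper, not part of gcsfs or dvc:

```python
import re

# Simplified subset of the GCS bucket naming rules (an assumption-laden sketch):
# lowercase letters, digits, dashes, underscores and dots; must start and end
# with a letter or digit; 3 to 63 characters. Names containing dots may be
# longer in practice, and reserved words are not checked here.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the simplified checks above."""
    return bool(_BUCKET_NAME.match(name))


print(looks_like_valid_bucket_name("my-repository"))  # True
print(looks_like_valid_bucket_name("myrepository"))   # True
```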
@cgarbin I don't have anything specific to point to, unfortunately. Just saying that this issue doesn't seem to be in dvc specifically, but rather in an underlying standalone library. One would need to research further (e.g. try gcsfs out), but we don't have the capacity to do it ourselves right now 🙁 If you would be willing to research yourself, we will be happy to help.
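If someone wants to try gcsfs on its own, a minimal sketch could look like the following. It assumes gcsfs is installed and that default Google credentials are available in the environment; `open_gcs` and `probe_bucket` are hypothetical helper names, and the bucket name is the one from the report.

```python
def open_gcs():
    """Create a gcsfs filesystem using the default Google credentials."""
    import gcsfs  # imported lazily so probe_bucket can be used without gcsfs installed

    return gcsfs.GCSFileSystem()


def probe_bucket(fs, bucket):
    """Try to list the bucket root and return (ok, message)."""
    try:
        entries = fs.ls(bucket)
    except FileNotFoundError as exc:
        # Same exception type as in the traceback from the report.
        return False, f"not found: {exc}"
    except OSError as exc:
        # Any other I/O-level failure; other exception types are left to propagate.
        return False, f"I/O error: {exc}"
    return True, f"{len(entries)} entries"
```

With credentials in place, `print(probe_bucket(open_gcs(), "my-repository"))` would show whether a plain listing through gcsfs succeeds outside of dvc, which helps narrow down which layer the failure comes from.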
@cgarbin does this keep happening with recent dvc versions?
I was able to reproduce back in September, but it seems the issue has been fixed in the meantime.
Tested with
dvc = 2.38.1
dvc_data = 0.28.4
dvc_objects = 0.14.0
gcsfs = 2022.11