pull: how to only download data from a specified remote?
Bug Report
Description
We have a DVC project with two remotes (remote_a and remote_b).
Most of our stages are parametrized and some outputs contain a remote attribute.
For example:
stages:
  my_stage:
    foreach: ['remote_a', 'remote_b']
    do:
      cmd: echo "my job on ${ key }" > file_${ key }.txt
      outs:
        - file_${ key }.txt:
            remote: ${ key }
We have set up CI with cml to reproduce stages on each PR, so two jobs run: one against remote_a and the other against remote_b. We use this setup because we need to run our machine learning models on 2 different sets of data that must reside in 2 different AWS regions. Job a should therefore not have access to remote_b (which is an S3 bucket), and vice versa.
However, running dvc pull --remote remote_a failed with the error Forbidden: An error occurred (403) when calling the HeadObject operation (full logs below). Looking at the logs, it seems that dvc pull --remote remote_a needs read access to remote_b.
Logs of the error
2022-09-14 15:45:05,240 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2022-09-14 15:45:05,321 WARNING: Output 'speech_to_text/models/hparams/dump_transfer.yaml'(stage: 'dump_transfer_yaml') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
2022-09-14 15:45:05,463 DEBUG: Preparing to transfer data from 'dvc-repository-speech-models-eu' to '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Preparing to collect status from '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Collecting status from '/github/home/dvc_cache'
2022-09-14 15:45:05,465 DEBUG: Preparing to collect status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Collecting status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Querying 1 oids via object_exists
2022-09-14 15:45:06,391 ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 110, in _error_wrapper
return await func(*args, **kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/aiobotocore/client.py", line 265, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
return self.run()
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
stats = self.repo.pull(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/pull.py", line 34, in pull
processed_files_count = self.fetch(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/fetch.py", line 45, in fetch
used = self.used_objs(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 430, in used_objs
for odb, objs in self.index.used_objs(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/index.py", line 240, in used_objs
for odb, objs in stage.get_used_objs(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 695, in get_used_objs
for odb, objs in out.get_used_objs(*args, **kwargs).items():
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 968, in get_used_objs
obj = self._collect_used_dir_cache(**kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 908, in _collect_used_dir_cache
self.get_dir_cache(jobs=jobs, remote=remote)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 890, in get_dir_cache
self.repo.cloud.pull([obj.hash_info], **kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 136, in pull
return self.transfer(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 88, in transfer
return transfer(src_odb, dest_odb, objs, **kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/transfer.py", line 158, in transfer
status = compare_status(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 185, in compare_status
src_exists, src_missing = status(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 136, in status
exists = hashes.intersection(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 56, in _indexed_dir_hashes
dir_exists.update(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/db.py", line 279, in list_oids_exists
yield from itertools.compress(oids, in_remote)
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 269, in exists
return self.fs.exists(path)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 111, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 96, in sync
raise return_result
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
result[0] = await coro
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 888, in _exists
await self._info(path, bucket, key, version_id=version_id)
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 1140, in _info
out = await self._call_s3(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 332, in _call_s3
return await _error_wrapper(
File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 137, in _error_wrapper
raise err
PermissionError: Forbidden
------------------------------------------------------------
2022-09-14 15:45:06,478 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,478 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/github/home/dvc_cache/.7m6JcKcUQKoTh7ZJHogetT.tmp'
2022-09-14 15:45:06,484 DEBUG: Version info for developers:
DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-14 15:45:06,486 DEBUG: Analytics is enabled.
2022-09-14 15:45:06,527 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'
2022-09-14 15:45:06,529 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'
The DVC docs seem pretty clear that only the specified remote will be pulled.
Why does dvc pull --remote remote_a need access to remote_b, though?
Environment information
Output of dvc doctor:
DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Mea culpa: the docs explain how the remote flag works, and that seems consistent with the behaviour I experienced:
The dvc remote used is determined in order, based on
- the remote fields in the dvc.yaml or .dvc files.
- the value passed to the --remote option via CLI.
- the value of the core.remote config option (see dvc remote default).
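The order quoted above can be read as a first-match lookup. Here is a minimal Python sketch of that reading; resolve_remote and its parameter names are hypothetical and only model the documented order, they are not DVC's API or implementation.

```python
from typing import Optional


def resolve_remote(
    out_remote: Optional[str],
    cli_remote: Optional[str],
    core_remote: Optional[str],
) -> Optional[str]:
    """Model of the documented precedence (hypothetical helper, not DVC code).

    out_remote:  the `remote` field on the output in dvc.yaml / a .dvc file
    cli_remote:  the value passed to --remote on the command line
    core_remote: the core.remote config option
    """
    # The first value that is set wins, in the order the docs list them.
    for candidate in (out_remote, cli_remote, core_remote):
        if candidate:
            return candidate
    return None


# An output pinned to remote_b still resolves to remote_b under --remote remote_a.
print(resolve_remote("remote_b", "remote_a", "remote_a"))
# An output with no remote field follows the --remote option.
print(resolve_remote(None, "remote_a", "remote_b"))
```

Under this reading, an output with its own remote field is never redirected by --remote, which would explain why pulling with --remote remote_a still touched remote_b.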
However, I'm still wondering how I can download all the data from a specified remote without explicitly listing all the stages/data. (Ideally, I'd like to download only what's required for the repro rather than everything; see https://github.com/iterative/dvc/issues/4742.)
Discussed that first we should document the behavior better in push/pull, but we will also leave this open as a feature request.
I took a closer look to document this, and I agree with @courentin that the current behavior is unexpected/unhelpful:
- For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?
- For data that has a specified remote field, I think DVC should skip it on push/pull. It seems surprising and potentially dangerous to need access to remote B even when specifying remote A. With the current behavior, there's no simple workaround to push things when you have access to only one remote. Is there a use case where the current behavior makes more sense?
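To make the proposed skipping behavior concrete, here is a rough Python sketch of the selection rule: keep only the targets whose remote field matches the requested remote, optionally together with targets that have no remote field. The helper names, the data.dvc entry, and the my_stage@remote_a style target names are assumptions for illustration. Passing the selected targets explicitly to dvc pull might serve as a stopgap, but whether that avoids every access to the other remote has not been verified here.

```python
from typing import Dict, List, Optional


def select_targets(
    out_remotes: Dict[str, Optional[str]],
    remote: str,
    include_unassigned: bool = True,
) -> List[str]:
    """Pick targets for one remote (hypothetical helper, not part of DVC).

    out_remotes maps a target name to the value of its `remote` field,
    or None when the output has no `remote` field.
    """
    selected = []
    for target, field in out_remotes.items():
        # Keep exact matches, and unassigned outputs if requested.
        if field == remote or (field is None and include_unassigned):
            selected.append(target)
    return sorted(selected)


def pull_command(targets: List[str], remote: str) -> List[str]:
    """Build (but do not run) a `dvc pull` invocation for the given targets."""
    return ["dvc", "pull", "--remote", remote, *targets]


# Hypothetical project layout: two pinned stages and one unpinned .dvc file.
outs = {
    "my_stage@remote_a": "remote_a",
    "my_stage@remote_b": "remote_b",
    "data.dvc": None,
}
print(pull_command(select_targets(outs, "remote_a"), "remote_a"))
```

Setting include_unassigned=False corresponds to the stricter flag mentioned in the first bullet, where only data that explicitly matches the specified remote is transferred.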
Update on current behavior.
I tested with two local remotes, default and other, and two files, foo and bar, with bar.dvc including remote: other:
$ tree
.
├── bar.dvc
└── foo.dvc
0 directories, 2 files
$ cat .dvc/config
[core]
    remote = default
['remote "default"']
    url = /Users/dave/dvcremote
['remote "other"']
    url = /Users/dave/dvcremote2
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo
$ cat bar.dvc
outs:
- md5: c157a79031e1c40f85931829bc5fc552
  size: 4
  hash: md5
  path: bar
  remote: other
Here's what dvc pull does with different options (I reset to the state above before each pull).
Simple dvc pull:
$ dvc pull
A foo
A bar
2 files added and 2 files fetched
This is what I would expect. It pulls each file from its respective remote.
Next, pulling only from the other remote:
$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>
This makes sense to me also. It pulls only from the other remote. If we want it not to fail, we can include --allow-missing:
$ dvc pull -r other --allow-missing
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A bar
1 file added and 1 file fetched
Finally, we pull only from default:
$ dvc pull -r default
A bar
A foo
2 files added and 2 files fetched
This gives us the same behavior as dvc pull without any specified remote. This is the only option that doesn't make sense to me. If I manually specify -r default, I would not expect data to be pulled from other.
Thank you for taking a look :)
For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?
For my use case, I think it's ok to include data that has no remote field
Hello! Thanks a lot for your help responding to this question! I believe I am in exactly the same boat as OP.
I have 2 datasets, which I uploaded to S3 from my local machine using DVC. Locally, I have a folder of images called datasetA that I uploaded to S3 by running dvc add datasetA and then dvc push -r remoteA (remoteA is defined in my .dvc config file). I cleared the cache (with a manual file delete), then followed the same steps to push datasetB to remoteB. In my datasetA.dvc and datasetB.dvc files, I set the remote field to remoteA and remoteB respectively (the names of the remotes in the config). I did this by editing the files manually.
Next, pulling only from the other remote:
$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>
My goal is to be able to say dvc pull -r remoteA and get DatasetA files only, and vice versa with B. So I cleared my cache (manually again), and did the above commands but they both pulled both remoteA and remoteB. I still have a default remote set to remoteA, but I don't know if that is the issue. I am wondering if there is something I am missing here in how you were able to configure your dvc files to make it work? Thank you so much for everyone's time and help.
(also I wish I was able to supply code but for other reasons I am unable to 😞 , sorry for the inconvenience).