pull: how to only download data from a specified remote?

Open courentin opened this issue 2 years ago • 5 comments

Bug Report

Description

We have a DVC project with two remotes (remote_a and remote_b). Most of our stages are parametrized, and some outputs have a remote attribute.

For example:

stages:
  my_stage:
    foreach: ['remote_a', 'remote_b']
    do:
      # When foreach iterates over a list, each element is exposed as ${ item }
      cmd: echo "my job on ${ item }" > file_${ item }.txt
      outs:
        - file_${ item }.txt:
            remote: ${ item }

We have set up CI with CML to reproduce stages on each PR. Two jobs run, one against remote_a and the other against remote_b. We use this setup because we need to run our machine learning models on two different datasets that must reside in two different AWS regions. Job A should therefore not have access to remote_b (which is an S3 bucket), and vice versa.

However, running dvc pull --remote remote_a failed with the error Forbidden: An error occurred (403) when calling the HeadObject operation (full logs below). Judging by the logs, dvc pull --remote remote_a needs read access to remote_b.

Logs of the error
2022-09-14 15:45:05,240 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2022-09-14 15:45:05,321 WARNING: Output 'speech_to_text/models/hparams/dump_transfer.yaml'(stage: 'dump_transfer_yaml') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
2022-09-14 15:45:05,463 DEBUG: Preparing to transfer data from 'dvc-repository-speech-models-eu' to '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Preparing to collect status from '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Collecting status from '/github/home/dvc_cache'
2022-09-14 15:45:05,465 DEBUG: Preparing to collect status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Collecting status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Querying 1 oids via object_exists
2022-09-14 15:45:06,391 ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 110, in _error_wrapper
    return await func(*args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/aiobotocore/client.py", line 265, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/fetch.py", line 45, in fetch
    used = self.used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 430, in used_objs
    for odb, objs in self.index.used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/index.py", line 240, in used_objs
    for odb, objs in stage.get_used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 695, in get_used_objs
    for odb, objs in out.get_used_objs(*args, **kwargs).items():
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 968, in get_used_objs
    obj = self._collect_used_dir_cache(**kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 908, in _collect_used_dir_cache
    self.get_dir_cache(jobs=jobs, remote=remote)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 890, in get_dir_cache
    self.repo.cloud.pull([obj.hash_info], **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 136, in pull
    return self.transfer(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 88, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/transfer.py", line 158, in transfer
    status = compare_status(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 185, in compare_status
    src_exists, src_missing = status(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 136, in status
    exists = hashes.intersection(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 56, in _indexed_dir_hashes
    dir_exists.update(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/db.py", line 279, in list_oids_exists
    yield from itertools.compress(oids, in_remote)
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 269, in exists
    return self.fs.exists(path)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 111, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 96, in sync
    raise return_result
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 888, in _exists
    await self._info(path, bucket, key, version_id=version_id)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 332, in _call_s3
    return await _error_wrapper(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 137, in _error_wrapper
    raise err
PermissionError: Forbidden
------------------------------------------------------------
2022-09-14 15:45:06,478 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,478 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/github/home/dvc_cache/.7m6JcKcUQKoTh7ZJHogetT.tmp'
2022-09-14 15:45:06,484 DEBUG: Version info for developers:
DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-14 15:45:06,486 DEBUG: Analytics is enabled.
2022-09-14 15:45:06,527 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'
2022-09-14 15:45:06,529 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'

The DVC docs seem pretty clear that only the specified remote will be pulled.

Why does dvc pull --remote remote_a need access to remote_b, though?

Environment information

Output of dvc doctor:

DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

courentin avatar Sep 15 '22 13:09 courentin

Mea culpa: the docs explain how the remote flag works, and that is consistent with the behaviour I experienced:

The dvc remote used is determined in order, based on

  • the remote fields in the dvc.yaml or .dvc files.
  • the value passed to the --remote option via CLI.
  • the value of the core.remote config option (see dvc remote default).
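As a minimal sketch of that lookup order (resolve_remote is a hypothetical helper written for illustration, not a DVC function), the first value that is set wins, which is why an output pinned to remote_b is still fetched from remote_b when --remote remote_a is passed:

```python
from typing import Optional


def resolve_remote(
    out_remote: Optional[str],
    cli_remote: Optional[str],
    default_remote: Optional[str],
) -> Optional[str]:
    """Return the first remote that is set, following the documented order."""
    # 1. remote field on the output, 2. --remote on the CLI, 3. core.remote
    for candidate in (out_remote, cli_remote, default_remote):
        if candidate:
            return candidate
    return None


# Output pinned to remote_b: the per-output field wins over --remote remote_a.
print(resolve_remote("remote_b", "remote_a", "remote_a"))  # remote_b
# Output with no remote field: the CLI value is used.
print(resolve_remote(None, "remote_a", "remote_b"))  # remote_a
# Nothing on the output or the CLI: fall back to core.remote.
print(resolve_remote(None, None, "remote_b"))  # remote_b
```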

However, I'm still wondering how I can download all the data from a specified remote without explicitly listing every stage or data file. (Ideally I'd rather not download everything, only what's required for the repro: https://github.com/iterative/dvc/issues/4742.)

courentin avatar Sep 16 '22 11:09 courentin

We discussed this: first we should document the behavior better in push/pull, but we will also leave this open as a feature request.

dberenbaum avatar Sep 27 '22 12:09 dberenbaum

I took a closer look to document this, and I agree with @courentin that the current behavior is unexpected/unhelpful:

  1. For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?
  2. For data that has a specified remote field, I think DVC should skip it on push/pull. It seems surprising and potentially dangerous to need access to remote B even when specifying remote A. With the current behavior, there's no simple workaround to push things when you have access to only one remote. Is there a use case where the current behavior makes more sense?
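The selection rule proposed in the two points above could be sketched roughly as follows. This is only an illustration of the idea, not existing DVC behavior: select_outputs, include_unpinned, and shared.txt are hypothetical names, and the two file_*.txt paths are the outputs the example stage from the original report would produce.

```python
from typing import Dict, List, Optional


def select_outputs(
    out_remotes: Dict[str, Optional[str]],
    requested_remote: str,
    include_unpinned: bool = True,
) -> List[str]:
    """Pick the outputs a `pull -r requested_remote` would touch under the proposal."""
    selected = []
    for path, pinned in out_remotes.items():
        if pinned is None:
            # Point 1: data with no remote field goes to/from the requested remote.
            if include_unpinned:
                selected.append(path)
        elif pinned == requested_remote:
            selected.append(path)
        # Point 2: data pinned to a different remote is skipped, so that
        # remote is never contacted.
    return selected


outs = {
    "file_remote_a.txt": "remote_a",
    "file_remote_b.txt": "remote_b",
    "shared.txt": None,  # hypothetical output with no remote field
}
print(select_outputs(outs, "remote_a"))
# ['file_remote_a.txt', 'shared.txt']
print(select_outputs(outs, "remote_a", include_unpinned=False))
# ['file_remote_a.txt']
```

Until something like this exists in DVC, a list computed this way could presumably be passed as explicit targets to dvc pull, at the cost of having to read the remote fields from dvc.yaml and the .dvc files yourself.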

dberenbaum avatar Sep 27 '22 15:09 dberenbaum

Update on current behavior.

I tested with two local remotes, default and other, and two files, foo and bar, with bar.dvc including remote: other:

$ tree
.
├── bar.dvc
└── foo.dvc

0 directories, 2 files

$ cat .dvc/config
[core]
    remote = default
['remote "default"']
    url = /Users/dave/dvcremote
['remote "other"']
    url = /Users/dave/dvcremote2

$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo

$ cat bar.dvc
outs:
- md5: c157a79031e1c40f85931829bc5fc552
  size: 4
  hash: md5
  path: bar
  remote: other

Here's what dvc pull does with different options (I reset to the state above before each pull).

Simple dvc pull:

$ dvc pull
A       foo
A       bar
2 files added and 2 files fetched

This is what I would expect. It pulls each file from its respective remote.

Next, pulling only from the other remote:

$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

This makes sense to me also. It pulls only from the other remote. If we want it not to fail, we can include --allow-missing:

$ dvc pull -r other --allow-missing
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched

Finally, we pull only from default:

$ dvc pull -r default
A       bar
A       foo
2 files added and 2 files fetched

This gives us the same behavior as dvc pull without any specified remote. This is the only option that doesn't make sense to me. If I manually specify -r default, I would not expect data to be pulled from other.
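To make the three runs easier to compare, here is a small simulation that reproduces the observed results. It is a sketch under assumptions, not DVC's implementation: simulate_pull is a hypothetical function, and it assumes foo's object exists only in default and bar's only in other, which is what the output above suggests.

```python
from typing import List, Optional, Tuple

DEFAULT_REMOTE = "default"  # core.remote in .dvc/config

# Assumed contents of each remote (hash of the stored object).
REMOTE_CONTENTS = {
    "default": {"d3b07384d113edec49eaa6238ad5ff00"},  # foo
    "other": {"c157a79031e1c40f85931829bc5fc552"},  # bar
}

# The two tracked outputs; only bar.dvc carries a remote field.
OUTPUTS = [
    {"path": "foo", "md5": "d3b07384d113edec49eaa6238ad5ff00", "remote": None},
    {"path": "bar", "md5": "c157a79031e1c40f85931829bc5fc552", "remote": "other"},
]


def simulate_pull(cli_remote: Optional[str] = None) -> Tuple[List[str], List[str]]:
    """Return (fetched, missing) paths for a pull with an optional -r value."""
    fetched, missing = [], []
    for out in OUTPUTS:
        # Per-output remote first, then -r, then the configured default.
        remote = out["remote"] or cli_remote or DEFAULT_REMOTE
        if out["md5"] in REMOTE_CONTENTS[remote]:
            fetched.append(out["path"])
        else:
            missing.append(out["path"])
    return fetched, missing


print(simulate_pull())           # (['foo', 'bar'], [])
print(simulate_pull("other"))    # (['bar'], ['foo'])
print(simulate_pull("default"))  # (['foo', 'bar'], [])
```

The last line is the surprising case: bar is still fetched from other because its remote field takes precedence over -r default.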

dberenbaum avatar Aug 28 '23 12:08 dberenbaum

Thank you for taking a look :)

For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?

For my use case, I think it's OK to include data that has no remote field.

courentin avatar Oct 10 '23 12:10 courentin

Hello! Thanks a lot for your help responding to this question! I am actually in, I believe, the same exact boat as OP.

I have two datasets, which I uploaded to S3 from my local machine using DVC. Locally, I have a folder of images called datasetA that I uploaded to S3 by running dvc add datasetA and then dvc push -r remoteA (remoteA is defined in my .dvc config file). I cleared the cache (by deleting the files manually), then followed exactly the same steps to push datasetB to remoteB. In my datasetA.dvc and datasetB.dvc files, the remote field is set to remoteA and remoteB respectively (the names of the remotes in the config). I set these manually by editing the files.

Next, pulling only from the other remote:

$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

My goal is to be able to run dvc pull -r remoteA and get only the datasetA files, and vice versa for B. So I cleared my cache (manually again) and ran the commands above, but each of them pulled from both remoteA and remoteB. I still have the default remote set to remoteA, but I don't know whether that is the issue. Is there something I am missing in how you configured your .dvc files to make this work? Thank you all so much for your time and help.

(Also, I wish I could supply code, but for other reasons I am unable to 😞. Sorry for the inconvenience.)

spaghevin avatar Mar 21 '24 21:03 spaghevin