
fetch: initial fetch of a cloud upload-url takes *forever*

d1sounds opened this issue 2 years ago · 7 comments

Bug Report

fetch: initial fetch of a cloud upload-url takes *forever*

Description

I'm importing several large s3 buckets (dvc import-url --version-aware s3://bucket/path) which are about 100k files totalling around 50GB. When I pull a new repo, I expected the initial dvc fetch to be slow, but it's really slow (like 15 hours over gigabit fiber!). Way slower than the initial import-url. I started debugging, and what I found is that almost all of the time is going into the initial remote index building (md5() in dvc_data/index/save.py).

In a nutshell what I found is:

  1. The indexing fetches the full bucket from s3 sequentially to compute the md5s, which is far slower than the actual fetch, since the fetch runs with some parallelism.
  2. It's unfortunate that the full bucket is fetched for the md5s, thrown away, then refetched for the fetch(). The structure of the code makes sense, but a single fetch (the fast one!) would obviously be preferable.
  3. When the remote index is computed (fetch() in dvc_data/index/fetch.py), the index isn't saved (save()) until the entire thing is computed. When the fetch fails or is cancelled (which easily happens over 15 hours), none of the intermediate progress is kept!
  4. There's also no progress indicator during any of this.
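
Point 2 is essentially asking for a single streaming pass: hash the bytes while downloading them, then keep the result. A minimal sketch of that idea (the function name and cache layout here are hypothetical, not DVC internals):

```python
import hashlib
import os
import shutil
import tempfile

def download_and_hash(stream, cache_dir, chunk_size=1 << 20):
    """Read a file-like stream exactly once: hash it while spooling the
    bytes to disk, then move the file into a content-addressed cache
    under its md5 (cache/<first 2 hex chars>/<remaining 30>)."""
    md5 = hashlib.md5()
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            while chunk := stream.read(chunk_size):
                md5.update(chunk)
                tmp.write(chunk)
        digest = md5.hexdigest()
        dest = os.path.join(cache_dir, digest[:2], digest[2:])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(tmp_path, dest)
        return digest
    finally:
        # Clean up the temp file if anything failed before the move.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

With something like this, the index build and the fetch would share one download per object instead of two.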

Reproduce

  1. dvc import-url --version-aware s3://bucket/path test
  2. git commit test.dvc
  3. in a new copy of the git repo: git pull
  4. dvc fetch test

Expected

I expect the fetch to take about the same time as the original import-url.

Environment information

I've tried this on 3.19.0 and 3.22.0.

Output of dvc doctor:

DVC version: 3.19.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.2.0-32-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 2.16.1
	dvc_objects = 1.0.1
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.3.1
Supports:
	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.9.0, boto3 = 1.28.17)
Config:
	Global: /home/david/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/ac3ad7a1fbc9fd2c2a49af1dfba113c3

You are using dvc version 3.19.0; however, version 3.22.0 is available.
To upgrade, run 'pip install --upgrade dvc'.

Additional Information (if any):

d1sounds avatar Sep 22 '23 16:09 d1sounds

Hey @d1sounds, you pretty much got to the bottom of it. The main difference between dvc import-url and dvc fetch/pull could currently be summarized as: import-url is effectively 2 commands, download to your workspace and then dvc add the result. dvc fetch/pull, on the other hand, doesn't currently have a workspace to download into temporarily, so it has to virtually download files into content-based storage without knowing their md5s up front. We've introduced that temporary space for some edge cases already, but not yet for the scenario you've described, so stay tuned. I'll see if I can send something actionable soon or, if not, I'll link an issue here later to track the progress.
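
The asymmetry described above can be sketched as two flows. The function names are purely illustrative stand-ins, not DVC's real code path:

```python
def import_url_flow(download, add):
    # 1) download straight into the workspace...
    path = download("s3://bucket/path", dest="workspace/")
    # 2) ...then hash and move into the cache (dvc add): one read of
    # each file, and that read comes from local disk.
    return add(path)

def fetch_flow(stream, md5_of, store):
    # fetch/pull has no workspace copy, so each object is streamed
    # once to learn its md5 (the index build) and then streamed again
    # to store it content-addressed: two remote reads per object.
    digest = md5_of(stream())   # first pass: hash only, bytes discarded
    store(digest, stream())     # second pass: the actual download
    return digest
```

A tiny driver makes the cost visible: with stub callbacks, `fetch_flow` invokes `stream` twice per object, which is exactly the redundant download from point 2 of the report.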

efiop avatar Sep 23 '23 23:09 efiop

Would it work to fetch the md5 sums in parallel, rather than linearly?

d1sounds avatar Sep 24 '23 00:09 d1sounds

@d1sounds Sure, but addressing the redundant downloads will likely make a bigger difference. The two aren't mutually exclusive, though; we don't parallelize hashing locally either (it's in our plans).
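
For reference, the parallel variant of the hashing loop is a small change. In this hedged sketch, `read_object` is a placeholder for whatever streams one remote object as byte chunks, not a real DVC or s3fs API:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def _md5_of(read_object, key):
    # Hash one object by consuming its chunk iterator.
    md5 = hashlib.md5()
    for chunk in read_object(key):
        md5.update(chunk)
    return key, md5.hexdigest()

def hash_all(read_object, keys, jobs=8):
    """Hash many remote objects concurrently. The work is I/O-bound,
    so a thread pool overlaps the network waits that a linear
    per-file loop spends idle."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(pool.map(lambda key: _md5_of(read_object, key), keys))
```

Even so, as noted above, this only speeds up the first pass; avoiding the second download entirely is the bigger win.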

efiop avatar Sep 24 '23 11:09 efiop

> 2. It's unfortunate that the full bucket is being fetched

What about this part @efiop?

dberenbaum avatar Sep 25 '23 12:09 dberenbaum

@dberenbaum I assumed @d1sounds just means that we stream all the files (not literally everything in the bucket, unless the bucket only contains this project stored at the root). Maybe @d1sounds could clarify.

efiop avatar Sep 25 '23 18:09 efiop

Yes, just the files. But in my case I'm using all the files in the bucket.

d1sounds avatar Sep 26 '23 02:09 d1sounds

Not having a progress indicator for the md5 indexing (even with -vv) is really bad UX, though, and it doesn't depend on the specifics of the fetching algorithm.

mbergal avatar Nov 06 '23 13:11 mbergal