dvc repro: Rebuilds same tree unnecessarily

Given a directory tracked with dvc add/dvc import and a dvc.yaml whose stages have that directory as a dependency:

$ cat data.dvc
outs:
- md5: 6f68a8a747e41c152e7cc5fc62437727.dir
  size: 2890
  nfiles: 1000
  path: data
$ cat dvc.yaml
stages:
  foo:
    cmd: echo foo
    deps:
    - data
  bar:
    cmd: echo bar
    deps:
    - data

During a dvc repro execution, the same tree for the .dir is built (_build_tree) multiple times:

- changed_outs for data.dvc. Unless I am missing something, this is the only place where we should really call _build_tree and cache the result.
- (for each stage) changed_deps.
- (for each stage) save_deps as part of _run_stage.
- (for each stage) save_deps as part of save. I don't really know why we need to call save_deps twice inside stage.run.

So, in total there are 3 unnecessary (IMO) calls to _build_tree for each stage.
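To illustrate the kind of caching I have in mind, here is a minimal sketch. It is not DVC's code: TreeCache, simulate_repro and the plain os.walk listing are hypothetical stand-ins for _build_tree and the changed_outs/changed_deps/save_deps call sites, and it assumes the directory does not change during the run.

```python
import os
import tempfile


class TreeCache:
    """Memoizes a directory listing per absolute path for a single run.

    Hypothetical stand-in for caching the result of an expensive tree build.
    """

    def __init__(self):
        self._trees = {}
        self.builds = 0  # number of real directory walks performed

    def _build(self, path):
        # The expensive part: walk the whole directory.
        self.builds += 1
        entries = []
        for root, _dirs, files in os.walk(path):
            for name in files:
                full = os.path.join(root, name)
                entries.append(os.path.relpath(full, path))
        return tuple(sorted(entries))

    def get(self, path):
        # Only walk the directory the first time a path is requested.
        key = os.path.abspath(path)
        if key not in self._trees:
            self._trees[key] = self._build(key)
        return self._trees[key]


def simulate_repro(n_files=5, stages=("foo", "bar")):
    """Mimic the call pattern from the list above against a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "data")
        os.mkdir(data)
        for i in range(n_files):
            with open(os.path.join(data, f"{i}.txt"), "w") as f:
                f.write(str(i))

        cache = TreeCache()
        requests = 0
        cache.get(data)  # changed_outs for data.dvc
        requests += 1
        for _stage in stages:
            for _call in range(3):  # changed_deps, then save_deps twice
                cache.get(data)
                requests += 1
        return requests, cache.builds


if __name__ == "__main__":
    requests, builds = simulate_repro()
    print(f"tree requested {requests} times, built {builds} time(s)")
```

With two stages the tree is requested 7 times but walked once. A real implementation would also have to invalidate the cached entry whenever a stage writes to the directory, which this sketch ignores.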

For 100k dummy files, each of these _build_tree calls takes around 10s.

It feels like a significant overhead, especially considering that it grows with the number of files and the number of stages having them as deps.

I don't know whether this should be addressed in https://github.com/iterative/dvc-data or in DVC as part of pipeline management.

Feb 27 '23 15:02 daavoo

Related: https://discord.com/channels/485586884165107732/485596304961962003/1123890104742707231

Jun 29 '23 10:06 daavoo

Do we know whether it actually re-hashes each file every time? Or does it only look that way, because it iterates over the files in the dir but skips re-hashing when nothing has changed? I know it's slow either way, but I want to identify the true source of the problem. cc @iterative/dvc

Aug 23 '23 15:08 dberenbaum

@dberenbaum, it does not hash. It goes through the directory and looks in the state db to see whether we already have the hashes. It does that for each item, one by one, which is why it is slow.
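As a rough illustration of the access pattern only, the sketch below contrasts per-item lookups with chunked lookups against a throwaway sqlite3 table. The table layout and the names make_state, lookup_one_by_one and lookup_batched are made up for this example and are not DVC's state implementation; whether batching is feasible or helpful in DVC is an assumption.

```python
import sqlite3


def make_state(n):
    """Build a toy in-memory 'state' table mapping path -> hash (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE state (path TEXT PRIMARY KEY, md5 TEXT)")
    conn.executemany(
        "INSERT INTO state VALUES (?, ?)",
        ((f"data/{i}", f"hash{i}") for i in range(n)),
    )
    return conn


def lookup_one_by_one(conn, paths):
    """One query per path: the pattern described above."""
    found = {}
    for path in paths:
        row = conn.execute(
            "SELECT md5 FROM state WHERE path = ?", (path,)
        ).fetchone()
        if row is not None:
            found[path] = row[0]
    return found


def lookup_batched(conn, paths, chunk=500):
    """One query per chunk of paths, kept below SQLite's bound-variable limit."""
    found = {}
    for start in range(0, len(paths), chunk):
        part = paths[start:start + chunk]
        marks = ",".join("?" * len(part))
        rows = conn.execute(
            f"SELECT path, md5 FROM state WHERE path IN ({marks})", part
        )
        found.update(rows)  # each row is a (path, md5) pair
    return found


if __name__ == "__main__":
    conn = make_state(1200)
    paths = [f"data/{i}" for i in range(1200)] + ["data/missing"]
    assert lookup_one_by_one(conn, paths) == lookup_batched(conn, paths)
```

Both functions return the same mapping; the batched version just issues far fewer queries for the same set of paths.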

Aug 23 '23 15:08 skshetry

@dberenbaum, I also ran into the same slowness. I'm using git hooks, and it slows down every git commit even when nothing has changed in the DVC lock files.

Dec 12 '23 17:12 dbalabka
