repro: Rebuilds same tree unnecessarily
Given a directory tracked with dvc add/dvc import and a dvc.yaml with stages that have that directory as a dependency:
$ cat data.dvc
outs:
- md5: 6f68a8a747e41c152e7cc5fc62437727.dir
  size: 2890
  nfiles: 1000
  path: data
$ cat dvc.yaml
stages:
  foo:
    cmd: echo foo
    deps:
    - data
  bar:
    cmd: echo bar
    deps:
    - data
During a dvc repro execution, the same tree for the .dir is being built (_build_tree) multiple times during:
- changed_outs for data.dvc. Unless I am missing something, this is the only place where we should really call _build_tree and cache the result.
- (for each stage) changed_deps
- (for each stage) save_deps as part of _run_stage.
- (for each stage) save_deps as part of save. I don't really know why we need to call save_deps twice inside stage.run.
So, in total there are 3 unnecessary (IMO) calls to _build_tree for each stage.
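The "build once and cache the result" idea can be sketched roughly as follows. This is only an illustration, not DVC or dvc-data code: build_tree and TreeCache are hypothetical names, and the walk below merely stands in for whatever _build_tree really does. It also assumes the directory does not change while the cache is alive.

```python
import os
from typing import Callable, Dict, List, Tuple

# (relative path, size) pairs; a simplified stand-in for a real tree object.
Tree = List[Tuple[str, int]]


def build_tree(root: str) -> Tree:
    """Hypothetical stand-in for _build_tree: walk `root`, one entry per file."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(entries)


class TreeCache:
    """Build the tree for each directory at most once per run (hypothetical helper)."""

    def __init__(self, builder: Callable[[str], Tree] = build_tree) -> None:
        self._builder = builder
        self._trees: Dict[str, Tree] = {}
        self.builds = 0  # counts how many times the expensive walk actually ran

    def get(self, root: str) -> Tree:
        key = os.path.abspath(root)
        if key not in self._trees:
            self._trees[key] = self._builder(root)
            self.builds += 1
        return self._trees[key]
```

With something like this, the changed_outs check would pay for the walk, and the later changed_deps and save_deps calls for every stage would reuse the cached tree. A real implementation would still need to drop the cached entry whenever a stage writes to that directory.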
For 100k dummy files, each of these _build_tree calls takes around 10s.
It feels like a significant overhead, especially considering that it grows with the number of files and the number of stages having them as deps.
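As a rough back-of-the-envelope estimate using the numbers above (3 unnecessary calls per stage, around 10s per call for 100k files), and assuming the cost per call stays constant, the overhead scales linearly with the number of stages. The function name and its defaults are only illustrative:

```python
def unnecessary_tree_build_seconds(
    num_stages: int,
    extra_calls_per_stage: int = 3,   # the 3 unnecessary calls counted above
    seconds_per_call: float = 10.0,   # rough figure reported for 100k files
) -> float:
    """Estimate the time spent on redundant tree builds during one repro."""
    return num_stages * extra_calls_per_stage * seconds_per_call
```

For the two stages in the example (foo and bar) this comes to about 60 seconds of avoidable work per dvc repro; with 10 such stages it would be around 5 minutes.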
I don't know whether this should be addressed in https://github.com/iterative/dvc-data or in DVC as part of pipeline management.
Related: https://discord.com/channels/485586884165107732/485596304961962003/1123890104742707231
Do we know whether it actually re-hashes each file every time, or whether it only iterates over the files in the directory and skips re-hashing when nothing has changed? I know it's slow either way, but I want to identify the true source of the problem. cc @iterative/dvc
@dberenbaum, it does not re-hash. It goes through the directory and checks the state db to see whether we already have the hashes. It does that for each item, one by one, which is why it is slow.
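To illustrate why one-by-one lookups add up, here is a small sketch that contrasts per-item queries with chunked batch queries against an in-memory SQLite table. The table layout and function names are made up for the example and are not the actual state db schema or API:

```python
import sqlite3
from typing import Dict, List


def make_state_db(entries: Dict[str, str]) -> sqlite3.Connection:
    """Create a toy path -> hash table (hypothetical schema, not DVC's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE state (path TEXT PRIMARY KEY, hash TEXT)")
    conn.executemany("INSERT INTO state VALUES (?, ?)", entries.items())
    return conn


def lookup_one_by_one(conn: sqlite3.Connection, paths: List[str]) -> Dict[str, str]:
    """One query per path: simple, but the per-query overhead is paid N times."""
    found = {}
    for path in paths:
        row = conn.execute(
            "SELECT hash FROM state WHERE path = ?", (path,)
        ).fetchone()
        if row is not None:
            found[path] = row[0]
    return found


def lookup_batched(
    conn: sqlite3.Connection, paths: List[str], chunk_size: int = 500
) -> Dict[str, str]:
    """One query per chunk of paths; chunking keeps the parameter count bounded."""
    found: Dict[str, str] = {}
    for start in range(0, len(paths), chunk_size):
        chunk = paths[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT path, hash FROM state WHERE path IN ({placeholders})", chunk
        )
        found.update(rows)  # each row is a (path, hash) pair
    return found
```

Both functions return the same mapping; the batched version just issues far fewer queries. Whether a similar batching is possible in the real state db is an open question here.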
@dberenbaum, I also ran into the same slowness. I'm using git hooks, and it slows down each git commit even when there are no changes in the dvc lock files.