`dvc queue start` checks out more files than required
Bug Report / Feature Request
I have a project with a data directory (150 GB) containing 11 files. I added the entire directory using `dvc add data`.
In my workflow, each experiment I want to run depends on only a single file in the data directory.
Running an experiment through the DVC queue checks out the entire data directory.
It would be much faster if the command checked out only the data files that are actually required by the workflow, as defined in `dvc.yaml`.
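For illustration, a stage in such a setup might look like the sketch below. The stage name, script, and file names are hypothetical placeholders, not taken from the actual project. The only point is that the stage lists a single file inside the tracked `data` directory as a dependency.

```yaml
# Hypothetical dvc.yaml: stage, script, and file names are placeholders.
stages:
  experiment_a:
    cmd: python run_experiment.py data/file_a.h5
    deps:
      - run_experiment.py
      # A single file inside the directory tracked via `dvc add data`.
      - data/file_a.h5
    outs:
      - results/experiment_a.json
```

With a stage like this, only `data/file_a.h5` would need to be present in the experiment's temporary workspace, not the other ten files.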
Expected
The data directory in .dvc/tmp/exp/... would contain only the files specified as explicit dependencies in the dvc.yaml workflow file.
Environment information
Output of `dvc doctor`:
DVC version: 3.43.1 (pip)
-------------------------
Platform: Python 3.11.7 on Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 3.9.0
dvc_objects = 3.0.6
dvc_render = 1.0.1
dvc_task = 0.3.0
scmrepo = 2.0.2
Supports:
http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2024.2.0, boto3 = 1.34.34)
Config:
Global: /tikhome/fzills/.config/dvc
System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs on 129.69.120.13:/share/work_icp/fzills
Caches: local
Remotes: None
Workspace directory: nfs on 129.69.120.13:/share/work_icp/fzills
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/240bb452ebd33bc5c31f30d78040c7d2