run/repro: add --jobs for transferring
It seems that the number of threads used when computing hashes can be controlled through:

`dvc config core.checksum_jobs $n_jobs`

but there appears to be no equally easy way to do so for transferring (see `dvc/objects/transfer.transfer()`); it just uses the hardcoded value from `/opt/conda/lib/python3.7/site-packages/dvc/fs/base.py`.

This matters because I found that, for some obscure reason, multithreading dramatically slows down both hashing and transferring; fortunately, forcing the number of threads to 1 fixes the issue. Thanks.
Hey @ykacer, the number of threads used when pushing/pulling can be set with the `-j/--jobs` flag; only when it is not provided does it fall back to the default value for the filesystem in use.
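For example (assuming a default remote is already configured), the flag can be passed explicitly:

```shell
# Limit transfer parallelism to a single thread when pushing/pulling:
dvc push --jobs 1
dvc pull -j 1
```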
Transferring also occurs during dvc run and dvc repro and there is no --jobs.
I have the same problem on my setup: DVC is so slow during checksum computation and transfers that it becomes unusable. In the config file, setting `checksum_jobs` to `1` did the trick for checksums, but there is no such option for transfers.
https://github.com/iterative/dvc/blob/220c633497f07c0ad9af0786cb36f738ea18178d/dvc/data/transfer.py#L173
https://github.com/iterative/dvc/blob/dd5d999644dc053625214b828e62a229e3a19be8/dvc/fs/base.py#L53
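To illustrate the fallback behavior the two links point at, here is a minimal sketch of the pattern (hypothetical names, not DVC's actual code): an explicit `jobs` value wins, otherwise a hardcoded per-filesystem default is used, and there is currently no config knob feeding into it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-filesystem default that
# dvc/fs/base.py hardcodes; the real value depends on the filesystem.
DEFAULT_JOBS = 4

def resolve_jobs(jobs=None):
    """An explicit jobs value wins; otherwise fall back to the default."""
    return jobs if jobs is not None else DEFAULT_JOBS

def transfer(items, jobs=None):
    """Sketch of a transfer loop whose parallelism follows resolve_jobs()."""
    with ThreadPoolExecutor(max_workers=resolve_jobs(jobs)) as pool:
        # A real implementation would upload/download each item here;
        # this sketch just echoes the items back.
        return list(pool.map(lambda item: item, items))
```

Since `dvc run`/`dvc repro` never pass `jobs` down, the call always resolves to the hardcoded default.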
For now, the only workaround is to manually pull the data down first with `--jobs`.
Closing as stale.
I still have this problem, should I create a new issue?
@fguiotte Ah, sorry, I see that this issue is pretty clear now. Keeping open.
Looks like we just need to add a `--jobs` flag to those commands and pass it down. Though `--jobs` in repro/run might be understood by people as parallelization of stages rather than parallelization of transfers. Then again, for the former we'll need to invent a new name in the future anyway, since `--jobs` would already be taken.
Thank you for considering this issue :slightly_smiling_face:
Maybe a config option `core.transfer_jobs` (similar to the existing `core.checksum_jobs`) would be easier to add and would do just fine.
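If it went the config route, the option might look like this in `.dvc/config` (the `transfer_jobs` key is hypothetical, modeled on the existing `checksum_jobs`):

```ini
[core]
    checksum_jobs = 1
    # hypothetical new option, mirroring checksum_jobs:
    transfer_jobs = 1
```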