UI progress bar for computing hashes
Hi,
I regularly add large directories containing a very large number of files to DVC storage. Everything works fine, but I never know how long it will take. Most steps show their progress well, but for computing the file/dir hashes there is virtually no way to tell how far along it is.
My proposal: after an initial scan for files, we would know how many files need to be hashed. Just turn this line
Computing file/dir hashes (only done once) | 2.33M [42:05, 1.33kmd5/s]
into a real progress bar. It does not even have to show an ETA (although that would be nice); knowing the total number of files would already help a lot in estimating the overall progress :)
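To make the proposal concrete, here is a minimal sketch of the two-pass idea using only the Python standard library. It is not DVC code: `scan_files`, `md5_file`, and `hash_with_total` are hypothetical names made up for illustration, and the `report` callback stands in for whatever progress bar would actually be drawn.

```python
import hashlib
import os


def scan_files(root):
    """First pass: collect every file path under root (nothing is hashed yet)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def md5_file(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_with_total(root, report=print):
    """Second pass: hash each file and report 'done/total' after each one."""
    paths = scan_files(root)
    total = len(paths)  # known up front because of the first pass
    hashes = {}
    for done, path in enumerate(paths, start=1):
        hashes[path] = md5_file(path)
        report(f"Computing file/dir hashes: {done}/{total}")
    return hashes
```

Calling `hash_with_total("some/dir")` would print one `done/total` line per file. The cost is the extra walk before any hashing starts.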
Thanks a lot!
Note that DVC currently does not scan for files in advance. It builds a tree as it walks through them, so we don't know how many objects there are until that build has finished.
That happens here: https://github.com/iterative/dvc-data/blob/53473882c36bbba821931da0ca8bed8af1a51322/src/dvc_data/stage.py#L80-L105
@skshetry I see, that's a good point, thanks. Would scanning for files in advance slow things down? I guess it wouldn't (theoretically when the files are known, the hashing could also be parallelized I guess).
It might be slower for a very large directory. During the scan we would do nothing but list files, whereas we could hash in parallel, as you said, and walk at the same time.
As a CLI application, we should keep the runtime as low as possible; sitting idle is not ideal. We also need a good CLI experience, of course, so it's a tradeoff.
Note that the currently released version does parallelize the hashing, but we are refactoring that part and have removed the parallelism for now while we rethink our approach.
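As a rough illustration of walking and hashing at the same time, the sketch below feeds paths from a lazy walk into a thread pool, so workers start hashing while the walk is still discovering files. It is not how DVC implements it; `iter_files`, `md5_file`, and `hash_while_walking` are hypothetical names, and `max_workers=4` is an arbitrary choice.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def iter_files(root):
    """Lazily yield file paths as os.walk discovers them."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def md5_file(path, chunk_size=1024 * 1024):
    """Hash one file in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_while_walking(root, max_workers=4):
    """Submit each file for hashing as soon as the walk yields it."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submission happens while the generator is still walking, so
        # hashing overlaps with directory traversal.
        futures = {path: pool.submit(md5_file, path) for path in iter_files(root)}
        # Only at this point is the total number of files known.
        return {path: future.result() for path, future in futures.items()}
```

The tradeoff shows up in the last two lines: nothing sits idle, but `len(futures)` is only available once the walk is over, which is exactly why a total cannot be displayed from the start.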
https://github.com/iterative/dvc-data/blob/0.0.6/src/dvc_data/stage.py#L85-L114
Unlike Git, we currently mix index building and object building together, but if (or more likely when) we start building an index first, we will be able to provide the total during object building.
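A minimal sketch of what separating the two phases could look like, assuming the index records each file's size. This is an illustration only, not the dvc-data design: `build_index` and `build_objects` are hypothetical names, and the `report(done_bytes, total_bytes)` callback is an assumed interface. Tracking bytes rather than file counts tends to give a steadier estimate when file sizes vary a lot.

```python
import hashlib
import os


def build_index(root):
    """Index phase: walk once and record (path, size); nothing is hashed."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index.append((path, os.path.getsize(path)))
    return index


def build_objects(index, report=print, chunk_size=1024 * 1024):
    """Object phase: hash every indexed file, reporting bytes done vs. total."""
    total_bytes = sum(size for _path, size in index)  # known from the index
    done_bytes = 0
    hashes = {}
    for path, _size in index:
        digest = hashlib.md5()
        with open(path, "rb") as fobj:
            for chunk in iter(lambda: fobj.read(chunk_size), b""):
                digest.update(chunk)
                done_bytes += len(chunk)
        hashes[path] = digest.hexdigest()
        report(done_bytes, total_bytes)
    return hashes
```

Usage would be `build_objects(build_index("some/dir"))`, with `report` swapped for a function that updates a progress bar.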