
push/import: reconsider benchmarks using real data

pared opened this issue 4 years ago · 1 comment

Currently, in cases where we want to use "real" data we use the cats_dogs dataset, which is 25k images, ~800 MB altogether. Benchmarks using it tend to take a lot of time. I think it's worth considering providing a few smaller real-use-case datasets and running the benchmarks a few times, instead of running a single enormous benchmark once.

Also, that way we could test some edge cases. For example, MNIST can be downloaded as 4 files (the original format from its website), but we can convert it into an images folder. (In the image-folder case it's probably best to use only the validation dataset (10k files); altogether there are 60k files, which would prolong the benchmarks even though there is not much data byte-wise.)
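The conversion mentioned above could look something like the sketch below. It parses the idx3-ubyte image file format MNIST ships in (big-endian magic number and dimensions, then raw pixel bytes) and writes one image per sample. The function name is made up, and PGM output is chosen only because it needs no third-party imaging library; swapping in PIL to emit PNGs would be a one-line change.

```python
import pathlib
import struct


def idx_to_images(idx_path, out_dir, limit=None):
    """Split an MNIST idx3-ubyte file into one PGM image file per sample."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data = pathlib.Path(idx_path).read_bytes()
    # Header: magic number, sample count, rows, cols (all big-endian uint32).
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 0x00000803, "not an idx3-ubyte image file"
    if limit is not None:
        count = min(count, limit)  # e.g. cap at 10k to keep benchmarks short
    size = rows * cols
    for i in range(count):
        pixels = data[16 + i * size : 16 + (i + 1) * size]
        # Binary PGM: "P5", dimensions, max gray value, then raw bytes.
        header = f"P5 {cols} {rows} 255\n".encode()
        (out / f"{i:05d}.pgm").write_bytes(header + pixels)
    return count
```

The `limit` parameter is one way to realize the "use only part of the dataset" idea without shipping a separate truncated file.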

Why? If we get a regression, it takes some time to verify it on a personal computer. Smaller benchmarks and their regressions could be inspected faster.

If we decide to go this way, we should probably consider refactoring the push/import benchmarks to run locally, to avoid bandwidth variability influencing our benchmarks.
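One way to make push/import fully local is to point the benchmark repo at a directory remote, so the transfer never leaves the machine. A minimal sketch (the remote name and path are made up):

```shell
# Use a local directory as the default DVC remote so push/import
# benchmarks measure DVC itself rather than network bandwidth.
mkdir -p /tmp/dvc-bench-remote
dvc remote add -d benchremote /tmp/dvc-bench-remote
dvc push
```

This removes network variability entirely, at the cost of no longer exercising the S3 code path.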

pared · Jan 27 '21 12:01

Idea: maybe we should set up local S3 storage using MinIO, to cut data-transfer time.
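A sketch of that setup: run MinIO in Docker and register it as an S3-compatible DVC remote via `endpointurl`. The bucket name, remote name, and credentials below are placeholders, and the bucket would still need to be created (e.g. with MinIO's `mc` client) before the first push.

```shell
# Start a local MinIO server exposing an S3-compatible API on port 9000.
docker run -d -p 9000:9000 \
  -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data

# Point DVC at it instead of real S3.
dvc remote add -d benchremote s3://bench-bucket
dvc remote modify benchremote endpointurl http://localhost:9000
dvc remote modify benchremote access_key_id minioadmin
dvc remote modify benchremote secret_access_key minioadmin
```

Unlike a plain directory remote, this keeps the benchmarks on the S3 code path while still removing external bandwidth variability.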

pared · Mar 25 '21 11:03