add bigger data sizes

Open efiop opened this issue 3 years ago • 6 comments

E.g. a 1M-file dataset and a 10M-file dataset (maybe larger as well?) would be great to have.

efiop • Nov 29 '21 07:11

It seems to me that we could try obtaining ImageNet for this use case. It's a de facto standard dataset and could cover both needs: the whole dataset contains around 14M images, and the most commonly used subset has around 1.3M samples. The license can be found at https://image-net.org/download.php and it seems to me that benchmarking would fall into the research category. I haven't requested access yet because of the 6th point of the license.

pared • Dec 09 '21 10:12

Yeah, I'm a bit hesitant to use a third-party dataset like that. We could generate it ourselves, I suppose. Ideally with something that would make verifying integrity easy (that is not necessarily useful for benchmarks, but it would be in other tests).
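One way a self-generated dataset could make integrity checks easy is to derive each file's content from a seed and the file index, so the expected checksum can be recomputed at any time without storing a manifest. A minimal sketch under those assumptions; `file_bytes`, `generate_dataset`, `verify_dataset`, the `.bin` naming and the directory sharding are all hypothetical and not part of dvc-bench:

```python
import hashlib
import os
import random


def file_bytes(seed: int, index: int, size: int) -> bytes:
    """Return deterministic pseudo-random content for one file (size > 0)."""
    # String seeds are hashed by the random module, so the same
    # (seed, index) pair yields the same bytes on every run.
    rng = random.Random(f"{seed}-{index}")
    return rng.getrandbits(8 * size).to_bytes(size, "big")


def file_path(root: str, index: int, per_dir: int) -> str:
    """Shard files into subdirectories to avoid one huge flat directory."""
    return os.path.join(root, f"{index // per_dir:05d}", f"{index}.bin")


def generate_dataset(root, num_files, seed=0, size=1024, per_dir=1000):
    """Write num_files deterministic files under root."""
    for i in range(num_files):
        path = file_path(root, i, per_dir)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(file_bytes(seed, i, size))


def verify_dataset(root, num_files, seed=0, size=1024, per_dir=1000):
    """Return the indices of files that are missing or have wrong content."""
    bad = []
    for i in range(num_files):
        expected = hashlib.sha256(file_bytes(seed, i, size)).digest()
        try:
            with open(file_path(root, i, per_dir), "rb") as f:
                actual = hashlib.sha256(f.read()).digest()
        except FileNotFoundError:
            bad.append(i)
            continue
        if actual != expected:
            bad.append(i)
    return bad
```

Calling `generate_dataset("dataset", 1_000_000)` would produce the 1M-file case, and `verify_dataset` with the same arguments should return an empty list as long as nothing was lost or modified. The content is random bytes rather than images, which is enough for file-count benchmarks but tells users less than a familiar real dataset would.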

efiop • Dec 09 '21 12:12

It is probably better to just use https://pypi.org/project/Faker/ to generate the biggest dataset and then have small/tiny/etc options based on it, as we do now.

EDIT: on closer inspection, it requires us to set certain parameters, which means we need to know what we are doing 😄 So maybe a real one is more reasonable, if we can settle the license stuff. At least using MNIST actually tells our users something, as they've probably used it at some point, so they have a pretty good understanding of how long it usually takes to do stuff with it.

efiop • Dec 12 '21 16:12

https://storage.googleapis.com/openimages/web/download.html ?

I'm missing some context, but why not just generate X images with random pixels? Like:

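import os

import numpy as np
from PIL import Image

NUM_IMAGES = 1_000_000  # must be an int: range() rejects floats such as 1e6
os.makedirs("dataset", exist_ok=True)  # img.save() does not create directories
for i in range(NUM_IMAGES):
    # 100x100 RGB image with uniformly random pixel values in 0..255
    array = np.random.rand(100, 100, 3) * 255
    img = Image.fromarray(array.astype(np.uint8))
    img.save(f"dataset/{i}.jpg")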

daavoo • Dec 13 '21 14:12

For the record: added MNIST (70K dataset), but some other, bigger, buzzwordy dataset would be nice in the future.

efiop • May 18 '22 13:05

It would also be nice to have a dataset with big individual files.
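Such a dataset could be produced by streaming random bytes to disk in chunks, so memory use stays flat no matter how large each file is. A rough sketch; `write_big_file` and the sizes in the usage note are illustrative assumptions, not existing dvc-bench code:

```python
import os


def write_big_file(path: str, size_bytes: int, chunk_size: int = 1 << 20) -> None:
    """Write size_bytes of random data to path, one chunk at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            # Random bytes do not compress or deduplicate, so each file
            # costs its full size in storage and transfer.
            f.write(os.urandom(n))
            remaining -= n
```

For example, calling `write_big_file(f"big/{i}.bin", 5 * 1024**3)` for a handful of values of `i` (after creating the `big` directory) would give a few 5 GiB files. `os.urandom` is not reproducible, so this variant suits benchmarks better than tests that need to verify content.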

daavoo • Jul 07 '22 20:07