dfdx

Datasets/iterator mega issue

Open coreylowman opened this issue 1 year ago • 4 comments

dfdx should have a set of easy to use datasets built in (with an optional feature). Things like:

  • mnist
  • cifar10
  • coco (maybe?)
  • imagenet/imagenette
  • some language datasets

This also encompasses things like DataLoader & dataset iterators. It's still an open question whether a dataloader is needed, or whether that would be covered by some BatchIterator<T> type.
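
To make the BatchIterator<T> idea concrete, here is a minimal sketch of what such a type might look like. It is not part of dfdx: the struct name, the const generic batch size `B`, and the choice to drop a trailing partial batch are all assumptions made for illustration, and it does no shuffling.

```rust
/// Hypothetical batch iterator (not a dfdx type): yields fixed-size arrays
/// of `B` items from a slice, in order, without shuffling.
pub struct BatchIterator<'a, T, const B: usize> {
    data: &'a [T],
    pos: usize,
}

impl<'a, T, const B: usize> BatchIterator<'a, T, B> {
    pub fn new(data: &'a [T]) -> Self {
        Self { data, pos: 0 }
    }
}

impl<'a, T: Copy, const B: usize> Iterator for BatchIterator<'a, T, B> {
    type Item = [T; B];

    fn next(&mut self) -> Option<Self::Item> {
        // A zero-sized batch would never advance, so stop immediately.
        if B == 0 {
            return None;
        }
        // Assumption: a trailing partial batch is dropped, so every
        // yielded array has exactly `B` elements.
        let end = self.pos.checked_add(B)?;
        let chunk = self.data.get(self.pos..end)?;
        self.pos = end;
        // `chunk` has length `B`, so indexing 0..B is in bounds.
        Some(std::array::from_fn(|i| chunk[i]))
    }
}
```

Because the batch size is a const generic, each batch is a plain `[T; B]`, which is the kind of compile-time shape information a separate dataloader type would otherwise have to carry.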

coreylowman avatar Aug 21 '22 14:08 coreylowman

Whatever the dataset iterator api is, it should also enable things like torchvision.transforms. E.g. I should be able to do something like:

let transforms = (CenterCrop(10, 10), Normalize(...));
for (x, lbl) in dataset.iter_batches::<64>(&mut rng).map(transforms) {
    ...
}
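
One way a tuple of transforms could compose is sketched below. Everything here is hypothetical: the `Transform` trait, and the `Scale` and `Shift` structs (simple stand-ins for something like Normalize), are not dfdx or torchvision APIs. The sketch only shows that a pair `(A, B)` can itself implement the trait by applying `A` and then `B`.

```rust
/// Hypothetical transform trait; dfdx does not define this.
pub trait Transform<In> {
    type Out;
    fn apply(&self, input: In) -> Self::Out;
}

/// Multiplies every element by a constant.
pub struct Scale(pub f32);

impl<const N: usize> Transform<[f32; N]> for Scale {
    type Out = [f32; N];
    fn apply(&self, input: [f32; N]) -> [f32; N] {
        input.map(|x| x * self.0)
    }
}

/// Adds a constant offset to every element.
pub struct Shift(pub f32);

impl<const N: usize> Transform<[f32; N]> for Shift {
    type Out = [f32; N];
    fn apply(&self, input: [f32; N]) -> [f32; N] {
        input.map(|x| x + self.0)
    }
}

/// A pair of transforms applies the first, then feeds its output to the second.
impl<In, A, B> Transform<In> for (A, B)
where
    A: Transform<In>,
    B: Transform<A::Out>,
{
    type Out = B::Out;
    fn apply(&self, input: In) -> B::Out {
        self.1.apply(self.0.apply(input))
    }
}
```

With this shape, a batch iterator could be adapted with `.map(|x| transforms.apply(x))`; passing the tuple directly to `.map(...)` as written above would need the tuple to be callable, which this sketch does not attempt.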

coreylowman avatar Aug 21 '22 15:08 coreylowman

Needs to account for things like the RedCaps dataset, which only contains URLs for images, and large datasets with many files and indices.

coreylowman avatar Sep 01 '22 20:09 coreylowman

First, just want to say that I'm really excited to see this crate coming along! It's really cool to start seeing convolutional networks, in particular, becoming available in Rust.

Second, obviously no pressure to do so, but I wanted to make you aware that I maintain an existing library for downloading and parsing the CIFAR-10 dataset into Vec or ndarray structs (including options for different versions for compatibility) that you might find useful. I'm also open to adding new features if you need something that's not available! https://github.com/quietlychris/cifar-ten (also on crates.io)

quietlychris avatar Sep 03 '22 13:09 quietlychris

Nice work @quietlychris, looks very useful! Will let you know if there are features that would be useful. Definitely need to think through a trait Dataset to make it easier for library devs to write training loops that are generic over kinds of datasets. Could easily see impl'ing that trait for external datasets like your cifar-ten library.

coreylowman avatar Sep 05 '22 16:09 coreylowman

I'll try to work on this; probably will start with an API that's very similar to PyTorch's.

cBournhonesque avatar Nov 01 '22 22:11 cBournhonesque

Cool, I think generic associated types will be useful for this:

trait Dataset {
    type Item;
    type BatchedItem<const BATCH_SIZE: usize>;
}
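
As a rough illustration of how the generic associated type could be used, the sketch below restates the trait and implements it for a toy in-memory dataset. The methods `len`, `get`, and `batch`, and the `InMemory` struct, are assumptions added for the example and are not from the original proposal. Generic associated types need Rust 1.65 or later.

```rust
/// The trait from the comment above, plus hypothetical accessor methods
/// (`len`, `get`, `batch`) added only for illustration.
pub trait Dataset {
    type Item;
    type BatchedItem<const BATCH_SIZE: usize>;

    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Option<Self::Item>;
    /// Returns the `BATCH_SIZE` items starting at `start`, or `None` if
    /// they would run past the end of the dataset.
    fn batch<const BATCH_SIZE: usize>(
        &self,
        start: usize,
    ) -> Option<Self::BatchedItem<BATCH_SIZE>>;
}

/// Toy in-memory dataset of (feature, label) pairs.
pub struct InMemory {
    pub samples: Vec<(f32, u8)>,
}

impl Dataset for InMemory {
    type Item = (f32, u8);
    // A batch is laid out as all features followed by all labels, so the
    // batch size appears in the type of each array.
    type BatchedItem<const BATCH_SIZE: usize> = ([f32; BATCH_SIZE], [u8; BATCH_SIZE]);

    fn len(&self) -> usize {
        self.samples.len()
    }

    fn get(&self, index: usize) -> Option<Self::Item> {
        self.samples.get(index).copied()
    }

    fn batch<const BATCH_SIZE: usize>(
        &self,
        start: usize,
    ) -> Option<Self::BatchedItem<BATCH_SIZE>> {
        let end = start.checked_add(BATCH_SIZE)?;
        let chunk = self.samples.get(start..end)?;
        // `chunk` has exactly BATCH_SIZE elements, so indexing is in bounds.
        let xs = std::array::from_fn(|i| chunk[i].0);
        let ys = std::array::from_fn(|i| chunk[i].1);
        Some((xs, ys))
    }
}
```

The point of the GAT here is that `BatchedItem<64>` and `BatchedItem<32>` are different types, so a training loop written against `Dataset` can keep the batch dimension known at compile time.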

coreylowman avatar Nov 02 '22 12:11 coreylowman