dfdx
Datasets/iterator mega issue
dfdx should have a set of easy to use datasets built in (with an optional feature). Things like:
- mnist
- cifar10
- coco (maybe?)
- imagenet/imagenette
- some language datasets
This also encompasses things like DataLoader & dataset iterators. It's still an open question whether a dataloader is needed, or whether that would be covered by some BatchIterator<T> type.
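For illustration, a BatchIterator<T> could be a thin adapter over any iterator that yields batches of a fixed size. A minimal sketch (hypothetical type, not an actual dfdx API):

```rust
// Hypothetical sketch of a BatchIterator<T>: wraps any iterator and
// yields Vec<T> batches of a fixed size (the last batch may be shorter).
struct BatchIterator<I: Iterator> {
    inner: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchIterator<I> {
    type Item = Vec<I::Item>;
    fn next(&mut self) -> Option<Self::Item> {
        // Pull up to batch_size items from the wrapped iterator.
        let batch: Vec<_> = self.inner.by_ref().take(self.batch_size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    }
}

fn main() {
    let it = BatchIterator { inner: 0..5, batch_size: 2 };
    let batches: Vec<Vec<i32>> = it.collect();
    assert_eq!(batches, vec![vec![0, 1], vec![2, 3], vec![4]]);
    println!("{:?}", batches);
}
```

This kind of adapter works over any source of items, which is one argument that a separate DataLoader type may not be needed.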
Whatever the dataset iterator api is, it should also enable things like torchvision.transforms. E.g. I should be able to do something like:
let transforms = (CenterCrop(10, 10), Normalize(...));
for (x, lbl) in dataset.iter_batches::<64>(&mut rng).map(transforms) {
...
}
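The tuple-of-transforms idea could work by implementing a transform trait for tuples. A toy sketch, assuming a hypothetical Transform trait and toy Scale/Shift transforms over Vec<f32> standing in for image tensors (none of this is dfdx's actual API):

```rust
// Hypothetical composable transforms, in the spirit of torchvision.transforms.
trait Transform<T> {
    fn apply(&self, x: T) -> T;
}

// A tuple of transforms applies each element in order.
impl<T, A: Transform<T>, B: Transform<T>> Transform<T> for (A, B) {
    fn apply(&self, x: T) -> T {
        self.1.apply(self.0.apply(x))
    }
}

// Toy transforms over Vec<f32>, standing in for image tensors.
struct Scale(f32);
impl Transform<Vec<f32>> for Scale {
    fn apply(&self, x: Vec<f32>) -> Vec<f32> {
        x.into_iter().map(|v| v * self.0).collect()
    }
}

struct Shift(f32);
impl Transform<Vec<f32>> for Shift {
    fn apply(&self, x: Vec<f32>) -> Vec<f32> {
        x.into_iter().map(|v| v + self.0).collect()
    }
}

fn main() {
    let transforms = (Scale(2.0), Shift(1.0));
    let out = transforms.apply(vec![1.0, 2.0]);
    assert_eq!(out, vec![3.0, 5.0]); // (1*2)+1, (2*2)+1
    println!("{:?}", out);
}
```

Larger tuples would need more impls (or a macro), but the composition idea is the same.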
Needs to account for things like the RedCaps dataset, which only contains URLs for images, and large datasets with many files and indices.
First, just want to say that I'm really excited to see this crate coming along! It's really cool to start seeing convolutional networks, in particular, becoming available in Rust.
Second, obviously no pressure to do so, but I wanted to make you aware that I maintain an existing library for downloading and parsing the CIFAR-10 dataset into Vec or ndarray structs (including options for different versions for compatibility) that you might find useful. I'm also open to adding new features if you need something that's not available! https://github.com/quietlychris/cifar-ten (also on crates.io)
Nice work @quietlychris, looks very useful! Will let you know if there's features that would be useful. Definitely need to think through a trait Dataset
to make writing training loops that are generic over kinds of datasets easier for library devs. Could easily see impl'ing that trait for external datasets like your cifar-ten library
I'll try to work on this; probably will start with an API that's very similar to pytorch
Cool, I think generic associated types will be useful for this:
trait Dataset {
type Item;
type BatchedItem<const BATCH_SIZE: usize>;
}
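A hedged sketch of how that GAT-based trait might be filled in, using a toy dataset with len()/get() and fixed-size arrays as the batched item (hypothetical names, not dfdx's actual trait):

```rust
// Hypothetical sketch: generic associated types (stable since Rust 1.65)
// tie the batched item type to a const batch size.
trait Dataset {
    type Item;
    type BatchedItem<const B: usize>;

    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Self::Item;
    fn batch<const B: usize>(&self, start: usize) -> Self::BatchedItem<B>;
}

// Toy dataset: item i is i squared; a batch is a fixed-size array.
struct Squares;

impl Dataset for Squares {
    type Item = f32;
    type BatchedItem<const B: usize> = [f32; B];

    fn len(&self) -> usize {
        100
    }
    fn get(&self, i: usize) -> f32 {
        (i * i) as f32
    }
    fn batch<const B: usize>(&self, start: usize) -> [f32; B] {
        // Fill the array from consecutive indices starting at `start`.
        core::array::from_fn(|k| self.get(start + k))
    }
}

fn main() {
    let b: [f32; 4] = Squares.batch::<4>(1);
    assert_eq!(b, [1.0, 4.0, 9.0, 16.0]);
}
```

The payoff is that the batch size can flow into the tensor's shape type, so a const-batched iterator can yield Tensor<(Const<B>, ...)> items.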
A sketch of what the frontend UX might look like:
for (img, label) in dataset.iter_shuffled(&dev).batches(10, false) {
// img: Tensor<(Dyn, ...), ...>
// label: Tensor<(Dyn, ...), ...>
...
}
for (img, label) in dataset.iter(&dev).const_batches::<10>() {
// img: Tensor<(Const<10>, ...), ...>
// label: Tensor<(Const<10>, ...), ...>
...
}
Still a question of how iter/iter_shuffled would work; dataset could just define len() and get(usize) functions like pytorch, but I think they ran into issues with that.
One direction that might be interesting is basing this on async iterators / streams...it would handle large datasets as well as the remote case and add minimal overhead for the local case. But I don't think anything else in the lib is async?
You could also base it on something like tower and have behavior like chunking and shuffling and fetching and such be layers one could compose.
async is an interesting idea for sure! But yeah, nothing is async explicitly right now. I have noticed that all the CUDA APIs are technically async, so maybe an async API would fit there.
I'll check out tower!
@coreylowman I built a library for this a while ago: https://github.com/Sidekick-AI/dataflow
It's able to statically define directed acyclic graphs that data flows through, and is lazily computed (batched of course) so it can handle massive datasets. I made it dataset / modality agnostic, so it has no MNIST or imagenet built in. Those would be built as Nodes that can be inserted into the pipeline.
Personally I think the job of a dataloading library is quite different from that of a deep learning library, and there's a pretty clean divide between them, so I think they should be kept separate. Libs like pytorch try to have a first-party implementation of dataloading, but it ends up not being flexible enough, so folks end up using third-party ones or rolling their own.
Any suggestions or PRs for Dataflow are welcome!
Just pushed an update to Dataflow that implements all functions and closures as Nodes! Meaning you could now build pipelines with pure closures in them:
let pipeline = RandomLoader::new(vec!["file1.txt".to_string(), "file2.txt".to_string()])
.map(|line| format!("Hello {}", line)) // Add hello to each line
.node(tokenizer) // Tokenize the lines
.node(Batch::new(64)); // Create batches of 64
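For readers unfamiliar with the pattern, a pipeline of composable nodes can be sketched in a few lines; the Node/Map/Chain names below are illustrative only, not dataflow's real API:

```rust
use std::marker::PhantomData;

// Minimal sketch of a dataflow-style pipeline: a Node transforms a
// Vec of inputs into a Vec of outputs, and nodes compose by chaining.
trait Node {
    type Input;
    type Output;
    fn process(&mut self, input: Vec<Self::Input>) -> Vec<Self::Output>;
}

// Lift a per-item closure into a Node (PhantomData pins I and O).
struct Map<F, I, O> {
    f: F,
    _marker: PhantomData<fn(I) -> O>,
}

fn map<F: FnMut(I) -> O, I, O>(f: F) -> Map<F, I, O> {
    Map { f, _marker: PhantomData }
}

impl<F: FnMut(I) -> O, I, O> Node for Map<F, I, O> {
    type Input = I;
    type Output = O;
    fn process(&mut self, input: Vec<I>) -> Vec<O> {
        input.into_iter().map(&mut self.f).collect()
    }
}

// Chain two nodes: the first node's output feeds the second.
struct Chain<A, B>(A, B);

impl<A: Node, B: Node<Input = A::Output>> Node for Chain<A, B> {
    type Input = A::Input;
    type Output = B::Output;
    fn process(&mut self, input: Vec<A::Input>) -> Vec<B::Output> {
        self.1.process(self.0.process(input))
    }
}

fn main() {
    // Mimic the snippet above: add a greeting, then take line lengths.
    let mut pipeline = Chain(
        map(|line: String| format!("Hello {}", line)),
        map(|s: String| s.len()),
    );
    let out = pipeline.process(vec!["a".to_string(), "bb".to_string()]);
    assert_eq!(out, vec![7, 8]); // "Hello a" = 7 chars, "Hello bb" = 8
    println!("{:?}", out);
}
```

Because each stage is just a type implementing one trait, the compiler checks that adjacent stages' input/output types line up, which is the main appeal of the statically-defined pipeline approach.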
As @coreylowman wrote in that example, a similar looking thing with dataflow would be:
let dataset = Dataloader::new(
RandomLoader::from_dir("imagenet_directory")
.map(convert_to_tensor) // Some function to convert each image to a tensor
.node(Shuffle::default())
.node(ArrayBatch::<BATCH_SIZE>::default())
);
for (image, target) in dataset.iter() {
// Image: Tensor<...>
// Target: Tensor<...>
}
Also, all NLP stuff is moved to a separate crate, dataset_nlp. This way the dataflow crate itself can stay minimal. Currently ~15 deps in total.
So datasets add a significant number of dependencies (mainly for downloading them). There's a trait ExactSizeDataset in core dfdx, and also some helpful dataset iteration traits like batching/collating. Additionally, moving forward, application-specific training code/models/datasets will be in separate crates (e.g. the https://github.com/coreylowman/image-classification repo will contain a bunch of datasets; it already includes mnist/cifar for now).
All that said, I'm considering this issue done for now. If in the future there are asks to add datasets into core dfdx, I'm definitely open to it.