dfdx
Datasets/iterator mega issue
dfdx should have a set of easy to use datasets built in (with an optional feature). Things like:
- mnist
- cifar10
- coco (maybe?)
- imagenet/imagenette
- some language datasets
This also encompasses things like DataLoader & dataset iterators. It's still an open question whether a dataloader is needed, or whether that would be covered by some BatchIterator<T> type.
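For illustration, a BatchIterator<T> could be a thin adapter over any iterator that yields batches of a fixed size. A minimal sketch (hypothetical type, not an actual dfdx API):

```rust
// Hypothetical sketch of a BatchIterator<T>: wraps any iterator and
// yields Vec<T> batches of a fixed size (the last batch may be shorter).
struct BatchIterator<I: Iterator> {
    inner: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchIterator<I> {
    type Item = Vec<I::Item>;
    fn next(&mut self) -> Option<Self::Item> {
        // Pull up to batch_size items from the wrapped iterator.
        let batch: Vec<_> = self.inner.by_ref().take(self.batch_size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    }
}

fn main() {
    let it = BatchIterator { inner: 0..5, batch_size: 2 };
    let batches: Vec<Vec<i32>> = it.collect();
    assert_eq!(batches, vec![vec![0, 1], vec![2, 3], vec![4]]);
    println!("{:?}", batches);
}
```

This kind of adapter works over any source of items, which is one argument that a separate DataLoader type may not be needed.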
Whatever the dataset iterator api is, it should also enable things like torchvision.transforms. E.g. I should be able to do something like:
let transforms = (CenterCrop(10, 10), Normalize(...));
for (x, lbl) in dataset.iter_batches::<64>(&mut rng).map(transforms) {
...
}
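The tuple-of-transforms idea could work by implementing a transform trait for tuples. A toy sketch, assuming a hypothetical Transform trait and toy Scale/Shift transforms over Vec<f32> standing in for image tensors (none of this is dfdx's actual API):

```rust
// Hypothetical composable transforms, in the spirit of torchvision.transforms.
trait Transform<T> {
    fn apply(&self, x: T) -> T;
}

// A tuple of transforms applies each element in order.
impl<T, A: Transform<T>, B: Transform<T>> Transform<T> for (A, B) {
    fn apply(&self, x: T) -> T {
        self.1.apply(self.0.apply(x))
    }
}

// Toy transforms over Vec<f32>, standing in for image tensors.
struct Scale(f32);
impl Transform<Vec<f32>> for Scale {
    fn apply(&self, x: Vec<f32>) -> Vec<f32> {
        x.into_iter().map(|v| v * self.0).collect()
    }
}

struct Shift(f32);
impl Transform<Vec<f32>> for Shift {
    fn apply(&self, x: Vec<f32>) -> Vec<f32> {
        x.into_iter().map(|v| v + self.0).collect()
    }
}

fn main() {
    let transforms = (Scale(2.0), Shift(1.0));
    let out = transforms.apply(vec![1.0, 2.0]);
    assert_eq!(out, vec![3.0, 5.0]); // (1*2)+1, (2*2)+1
    println!("{:?}", out);
}
```

Larger tuples would need more impls (or a macro), but the composition idea is the same.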
Needs to account for things like the RedCaps dataset, which only contains URLs for images, and large datasets with many files and indices.
First, just want to say that I'm really excited to see this crate coming along! It's really cool to start seeing convolutional networks, in particular, becoming available in Rust.
Second, obviously no pressure to do so, but I wanted to make you aware that I maintain an existing library for downloading and parsing the CIFAR-10 dataset into Vec or ndarray structs (including options for different versions for compatibility) that you might find useful. I'm also open to adding new features if you need something that's not available! https://github.com/quietlychris/cifar-ten (also on crates.io)
Nice work @quietlychris, looks very useful! Will let you know if there's features that would be useful. Definitely need to think through a trait Dataset
to make writing training loops that are generic over kinds of datasets easier for library devs. Could easily see impl'ing that trait for external datasets like your cifar-ten library
I'll try to work on this; probably will start with an API that's very similar to pytorch
Cool, I think generic associated types will be useful for this:
trait Dataset {
type Item;
type BatchedItem<const BATCH_SIZE: usize>;
}
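A hedged sketch of how that GAT-based trait might be filled in, using a toy dataset with len()/get() and fixed-size arrays as the batched item (hypothetical names, not dfdx's actual trait):

```rust
// Hypothetical sketch: generic associated types (stable since Rust 1.65)
// tie the batched item type to a const batch size.
trait Dataset {
    type Item;
    type BatchedItem<const B: usize>;

    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Self::Item;
    fn batch<const B: usize>(&self, start: usize) -> Self::BatchedItem<B>;
}

// Toy dataset: item i is i squared; a batch is a fixed-size array.
struct Squares;

impl Dataset for Squares {
    type Item = f32;
    type BatchedItem<const B: usize> = [f32; B];

    fn len(&self) -> usize {
        100
    }
    fn get(&self, i: usize) -> f32 {
        (i * i) as f32
    }
    fn batch<const B: usize>(&self, start: usize) -> [f32; B] {
        // Fill the array from consecutive indices starting at `start`.
        core::array::from_fn(|k| self.get(start + k))
    }
}

fn main() {
    let b: [f32; 4] = Squares.batch::<4>(1);
    assert_eq!(b, [1.0, 4.0, 9.0, 16.0]);
}
```

The payoff is that the batch size can flow into the tensor's shape type, so a const-batched iterator can yield Tensor<(Const<B>, ...)> items.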
A sketch of what the frontend UX might look like:
for (img, label) in dataset.iter_shuffled(&dev).batches(10, false) {
// img: Tensor<(Dyn, ...), ...>
// label: Tensor<(Dyn, ...), ...>
...
}
for (img, label) in dataset.iter(&dev).const_batches::<10>() {
// img: Tensor<(Const<10>, ...), ...>
// label: Tensor<(Const<10>, ...), ...>
...
}
Still a question of how iter/iter_shuffled would work; dataset could just define len() and get(usize) functions like pytorch, but I think they ran into issues with that.
One direction that might be interesting is basing this on async iterators / streams...it would handle large datasets as well as the remote case and add minimal overhead for the local case. But I don't think anything else in the lib is async?
You could also base it on something like tower and have behavior like chunking and shuffling and fetching and such be layers one could compose.
async is an interesting idea for sure! But yeah, nothing is async explicitly right now. I have noticed that all the CUDA APIs are technically async, so maybe an async API would fit there.
I'll check out tower!
@coreylowman I built a library for this a while ago: https://github.com/Sidekick-AI/dataflow
It's able to statically define directed acyclic graphs that data flows through, and is lazily computed (batched of course) so it can handle massive datasets. I made it dataset / modality agnostic, so it has no MNIST or imagenet built in. Those would be built as Nodes that can be inserted into the pipeline.
Personally I think the job of a dataloading library is quite different from that of a deep learning library, and there's a pretty clean divide between them, so I think they should be kept separate. Libs like pytorch try to have a first-party implementation of dataloading, but it ends up not being flexible enough, so folks end up using third-party ones or rolling their own.
Any suggestions or PRs for Dataflow are welcome!
Just pushed an update to Dataflow that implements all functions and closures as Nodes! Meaning you could now build pipelines with pure closures in them:
let pipeline = RandomLoader::new(vec!["file1.txt".to_string(), "file2.txt".to_string()])
.map(|line| format!("Hello {}", line)) // Add hello to each line
.node(tokenizer) // Tokenize the lines
.node(Batch::new(64)); // Create batches of 64
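For readers unfamiliar with the pattern, a pipeline of composable nodes can be sketched in a few lines; the Node/Map/Chain names below are illustrative only, not dataflow's real API:

```rust
use std::marker::PhantomData;

// Minimal sketch of a dataflow-style pipeline: a Node transforms a
// Vec of inputs into a Vec of outputs, and nodes compose by chaining.
trait Node {
    type Input;
    type Output;
    fn process(&mut self, input: Vec<Self::Input>) -> Vec<Self::Output>;
}

// Lift a per-item closure into a Node (PhantomData pins I and O).
struct Map<F, I, O> {
    f: F,
    _marker: PhantomData<fn(I) -> O>,
}

fn map<F: FnMut(I) -> O, I, O>(f: F) -> Map<F, I, O> {
    Map { f, _marker: PhantomData }
}

impl<F: FnMut(I) -> O, I, O> Node for Map<F, I, O> {
    type Input = I;
    type Output = O;
    fn process(&mut self, input: Vec<I>) -> Vec<O> {
        input.into_iter().map(&mut self.f).collect()
    }
}

// Chain two nodes: the first node's output feeds the second.
struct Chain<A, B>(A, B);

impl<A: Node, B: Node<Input = A::Output>> Node for Chain<A, B> {
    type Input = A::Input;
    type Output = B::Output;
    fn process(&mut self, input: Vec<A::Input>) -> Vec<B::Output> {
        self.1.process(self.0.process(input))
    }
}

fn main() {
    // Mimic the snippet above: add a greeting, then take line lengths.
    let mut pipeline = Chain(
        map(|line: String| format!("Hello {}", line)),
        map(|s: String| s.len()),
    );
    let out = pipeline.process(vec!["a".to_string(), "bb".to_string()]);
    assert_eq!(out, vec![7, 8]); // "Hello a" = 7 chars, "Hello bb" = 8
    println!("{:?}", out);
}
```

Because each stage is just a type implementing one trait, the compiler checks that adjacent stages' input/output types line up, which is the main appeal of the statically-defined pipeline approach.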
As @coreylowman wrote in that example, a similar looking thing with dataflow would be:
let dataset = Dataloader::new(
RandomLoader::from_dir("imagenet_directory")
.map(convert_to_tensor) // Some function to convert each image to a tensor
.node(Shuffle::default())
.node(ArrayBatch::<BATCH_SIZE>::default())
);
for (image, target) in dataset.iter() {
// Image: Tensor<...>
// Target: Tensor<...>
}
Also, all NLP stuff is moved to a separate crate, dataset_nlp. This way the dataflow crate itself can stay minimal. Currently ~15 deps in total.
So datasets add a significant number of dependencies (mainly for downloading them). There's a trait ExactSizeDataset in core dfdx, and also some helpful dataset iteration traits like batching/collating. Additionally, moving forward, application-specific training code/models/datasets will be in separate crates (e.g. the https://github.com/coreylowman/image-classification repo will contain a bunch of datasets; it already includes mnist/cifar for now).
All that said, I'm considering this issue done for now. If in the future there are asks to add datasets into core dfdx, I'm definitely open to it.