
How to use tensorstore with PyTorch Dataloader

GxjGit opened this issue 3 years ago · 4 comments

How would I use tensorstore with PyTorch Dataloader? Any examples?

GxjGit avatar Sep 26 '22 09:09 GxjGit

Can you say a bit more about the input data, and how you want to load it?

I don't have example code, but one example application might be that you have a large 2-d or 3-d image dataset, with a corresponding dataset of the same size with per-pixel/per-voxel labels, and wish to train a convolutional model on randomly selected patches of a fixed size. In this case you might first generate the patch locations, and then use tensorstore to read each individual patch from both the image dataset and the label dataset. Enabling a cache_pool in tensorstore would likely be helpful for this use case.
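To make that concrete, here is a minimal sketch of such a patch-loading Dataset. The specs, paths, shapes, and patch size are all hypothetical; it assumes the image and label volumes were already written to zarr arrays that tensorstore can open.

```python
import numpy as np
import tensorstore as ts
import torch
from torch.utils.data import DataLoader, Dataset


class PatchDataset(Dataset):
    def __init__(self, image_spec, label_spec, patch_locations, patch_size=64):
        # A shared cache_pool keeps recently read chunks in memory, so patches
        # that fall in the same chunk do not hit storage again.
        context = ts.Context({'cache_pool': {'total_bytes_limit': 500_000_000}})
        self.image = ts.open(image_spec, read=True, context=context).result()
        self.label = ts.open(label_spec, read=True, context=context).result()
        self.locations = patch_locations  # list of (z, y, x) patch corners
        self.size = patch_size

    def __len__(self):
        return len(self.locations)

    def __getitem__(self, idx):
        z, y, x = self.locations[idx]
        s = self.size
        # Indexing a TensorStore is lazy; .read().result() returns a NumPy array.
        img = self.image[z:z + s, y:y + s, x:x + s].read().result()
        lbl = self.label[z:z + s, y:y + s, x:x + s].read().result()
        return torch.from_numpy(img), torch.from_numpy(lbl)


# Hypothetical specs; the driver and path depend on how the volumes were written.
image_spec = {'driver': 'zarr', 'kvstore': {'driver': 'file', 'path': '/data/image.zarr'}}
label_spec = {'driver': 'zarr', 'kvstore': {'driver': 'file', 'path': '/data/labels.zarr'}}
locations = [(0, 0, 0), (0, 64, 64)]  # would normally be sampled at random
loader = DataLoader(PatchDataset(image_spec, label_spec, locations), batch_size=2)
```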

jbms avatar Sep 26 '22 18:09 jbms

Thanks for your reply. For example, our training code is the same as https://github.com/pytorch/examples/blob/main/imagenet/main.py . The dataset is ImageNet: https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar . It is a 2-d image dataset with labels. The images are JPEG files of varying sizes. We use the PyTorch Dataloader to load the data.


The steps of the dataloader are as follows (a rough sketch of this pipeline is shown after the list):

  1. Read the JPEG file and its label into memory.
  2. Decode the JPEG file to get the RGB image.
  3. Preprocess the RGB image and scale it to a fixed-shape tensor (such as 224 x 224 x 3).
  4. The tensors of scaled images and labels are the output of the dataloader and the input to model training.
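Roughly, those steps correspond to the standard torchvision pipeline used in that ImageNet example (the dataset path, batch size, and worker count below are placeholders):

```python
import torch
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),  # step 3: crop/scale to a fixed 224 x 224 shape
    transforms.ToTensor(),              # convert the RGB image to a float tensor
])

# ImageFolder covers steps 1-2: it reads each JPEG, decodes it to an RGB image,
# and derives the label from the directory name.
train_set = datasets.ImageFolder('/data/imagenet/train', transform=preprocess)

# Step 4: the DataLoader batches the scaled image tensors and labels for training.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
```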

My question is: how do I adapt tensorstore to this scenario end to end? In detail:

  1. Tensorstore supports reading and writing multiple array formats, like zarr and N5, but my original dataset is in JPEG format. How should I convert my dataset into one of those array formats?
  2. In that case, what should be stored in the array? The images in the dataset have different sizes; do I need to resize them in advance?

In other words, for the scenario where the original dataset is ImageNet and the data is loaded with a dataloader, what specific steps are required to use tensorstore?

GxjGit avatar Sep 27 '22 09:09 GxjGit

You should open, decode and resize each image to the same shape, e.g. [512, 512, 3], then store all 1.2M images into a [1.2M, 512, 512, 3] TensorStore. Accessing that is then just a matter of writing a suitable Dataset class whose __getitem__ method selects the images.
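A rough sketch of that approach, with hypothetical paths and a one-image-per-chunk zarr layout; the resize target, the label array, and the chunking are assumptions, not something TensorStore prescribes:

```python
import glob
import numpy as np
import tensorstore as ts
import torch
from PIL import Image

jpeg_paths = sorted(glob.glob('/data/imagenet/train/*/*.JPEG'))  # hypothetical location
N = len(jpeg_paths)

# One-time conversion: create a [N, 512, 512, 3] uint8 zarr-backed TensorStore,
# chunked one image per chunk so each read below touches a single chunk.
store = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'file', 'path': '/data/imagenet_512.zarr'},
    'metadata': {'shape': [N, 512, 512, 3], 'chunks': [1, 512, 512, 3]},
    'dtype': 'uint8',
}, create=True).result()

for i, path in enumerate(jpeg_paths):
    img = Image.open(path).convert('RGB').resize((512, 512))
    store[i].write(np.asarray(img, dtype=np.uint8)).result()


class ImageNetTensorStore(torch.utils.data.Dataset):
    """Reads one pre-resized image per index; labels are kept in a plain array."""

    def __init__(self, spec, labels):
        self.store = ts.open(spec, read=True).result()
        self.labels = labels  # e.g. a NumPy array of N integer class ids

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        img = self.store[idx].read().result()          # (512, 512, 3) uint8
        img = torch.from_numpy(img).permute(2, 0, 1)   # to CHW for the model
        return img, int(self.labels[idx])
```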

Not sure how useful TensorStore is for image data though, since clearly there's no structure in the first (batch) dimension...

Probably something like Webdataset would be more useful for lots of differently sized images.

harpone avatar Sep 27 '22 13:09 harpone

Thanks @harpone. I have tried Webdataset and it does perform very well. I want to try TensorStore to see if it's better. Like you said, decoding, data augmentation and other transforms must be done and the results stored in TensorStore before training, so if the data augmentation method or the input size of the model changes, the dataset must be reprocessed.

GxjGit avatar Sep 28 '22 01:09 GxjGit