petastorm
Access a specific row in the dataframe
Hi guys
I was googling for a while but couldn't really find an answer to my question so I will post it here.
I am looking for a database format for a DL task with lots of videos. So petastorm looks interesting, since it allows encoding all the frames directly as jpeg images. However, I am unsure how to use it with a custom DataLoader in PyTorch. Instead of using a predefined ordering or iterating over all frames, I would rather query a sparse set of frames from a randomly selected video for each training sample in a batch, similar to a dictionary lookup.
All I could find is the Python API solution with the Reader object, which doesn't seem to support what I am looking for. So my question is: is there an efficient way to query and decode rows of a dataframe in petastorm for use in DL with PyTorch?
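To make the access pattern concrete, what I have in mind is roughly a map-style PyTorch dataset. Here is a minimal sketch in plain Python (the in-memory `store` dict and the sample layout are placeholders for whatever storage backend ends up being used, not petastorm API; the class just mirrors the `__len__`/`__getitem__` protocol of `torch.utils.data.Dataset` without importing torch):

```python
import random

# Stand-in "store": {video_id: [encoded frame bytes, ...]}.
# In practice this would be backed by the on-disk format being discussed.
store = {
    "clip_a": [b"frame-%d" % i for i in range(100)],
    "clip_b": [b"frame-%d" % i for i in range(100)],
}

class SparseFrameDataset:
    """Map-style dataset: each sample is a handful of frames from one clip.

    Mirrors the __len__/__getitem__ protocol of torch.utils.data.Dataset
    so only the dictionary-style lookup pattern is shown here.
    """

    def __init__(self, store, frames_per_sample=4, num_samples=1000, seed=0):
        self.store = store
        self.frames_per_sample = frames_per_sample
        self.num_samples = num_samples
        self.rng = random.Random(seed)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Randomly pick a clip, then a sparse set of frame indices from it.
        clip = self.rng.choice(sorted(self.store))
        frames = self.store[clip]
        picks = self.rng.sample(range(len(frames)), self.frames_per_sample)
        # Dictionary-style lookup: fetch exactly the requested frames.
        return [frames[i] for i in picks]

ds = SparseFrameDataset(store)
sample = ds[0]
```

The decoded frames would then be stacked into a batch by the DataLoader's collate step.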
Thanks in advance.
How long are your video clips? Do you want to read video segments from a video clip for your training? How many frames long are these? How many clips do you have in total? What is the typical resolution of your video?
I think there are many different ways to approach your problem. Not sure if parquet is the best approach in your case, but the devil is in the details.
So each video clip typically contains a few thousand frames, and I have around one to two thousand clips (per dataset). I want to be able to read single frames from video clips during training, specified by their frame number. Depending on the model, I might need consecutive frames or frames that are several frames apart to form a training episode. They are then stacked into a batch. Resolution varies but can be up to 720x1080.
I am using hdf5 at the moment. It is working, but I was hoping to find something faster...
In my opinion parquet (hence petastorm) might work, but you must be aware of the following challenges that you would have to solve:
- Random frame access + parquet does not work well, since you'll need to load entire row-groups (a row-group is the atomic reading unit). You are likely to end up having hundreds of images in one row-group (this is tunable).
- You would either need to throw away the frames that you are not interested in, or use a large in-memory shuffling buffer (perhaps tens of GBs for good decorrelation of samples?).
- You could use ngrams to get sequences of consecutive frames; however, I am not sure how you would get several arbitrary frames from the same clip with petastorm at a reasonable RAM footprint. Need to think about this more.
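To illustrate the shuffling-buffer idea from the second point above, here is a minimal sketch in plain Python (the buffer capacity is an arbitrary example value, and the integer stream stands in for rows decoded in row-group order; petastorm's own reader implements a similar mechanism internally):

```python
import random

def shuffling_buffer(stream, capacity, seed=0):
    """Yield items from `stream` in randomized order using a bounded buffer.

    Items enter in storage order (e.g. a whole row-group at a time); each
    yielded item is drawn uniformly from the current buffer contents, which
    breaks up the original ordering at the cost of holding up to `capacity`
    items in RAM.
    """
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= capacity:
            # Swap a random element to the end and yield it.
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]
            yield buf.pop()
    # Drain the remainder in random order.
    rng.shuffle(buf)
    yield from buf

# Rows arrive grouped (simulating row-group-at-a-time reads).
ordered = list(range(1000))
shuffled = list(shuffling_buffer(iter(ordered), capacity=100))
```

The larger the capacity, the better the decorrelation between neighboring samples, which is why the RAM estimate above gets big once each item is a decoded frame.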
Parquet is not really built with images as a datatype in mind. This fact may induce friction in your case that you won't be happy with.
I haven't worked much with HDF5 myself. On paper, it looks like a good fit for your scenario. What performance challenges have you encountered?
Thanks for the detailed comments. So it really seems that parquet is not suitable for me then. Hm.
The challenge I faced with HDF5 is big fluctuations in image access time: sometimes it is 10x slower, and I am not sure what is causing that. I stored the images in binary (encoded) format to keep the overall file size small; I guess this makes the indexing slower... I noticed similar issues with LMDB. So I thought maybe parquet could help, but it doesn't seem to be a better candidate. Thanks anyway!