Support for lists for supported encodings
🚀 Feature Request
The current supported encodings are listed here: https://github.com/mosaicml/streaming/blob/59f6ec5f8f97cc5f9a75954fef4bef3221460ff8/streaming/base/format/mds/encodings.py#L270
I would like to have support for lists, i.e. columns that are lists of integers, JPEGs, etc. These lists can be of arbitrary length.
Motivation
This is particularly relevant for interleaved datasets (think documents that are a sequence of either text or images).
Hi @VictorSanh, thanks for filing a feature request.
I would like to understand your request in a bit more detail. Are you trying to interleave/merge different sub-datasets (each containing multiple samples) based on some condition (ratios, probabilities, etc.) to get one unified dataset? For example:
from datasets import Dataset, interleave_datasets
ds1 = Dataset.from_dict({"data": [1, 2, 3]})  # each of 1, 2, and 3 represents an individual sample
ds2 = Dataset.from_dict({"data": [6, 7, 8, 9, 10]})  # each of 6, 7, 8, 9, and 10 represents an individual sample
dataset = interleave_datasets([ds1, ds2], probabilities=[0.3, 0.7])
Or does a single sample in your dataset contain a list of data, such as [text, text, text, ...], [image, image, image, ...], or [int, int, int, ...]?
That's a good question, I understand now why there is confusion 😅
It's the second!
The sample would be something like:
sample = {
    "images": [None, image1, None, None, image2, None],
    "texts": [text1, None, text2, text3, None, text4]
}
So in the end, the real document would be:
[text1, image1, text2, text3, image2, text4]
Let me know if it makes sense.
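To make the layout concrete, here is a minimal sketch of how the two aligned columns could be merged back into the real document. The `interleave` helper is hypothetical (it is not part of streaming), and strings stand in for the actual images and texts. It assumes exactly one of the two slots is filled at each position.

```python
from typing import Any, List, Optional


def interleave(images: List[Optional[Any]], texts: List[Optional[Any]]) -> List[Any]:
    """Merge two aligned columns into one sequence, one item per position."""
    if len(images) != len(texts):
        raise ValueError("columns must have the same length")
    document = []
    for image, text in zip(images, texts):
        # Assumption: exactly one of the two slots is filled at each position.
        if (image is None) == (text is None):
            raise ValueError("each position needs exactly one of image/text")
        document.append(image if image is not None else text)
    return document


# Placeholder strings stand in for real images and texts.
sample = {
    "images": [None, "image1", None, None, "image2", None],
    "texts": ["text1", None, "text2", "text3", None, "text4"],
}
document = interleave(sample["images"], sample["texts"])
```

Both columns are variable-length lists per sample, which is why a list-capable encoding is needed to store them.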
@VictorSanh Yes, this is possible; we just haven't prioritized this feature yet. Are you blocked by it, or do you have a workaround? Please also feel free to open a PR if you have an implementation in mind. Thank you!
Thanks for the answer, i'll see what i can do!
@VictorSanh Are you blocked by this feature, and how critical is it for you? Do you have a temporary workaround in mind for now?
i haven't got time to get to it, but was hoping to be able to hack something this week! (and yes, i kind of need it to make training work :) )
Chiming in here to +1 the feature request (hey @VictorSanh)! Having support for lists/timeseries is something that would be amazing for some ongoing projects we have at Stanford around imitation learning for robotics.
Similar to Victor's setup, we'd want to train on multiple demonstrations (samples) of the following form:
demonstration = {
    "states": [np.ndarray, np.ndarray, np.ndarray, ...],  # Can be variable length across samples
    "camera_1": [img1, img2, img3, ...],
    "camera_2": ...,
    "actions": [np.ndarray, ...],
}
Would love to know if there's a hack/workaround in the meantime! Thanks!
You could pickle them, but as I understand, pickle will encode the images as their CHW byte arrays, which would be rather wasteful for larger images.
Let me know if this is still unresolved and I will be happy to write a custom encode/decode for you.
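For reference, a rough sketch of what the pickle workaround amounts to, using only the standard library. The helper names `encode_list_column` and `decode_list_column` are hypothetical, and plain nested lists stand in for the arrays. I believe streaming's built-in `pkl` encoding does essentially the same thing, but treat that as an assumption.

```python
import pickle
from typing import Any, List


def encode_list_column(items: List[Any]) -> bytes:
    """Serialize a variable-length list into one bytes value."""
    return pickle.dumps(items, protocol=pickle.HIGHEST_PROTOCOL)


def decode_list_column(data: bytes) -> List[Any]:
    """Restore the list. Only unpickle data you wrote yourself."""
    return pickle.loads(data)


# Plain nested lists stand in for the per-step arrays; lengths differ per column.
demonstration = {
    "states": [[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]],
    "actions": [[1], [0]],
}
encoded = {key: encode_list_column(value) for key, value in demonstration.items()}
decoded = {key: decode_list_column(value) for key, value in encoded.items()}

# Size of one uncompressed 224x224 RGB frame: height * width * channels.
raw_rgb_bytes = 224 * 224 * 3  # 150528 bytes per frame before any compression
```

The last line shows why this gets expensive for images: every frame costs its full uncompressed pixel count, whereas a JPEG of the same frame is usually much smaller.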
@knighton +1 for the feature request! In my case, my dataset is also an interleaved text-and-image one, so a single sample may contain multiple images, like [img1, img2, ...].
Since streaming does not support lists, I have to store the image list in bytes/pickle, which is much larger than jpeg. Actually, I found it is almost 20x larger!
My dataset is quite big, so I don't have enough space to store the images as raw bytes. I would appreciate it if streaming could support lists, or if there were demo code showing how to handle this!
@knighton @karan6181 Any updates on this?
I'm posting my solution here for anyone who needs it.
In streaming, each JPEG is saved as bytes, as can be seen here:
https://github.com/mosaicml/streaming/blob/59f6ec5f8f97cc5f9a75954fef4bef3221460ff8/streaming/base/format/mds/encodings.py#L207-L223
Therefore, to save a list of JPEG images, we need to concatenate the bytes of all the images. To recover the list later, we also need to record the length of each image's bytes and prepend it to that image's bytes. The code is as follows:
from io import BytesIO
from typing import List

from PIL import Image
from streaming.base.format.mds.encodings import Encoding, _encodings


# Streaming data type for encoding and decoding a list of JPEG images
class JPEGList(Encoding):

    def encode(self, jpeg_list: List[Image.Image]) -> bytes:
        output = b''
        for jpeg in jpeg_list:
            o = BytesIO()
            jpeg.save(o, format='JPEG')
            byte = o.getvalue()
            # Prepend the length of the encoded image as a 2-byte big-endian integer
            leng = len(byte)
            leng = leng.to_bytes(2, byteorder='big')
            output += leng + byte
        return output

    def decode(self, data: bytes) -> List[Image.Image]:
        output = []
        while len(data) > 0:
            # Read the 2-byte length prefix, then slice out that many bytes
            leng = int.from_bytes(data[:2], byteorder='big')
            data = data[2:]
            byte = data[:leng]
            data = data[leng:]
            output.append(Image.open(BytesIO(byte)))
        return output


# Register the encoding
_encodings["jpeg_list"] = JPEGList
Note that this code can only store images whose encoded JPEG is shorter than 65,536 bytes, since the length prefix is written with leng.to_bytes(2, byteorder='big'), which raises OverflowError for anything larger. To store larger images, increase the prefix width in both encode and decode.
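If the 2-byte limit is a problem, the length-prefix framing can be pulled out of the image handling and widened. Below is a minimal sketch using a 4-byte prefix (up to 4 GiB per item) that works on plain bytes. `pack_blobs` and `unpack_blobs` are hypothetical helper names, not part of streaming, and this is not the code from the comment above.

```python
import struct
from typing import List

# 4-byte big-endian unsigned length prefix (an assumption; pick the width you need).
_PREFIX = struct.Struct(">I")


def pack_blobs(blobs: List[bytes]) -> bytes:
    """Concatenate blobs, each preceded by its length."""
    parts = []
    for blob in blobs:
        parts.append(_PREFIX.pack(len(blob)))
        parts.append(blob)
    # Joining once avoids repeatedly copying a growing bytes object.
    return b"".join(parts)


def unpack_blobs(data: bytes) -> List[bytes]:
    """Split a buffer produced by pack_blobs back into the original blobs."""
    blobs = []
    offset = 0
    while offset < len(data):
        if offset + _PREFIX.size > len(data):
            raise ValueError("truncated length prefix")
        (length,) = _PREFIX.unpack_from(data, offset)
        offset += _PREFIX.size
        end = offset + length
        if end > len(data):
            raise ValueError("truncated payload")
        blobs.append(data[offset:end])
        offset = end
    return blobs


# Round trip, including an empty item and one larger than the 2-byte limit.
items = [b"", b"abc", b"x" * 70000]
packed = pack_blobs(items)
restored = unpack_blobs(packed)
```

Inside a custom encoding, encode would call `pack_blobs` on the per-image JPEG bytes and decode would open each blob returned by `unpack_blobs` with PIL.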