
Support for lists for supported encodings

Open VictorSanh opened this issue 1 year ago • 11 comments

🚀 Feature Request

The current supported encodings are listed here: https://github.com/mosaicml/streaming/blob/59f6ec5f8f97cc5f9a75954fef4bef3221460ff8/streaming/base/format/mds/encodings.py#L270

I would like to have support for lists, i.e. columns that are lists of integers, JPEGs, etc. These lists can be of arbitrary length.

Motivation

This is particularly relevant for interleaved datasets (think documents that are a sequence of either text or images).

VictorSanh avatar Mar 24 '23 21:03 VictorSanh

Hi @VictorSanh, thanks for filing a feature request.

I would like to understand your request in a bit more detail. Are you trying to interleave/merge different sub-datasets (each containing multiple samples) based on some condition (ratios, probabilities, etc.) to get one unified dataset? For example:

from datasets import Dataset, interleave_datasets  # Hugging Face `datasets` library

ds1 = Dataset.from_dict({"data": [1, 2, 3]})  # each of 1, 2, and 3 is an individual sample
ds2 = Dataset.from_dict({"data": [6, 7, 8, 9, 10]})  # each of 6, 7, 8, 9, and 10 is an individual sample
dataset = interleave_datasets([ds1, ds2], probabilities=[0.3, 0.7])

Or does a single sample in your dataset contain a list of data, such as [text, text, text, ...], [image, image, image, ...], or [int, int, int, ...]?

karan6181 avatar Mar 24 '23 22:03 karan6181

That's a good question, I understand now why there is confusion 😅

It's the second!

The sample would be something like:

sample = {
    "images": [None, image1, None, None, image2, None],
    "texts": [text1, None, text2, text3, None, text4]
}

So in the end, the real document would be:

[text1, image1, text2, text3, image2, text4]
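The merge described above can be sketched as follows. This is only an illustration, assuming the two lists are aligned and exactly one of them holds a value at each position; `merge_interleaved` is a hypothetical helper name, not part of streaming, and strings stand in for real images and texts.

```python
from typing import Any, List, Optional, Sequence


def merge_interleaved(images: Sequence[Optional[Any]],
                      texts: Sequence[Optional[Any]]) -> List[Any]:
    """Rebuild the document from two aligned lists.

    Assumption: at every position exactly one of the two lists is not None.
    """
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")
    document = []
    for position, (image, text) in enumerate(zip(images, texts)):
        if (image is None) == (text is None):
            raise ValueError(
                f"position {position} must hold exactly one of image or text")
        document.append(image if image is not None else text)
    return document


# Placeholder strings stand in for the real image and text objects.
sample = {
    "images": [None, "image1", None, None, "image2", None],
    "texts": ["text1", None, "text2", "text3", None, "text4"],
}
document = merge_interleaved(sample["images"], sample["texts"])
```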

Let me know if it makes sense.

VictorSanh avatar Mar 24 '23 22:03 VictorSanh

@VictorSanh Yes, this is possible; we just haven't prioritized this feature yet. Are you blocked by this, or do you have a workaround? Please feel free to open a PR with this feature if you have an implementation in mind. Thank you!

karan6181 avatar Mar 28 '23 17:03 karan6181

Thanks for the answer, i'll see what i can do!

VictorSanh avatar Mar 30 '23 00:03 VictorSanh

@VictorSanh Are you blocked by this feature, and how critical is it for you? Do you have a temporary workaround in mind for now?

karan6181 avatar Mar 31 '23 15:03 karan6181

i haven't had time to get to it, but was hoping to be able to hack something together this week! (and yes, i kind of need it to make training work :) )

VictorSanh avatar Apr 11 '23 21:04 VictorSanh

Chiming in here to +1 the feature request (hey @VictorSanh)! Having support for lists/timeseries is something that would be amazing for some ongoing projects we have at Stanford around imitation learning for robotics.

Similar to Victor's setup, we'd want to train on multiple demonstrations (samples) of the following form:

demonstration = {
    "states": [np.ndarray, np.ndarray, np.ndarray, ...],   # Can be variable length across samples
    "camera_1": [img1, img2, img3, ...],
    "camera_2": ...,
    "actions": [np.ndarray, ...],
}

Would love to know if there's a hack/workaround in the meantime! Thanks!

siddk avatar Jul 21 '23 17:07 siddk

You could pickle them, but as I understand it, pickle will encode the images as their CHW byte arrays, which would be rather wasteful for larger images.
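One possible way to limit that cost, sketched here under assumptions: compress each image to JPEG bytes first (for example with Pillow's `Image.save` into a `BytesIO`), then pickle the list of byte strings rather than the image objects. The helper names below are hypothetical, the payloads are placeholders rather than real JPEG files, and `pickle.loads` should only be used on data you trust.

```python
import pickle
from typing import List


def encode_compressed_list(blobs: List[bytes]) -> bytes:
    """Pickle a list of already-compressed byte strings (e.g. JPEG files)."""
    for blob in blobs:
        if not isinstance(blob, bytes):
            raise TypeError("compress each image to bytes before pickling")
    return pickle.dumps(blobs, protocol=pickle.HIGHEST_PROTOCOL)


def decode_compressed_list(data: bytes) -> List[bytes]:
    """Inverse of encode_compressed_list; only call this on trusted data."""
    return pickle.loads(data)


# Placeholder payloads standing in for real JPEG file contents.
fake_jpegs = [
    b"\xff\xd8" + bytes(1000) + b"\xff\xd9",
    b"\xff\xd8" + bytes(50) + b"\xff\xd9",
]
encoded = encode_compressed_list(fake_jpegs)
restored = decode_compressed_list(encoded)
```

With this layout the pickled size stays close to the total size of the compressed payloads, since pickle only adds a small framing overhead around each byte string.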

Let me know if this is still unresolved and I will be happy to write a custom encode/decode for you.

knighton avatar Jul 25 '23 17:07 knighton

@knighton +1 for the feature request! In my case the dataset is also an interleaved text-and-image one, so a single sample may contain multiple images, like [img1, img2, ...].

Since streaming does not support lists, I have to store the image list as bytes/pickle, which is much larger than JPEG. In fact, I found it to be almost 20x larger!

My dataset is quite big, so there is not enough space to store the images as bytes. I would therefore appreciate it if streaming could support lists, or if there were demo code showing how to handle this!

szxiangjn avatar Jul 29 '23 03:07 szxiangjn

@knighton @karan6181 Any updates on this?

szxiangjn avatar Aug 07 '23 19:08 szxiangjn

I am posting my solution here for anyone who needs it.

In streaming, each JPEG is saved as bytes, as can be seen here:

https://github.com/mosaicml/streaming/blob/59f6ec5f8f97cc5f9a75954fef4bef3221460ff8/streaming/base/format/mds/encodings.py#L207-L223

Therefore, to save a list of JPEG images, we need to concatenate the bytes of all the images. To recover the list later, we also need to record the length of each image's bytes and prepend it to that image's bytes. The code is as follows:

from streaming.base.format.mds.encodings import Encoding, _encodings
from typing import List
from io import BytesIO
from PIL import Image

# Streaming data type for encoding and decoding list of JPEG images
class JPEGList(Encoding):

    def encode(self, jpeg_list: List):
        output = b''
        for jpeg in jpeg_list:
            o = BytesIO()
            jpeg.save(o, format='JPEG')
            byte = o.getvalue()

            # Prepend the image's byte length (2 bytes, big-endian) so that
            # decode() knows where each image ends
            leng = len(byte)
            leng = leng.to_bytes(2, byteorder='big')
            output += leng + byte
        return output
    
    def decode(self, data: bytes):
        output = []
        while len(data) > 0:
            leng = int.from_bytes(data[:2], byteorder='big')
            data = data[2:]
            byte = data[:leng]
            data = data[leng:]
            output.append(Image.open(BytesIO(byte)))
        return output

# Register the encoding
_encodings["jpeg_list"] = JPEGList

Note that the current code can only save images whose JPEG encoding is shorter than 65536 bytes, since I use leng.to_bytes(2, byteorder='big'); anything longer raises an OverflowError. If you want to save larger images, increase the length passed to to_bytes, and read the same number of bytes back in decode.
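For larger images, the same length-prefix framing can be sketched with a 4-byte prefix. This is a minimal illustration that works on raw byte strings only (the JPEG encoding and decoding steps are left out); `pack_blobs` and `unpack_blobs` are hypothetical names, not part of streaming.

```python
import struct
from typing import List

# 4-byte big-endian unsigned length: payloads up to 2**32 - 1 bytes each.
_LEN = struct.Struct(">I")


def pack_blobs(blobs: List[bytes]) -> bytes:
    """Concatenate byte strings, each preceded by its 4-byte length."""
    parts = []
    for blob in blobs:
        parts.append(_LEN.pack(len(blob)))
        parts.append(blob)
    return b"".join(parts)


def unpack_blobs(data: bytes) -> List[bytes]:
    """Split data produced by pack_blobs back into the original byte strings."""
    blobs = []
    offset = 0
    while offset < len(data):
        if offset + _LEN.size > len(data):
            raise ValueError("truncated length prefix")
        (length,) = _LEN.unpack_from(data, offset)
        offset += _LEN.size
        if offset + length > len(data):
            raise ValueError("truncated payload")
        blobs.append(data[offset:offset + length])
        offset += length
    return blobs
```

Joining the parts once at the end and walking the buffer with an offset avoids repeatedly copying the data, which the `output += ...` and `data = data[leng:]` pattern does for every image.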

szxiangjn avatar Aug 13 '23 23:08 szxiangjn