streaming Per-stream processing

🚀 Feature Request

When I use multiple Streams to create a StreamingDataset, I want to be able to use a different pre-processing function to process the data in each Stream. For example, Stream A needs special label masking while Stream B doesn't.

Motivation

This is commonly needed for multi-task training, for example, UL2 training. Currently, my workaround is to insert a task / source column into those streams and use my own StreamingDataset subclass to produce labels differently based on that column. However, this requires changes to the materialized datasets.
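For readers who want to see the workaround concretely, here is a minimal sketch of the dispatch step only. It assumes each sample is a dict carrying a `source` column; the column values, the `prompt_len` and `tokens` fields, and both transform functions are hypothetical names chosen for illustration. In practice `process` would be called from a custom StreamingDataset subclass on each returned sample.

```python
from typing import Any, Callable, Dict

Sample = Dict[str, Any]


def mask_prompt_labels(sample: Sample) -> Sample:
    """Hypothetical transform: hide the first `prompt_len` labels with -100."""
    out = dict(sample)
    n = out["prompt_len"]
    out["labels"] = [-100] * n + list(out["tokens"][n:])
    return out


def copy_labels(sample: Sample) -> Sample:
    """Hypothetical transform: use the tokens as labels, with no masking."""
    out = dict(sample)
    out["labels"] = list(out["tokens"])
    return out


# Maps the value of the (assumed) `source` column to a transform.
PROCESSORS: Dict[str, Callable[[Sample], Sample]] = {
    "stream_a": mask_prompt_labels,
    "stream_b": copy_labels,
}


def process(sample: Sample,
            processors: Dict[str, Callable[[Sample], Sample]] = PROCESSORS) -> Sample:
    """Pick the transform based on the sample's `source` column and apply it."""
    return processors[sample["source"]](sample)
```

The downside described above is visible here: the `source` column has to exist in the stored samples for the dispatch to work.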

Aug 21 '23 18:08 lorabit110

Hi @lorabit110, are you planning to apply a pre-processing function inside the __getitem__() method, something like this?

Aug 24 '23 22:08 karan6181

Yes. But in your example, it's a StreamingDataset-specific pre-processing function. What I need is to provide a Stream-specific pre-processing function. Or is there a way to create a mixture with multiple StreamingDatasets?

Aug 26 '23 19:08 lorabit110

@lorabit110, have you tried ChainDataset, where you can pass a sequence of StreamingDataset instances? You can have your own pre-processing logic per StreamingDataset class.
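A rough sketch of the chaining idea, using plain Python iterables and `itertools.chain` as a stand-in for `torch.utils.data.ChainDataset`, which likewise iterates its datasets one after another. The sample contents and the two pre-processing functions are hypothetical.

```python
from itertools import chain
from typing import Any, Callable, Dict, Iterable, Iterator

Sample = Dict[str, Any]


def preprocessed(samples: Iterable[Sample],
                 fn: Callable[[Sample], Sample]) -> Iterator[Sample]:
    """Yield each sample after applying a dataset-specific function."""
    for sample in samples:
        yield fn(sample)


def with_masking(sample: Sample) -> Sample:
    # Hypothetical pre-processing for the first dataset.
    return {**sample, "masked": True}


def without_masking(sample: Sample) -> Sample:
    # Hypothetical pre-processing for the second dataset.
    return {**sample, "masked": False}


dataset_a = [{"text": "a0"}, {"text": "a1"}]
dataset_b = [{"text": "b0"}]

# Datasets are consumed sequentially: all of A first, then all of B.
chained = list(chain(preprocessed(dataset_a, with_masking),
                     preprocessed(dataset_b, without_masking)))
```

Each dataset gets its own function, but the output order is simply dataset A followed by dataset B.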

Sep 11 '23 15:09 karan6181

@lorabit110, I am checking if you have had a chance to try the above solution.

Oct 05 '23 16:10 karan6181

Hey @karan6181 -- the ChainDataset solution means that I lose any proportional sampling behavior I'd get by loading multiple streams in a single StreamingDataset().

Is there no other way to apply a Stream-specific transform while keeping all the StreamingDataset() machinery?
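One way such a Stream-specific transform could look, sketched in plain Python under loud assumptions: it supposes the first global sample index of each stream is known (`stream_starts`, a hypothetical input; the thread does not say whether or how StreamingDataset exposes this), and that a transform is looked up by stream position. This is not an existing library feature.

```python
from bisect import bisect_right
from typing import Any, Callable, Dict, Sequence

Sample = Dict[str, Any]


def stream_of(sample_id: int, stream_starts: Sequence[int]) -> int:
    """Return the index of the stream that contains `sample_id`.

    `stream_starts` holds the first global sample index of each stream,
    in ascending order (assumed to be available from the dataset).
    """
    return bisect_right(stream_starts, sample_id) - 1


def apply_stream_transform(sample_id: int,
                           sample: Sample,
                           stream_starts: Sequence[int],
                           transforms: Sequence[Callable[[Sample], Sample]]) -> Sample:
    """Apply the transform registered for the sample's stream."""
    return transforms[stream_of(sample_id, stream_starts)](sample)
```

For example, with `stream_starts = [0, 100]`, sample 99 would be handled by the first transform and sample 100 by the second, without adding any column to the stored data.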

Apr 25 '24 15:04 siddk

Hey @siddk, there currently isn't a per-stream processing function, but it's something we can add in the future!

May 29 '24 19:05 snarayan21

@siddk / @lorabit110 I'm wondering if you're open to adding this feature to Streaming. We can help you out however we can. It would be cool to have this feature in Streaming.

Jul 23 '24 03:07 karan6181
