Per-stream processing

Open lorabit110 opened this issue 1 year ago • 7 comments

🚀 Feature Request

When I use multiple Streams to create a StreamingDataset, I want to be able to use a different pre-processing function to process the data in each Stream. For example, Stream A needs special label masking while Stream B doesn't.

Motivation

This is commonly needed for multi-task training, for example, UL2 training. Currently, my workaround is to add a task / source column to those streams and use my own StreamingDataset subclass to produce labels differently based on that column. However, this requires changes to the materialized datasets.
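
A minimal sketch of the workaround described above, in plain Python so it runs without the Streaming library. All names here (`apply_by_source`, `mask_prompt_labels`, `copy_labels`, the `source` and `prompt_len` columns, the `-100` ignore value) are hypothetical and chosen only for illustration; in practice the dispatch would be called from a custom StreamingDataset subclass when it returns a sample.

```python
from typing import Any, Callable, Dict

Sample = Dict[str, Any]
Transform = Callable[[Sample], Sample]

IGNORE_INDEX = -100  # assumed label value that the loss function skips


def mask_prompt_labels(sample: Sample) -> Sample:
    """Hypothetical transform for stream A: hide prompt tokens from the loss."""
    out = dict(sample)
    n = out["prompt_len"]
    out["labels"] = [IGNORE_INDEX] * n + list(out["input_ids"][n:])
    return out


def copy_labels(sample: Sample) -> Sample:
    """Hypothetical transform for stream B: labels are the inputs, unmasked."""
    out = dict(sample)
    out["labels"] = list(out["input_ids"])
    return out


# One transform per value of the extra task / source column.
TRANSFORMS: Dict[str, Transform] = {"A": mask_prompt_labels, "B": copy_labels}


def apply_by_source(sample: Sample,
                    transforms: Dict[str, Transform] = TRANSFORMS,
                    key: str = "source") -> Sample:
    """Pick the transform based on the sample's source column and apply it."""
    return transforms[sample[key]](sample)


sample_a = {"source": "A", "input_ids": [1, 2, 3, 4], "prompt_len": 2}
sample_b = {"source": "B", "input_ids": [5, 6, 7]}
print(apply_by_source(sample_a)["labels"])
print(apply_by_source(sample_b)["labels"])
```

The drawback raised in this request is visible here: the dispatch only works because every sample carries a `source` value, which is why the datasets have to be re-materialized with that column.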

lorabit110 avatar Aug 21 '23 18:08 lorabit110

Hi @lorabit110, are you planning to apply a pre-processing function inside the __getitem__() method, something like this?

karan6181 avatar Aug 24 '23 22:08 karan6181

Yes. But in your example, it's a StreamingDataset-specific pre-processing function. What I need is to provide a Stream-specific pre-processing function. Or is there a way to create a mixture with multiple StreamingDatasets?

lorabit110 avatar Aug 26 '23 19:08 lorabit110

@lorabit110, have you tried ChainDataset? You can pass it a sequence of StreamingDataset objects and implement your own pre-processing logic in each StreamingDataset class.

karan6181 avatar Sep 11 '23 15:09 karan6181

@lorabit110, I am checking if you have had a chance to try the above solution.

karan6181 avatar Oct 05 '23 16:10 karan6181

Hey @karan6181 -- the ChainDataset solution means that I lose any proportional sampling behavior I'd get by loading multiple streams in a single StreamingDataset().

Is there no other way to apply a Stream-specific transform while keeping all the StreamingDataset() machinery?
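
To illustrate the trade-off in this comment, here is a toy sketch in plain Python. It does not use the Streaming or PyTorch APIs; `chained` and `mixed` are hypothetical helpers, and the real proportional sampling in StreamingDataset is more involved than a weighted random draw. The point is only that chaining visits sources one after another, while proportional mixing interleaves them according to weights.

```python
import itertools
import random
from typing import Dict, Iterator, List


def chained(sources: Dict[str, List[str]]) -> Iterator[str]:
    """Yield every sample of each source in order, one source after another."""
    return itertools.chain.from_iterable(sources.values())


def mixed(sources: Dict[str, List[str]],
          proportions: Dict[str, float],
          num_samples: int,
          seed: int = 0) -> Iterator[str]:
    """Draw samples so that each source is chosen with its given proportion."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [proportions[name] for name in names]
    for _ in range(num_samples):
        name = rng.choices(names, weights=weights, k=1)[0]
        yield rng.choice(sources[name])


sources = {"A": ["a0", "a1"], "B": ["b0", "b1"]}

# Chaining: all of A, then all of B; the mix ratio is fixed by dataset sizes.
print(list(chained(sources)))

# Mixing: roughly 75% of draws come from A, regardless of dataset sizes.
draws = list(mixed(sources, {"A": 0.75, "B": 0.25}, num_samples=10_000))
print(sum(d.startswith("a") for d in draws) / len(draws))
```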

siddk avatar Apr 25 '24 15:04 siddk

Hey @siddk, there currently isn't a per-stream processing function, but it's something we can add in the future!

snarayan21 avatar May 29 '24 19:05 snarayan21

@siddk / @lorabit110, I'm wondering if you're open to adding this feature to Streaming. We'd be happy to help however we can. It would be cool to have this feature in Streaming.

karan6181 avatar Jul 23 '24 03:07 karan6181