Per-stream processing
🚀 Feature Request
When I use multiple Streams to create a StreamingDataset, I want to be able to use a different pre-processing function to process the data in each Stream. For example, Stream A needs special label masking while Stream B doesn't.
Motivation
This is commonly needed for multi-task training, for example UL2 training. Currently, my workaround is to add a task / source column to those streams and use my own StreamingDataset subclass that produces labels differently based on that column. However, this requires changes to the materialized datasets.
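To make the workaround concrete, here is a minimal sketch of dispatching on a source column. It is plain Python and does not use the streaming library; the column name `source`, the stream names, the `prompt_len` field, and the `-100` ignore value are all assumptions for illustration.

```python
from typing import Callable, Dict

# Assumed "ignore" label value, as commonly used with cross-entropy losses.
IGNORE_INDEX = -100

Sample = Dict[str, object]


def mask_prompt_labels(sample: Sample) -> Sample:
    """Hypothetical processing for stream A: hide prompt tokens from the loss."""
    tokens = list(sample["tokens"])
    prompt_len = int(sample["prompt_len"])
    labels = [IGNORE_INDEX] * prompt_len + tokens[prompt_len:]
    return {"tokens": tokens, "labels": labels}


def plain_labels(sample: Sample) -> Sample:
    """Hypothetical processing for stream B: labels are a copy of the tokens."""
    tokens = list(sample["tokens"])
    return {"tokens": tokens, "labels": list(tokens)}


# One processing function per value of the (assumed) "source" column.
PROCESSORS: Dict[str, Callable[[Sample], Sample]] = {
    "stream_a": mask_prompt_labels,
    "stream_b": plain_labels,
}


def process(sample: Sample) -> Sample:
    """Pick the processing function based on the sample's source column."""
    return PROCESSORS[str(sample["source"])](sample)
```

The drawback mentioned above is visible here: every sample has to carry the `source` value, so it must be written into the dataset when it is materialized.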
Hi @lorabit110, are you planning to apply a pre-processing function inside __getitem__(), something like this?
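The snippet from this comment is not preserved here. A rough sketch of the pattern being asked about, a dataset-wide transform applied in `__getitem__()`, could look like the following. `ToyDataset` is a stand-in base class so the example runs on its own; it is not the streaming library's API, and `PreprocessedDataset` and `transform` are hypothetical names.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


class ToyDataset:
    """Stand-in for a map-style dataset; NOT the real StreamingDataset."""

    def __init__(self, samples: List[Sample]) -> None:
        self._samples = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> Sample:
        return self._samples[idx]


class PreprocessedDataset(ToyDataset):
    """Applies one transform to every sample, regardless of its origin."""

    def __init__(self, samples: List[Sample],
                 transform: Callable[[Sample], Sample]) -> None:
        super().__init__(samples)
        self.transform = transform

    def __getitem__(self, idx: int) -> Sample:
        # Fetch the raw sample from the base class, then transform it.
        return self.transform(super().__getitem__(idx))
```

Note that the transform here is attached to the dataset as a whole, so every sample goes through the same function.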
Yes. But in your example, it's a StreamingDataset-specific pre-processing function. What I need is to provide a Stream-specific pre-processing function. Or is there a way to create a mixture with multiple StreamingDatasets?
@lorabit110, have you tried ChainDataset? You can pass it a sequence of StreamingDataset instances, and each StreamingDataset class can have its own pre-processing logic.
@lorabit110, I am checking if you have had a chance to try the above solution.
Hey @karan6181 -- the ChainDataset solution means that I lose any proportional sampling behavior I'd get by loading multiple streams in a single StreamingDataset().
Is there no other way to apply a Stream-specific transform while keeping all the StreamingDataset() machinery?
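The trade-off described in this comment can be illustrated with a small, library-free sketch. It is only a toy model: chaining is approximated with `itertools.chain`, and proportional sampling with weighted random draws. It does not reproduce how StreamingDataset actually mixes streams, and all names in it are hypothetical.

```python
import itertools
import random
from typing import Dict, List


def chained(streams: Dict[str, List[str]]) -> List[str]:
    """Chain-style iteration: exhaust each stream in order, one after another."""
    return list(itertools.chain.from_iterable(streams.values()))


def proportional(streams: Dict[str, List[str]], weights: Dict[str, float],
                 num_samples: int, seed: int = 0) -> List[str]:
    """Toy proportional sampling: pick a stream by weight, then a sample from it."""
    rng = random.Random(seed)
    names = list(streams)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=num_samples)
    return [rng.choice(streams[name]) for name in picks]
```

With chaining, the order and amount of data seen from each stream is fixed by the stream sizes; with weighted draws, the mix is controlled by the weights instead.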
Hey @siddk, there currently isn't a per-stream processing function, but it's something we can add in the future!
@siddk / @lorabit110, I'm wondering if you're open to contributing this feature to Streaming. We can help you out however we can. It would be cool to have this feature in Streaming.