Allow a StreamingDataset to wrap around when running in a CombinedStreamingDataset
🚀 Feature
Consider adding the ability for a StreamingDataset to wrap around, rather than raising StopIteration, when it is exhausted while being combined with other datasets.
This is something we haven't ported over from PackedDataset: https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/packed_dataset.py#L190
Motivation
This is useful when combining a small dataset with a very large one: it lets us ensure that samples from the small dataset reach the training process frequently enough, without cutting the epoch short when the other datasets are billions of tokens in size.
Pitch
Add a wrap property to StreamingDataset so that, instead of raising StopIteration when the data is exhausted, it keeps looping through the data from the beginning.
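A minimal sketch of what this could look like; all names below (WrapIterator, the wrap flag, the _epoch counter) are hypothetical and not part of the current StreamingDataset implementation:

```python
from typing import Any, Iterator


class WrapIterator:
    """Hypothetical helper showing the pitched `wrap` behavior: restart the
    underlying dataset instead of propagating its StopIteration."""

    def __init__(self, dataset: Any, wrap: bool = True) -> None:
        self._dataset = dataset
        self._wrap = wrap
        self._epoch = 0  # open question: should wrapping advance the epoch?
        self._it: Iterator = iter(self._dataset)

    def __iter__(self) -> "WrapIterator":
        return self

    def __next__(self) -> Any:
        try:
            return next(self._it)
        except StopIteration:
            if not self._wrap:
                raise  # current behavior: the epoch ends here
            # wrap=True: loop back to the start and keep yielding samples
            self._epoch += 1
            self._it = iter(self._dataset)
            return next(self._it)
```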
Alternatives
Handle this at the CombinedStreamingDataset level instead, so that each dataset raises StopIteration when it must, but without ending iteration over the others. In both cases we need to decide what happens to the epoch counter within each dataset.
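A sketch of this alternative, under one possible epoch semantics: the combined epoch ends only once every constituent dataset has been fully consumed at least once. The class name and the uniform sampling scheme are illustrative, not litdata's actual implementation:

```python
import random
from typing import Any, Iterable, Iterator, List


class CombinedWrapSketch:
    """Illustrative combiner: smaller datasets wrap around transparently
    while larger ones keep their position (assumes non-empty datasets)."""

    def __init__(self, datasets: List[Iterable], seed: int = 42) -> None:
        self._datasets = datasets
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator[Any]:
        iterators = [iter(d) for d in self._datasets]
        passes = [0] * len(self._datasets)  # completed passes per dataset
        while True:
            i = self._rng.randrange(len(iterators))
            try:
                yield next(iterators[i])
            except StopIteration:
                passes[i] += 1
                if all(p >= 1 for p in passes):
                    return  # combined epoch ends: every dataset seen in full
                # wrap dataset i around; the other iterators are untouched
                iterators[i] = iter(self._datasets[i])
                yield next(iterators[i])
```

Uniform sampling keeps the sketch short; any weighting scheme could slot into the index choice.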
Hi! Thanks for your contribution, and great first issue!
Could this also be supported together with the ability to specify train_weight_factors? And what is the status of this request?
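In principle the two should compose; a minimal sketch, assuming a per-dataset weights argument in the spirit of train_weight_factors (hypothetical, not a confirmed API):

```python
import random
from typing import Any, Iterable, Iterator, List, Sequence


def weighted_wrap_iter(
    datasets: List[Iterable],
    weights: Sequence[float],
    seed: int = 42,
) -> Iterator[Any]:
    """Draw each sample from dataset i with probability proportional to
    weights[i]; exhausted datasets wrap around silently (assumes non-empty)."""
    rng = random.Random(seed)
    iterators = [iter(d) for d in datasets]
    while True:
        (i,) = rng.choices(range(len(datasets)), weights=weights)
        try:
            yield next(iterators[i])
        except StopIteration:
            iterators[i] = iter(datasets[i])  # wrap around, no epoch end
            yield next(iterators[i])


# The stream is infinite, so a consumer bounds it explicitly, e.g.:
# from itertools import islice
# samples = list(islice(weighted_wrap_iter([small_ds, big_ds], [0.2, 0.8]), 1000))
```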
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Let's keep this issue open; it's worth exploring at some point.