
Allow a StreamingDataset to wrap around when running in a CombinedStreamingDataset

Open lantiga opened this issue 1 year ago • 5 comments

🚀 Feature

Consider adding the ability to wrap around a StreamingDataset without issuing a StopIteration when combining datasets.

This is something we haven't ported from PackedDataset https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/packed_dataset.py#L190

Motivation

This is useful for combining a smaller dataset with a very large one: we can make sure certain batches of data make it into the training process frequently enough, without invalidating the epoch when the other datasets are multiple billions of tokens in size.

Pitch

Add a wrap property to StreamingDataset so that, instead of raising a StopIteration, it just keeps looping through the data.
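A minimal sketch of the proposed wrap semantics (names here are illustrative, not the current litdata API): iteration never raises StopIteration, it just cycles over the data indefinitely, which is what `itertools.cycle` does for a plain iterable:

```python
import itertools

# Stand-in for a small StreamingDataset with wrap enabled: the iterator
# restarts instead of raising StopIteration at the end of the data.
data = [10, 20, 30]
wrapped = itertools.cycle(data)

# Pulling more items than the dataset holds wraps around transparently.
first_seven = [next(wrapped) for _ in range(7)]
# first_seven == [10, 20, 30, 10, 20, 30, 10]
```

For a real StreamingDataset the wrap-around would also need to reshuffle per pass and keep the internal epoch/offset bookkeeping consistent, which `itertools.cycle` does not do.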

Alternatives

Add handling of this at the CombinedStreamingDataset level, so that each dataset raises StopIteration when it needs to, but without invalidating the others. In both cases we need to decide what happens to the epoch counter within the dataset.
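This alternative could look roughly like the sketch below (a hypothetical `CycleWrapper`, not litdata code): the small dataset swallows its own StopIteration and restarts, so the combined iteration's epoch is defined by the largest dataset.

```python
class CycleWrapper:
    """Wraps an iterable; restarts it instead of propagating StopIteration."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._it = iter(dataset)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            # Wrap around: restart the underlying dataset silently.
            self._it = iter(self.dataset)
            return next(self._it)

small = CycleWrapper([1, 2, 3])    # stands in for the small StreamingDataset
large = range(7)                   # stands in for the multi-billion-token one
combined = list(zip(large, small)) # epoch length follows the large dataset
# combined == [(0, 1), (1, 2), (2, 3), (3, 1), (4, 2), (5, 3), (6, 1)]
```

The open question from the issue remains: whether the wrapped dataset's own epoch counter should advance on each wrap-around or stay pinned to the outer epoch.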

lantiga avatar Mar 14 '24 21:03 lantiga

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Mar 14 '24 21:03 github-actions[bot]

Can this be supported together with the ability to specify train_weight_factors? Also, what is the status of this request?

hubenjm avatar Aug 23 '24 20:08 hubenjm

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]

Let's keep this issue open; it's worth exploring at some point.

bhimrazy avatar Apr 17 '25 07:04 bhimrazy