
Allow a StreamingDataset to wrap around when running in a CombinedStreamingDataset

Open lantiga opened this issue 1 year ago • 5 comments

🚀 Feature

Consider adding the ability to wrap around a StreamingDataset without issuing a StopIteration when combining datasets.

This is something we haven't ported from PackedDataset https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/packed_dataset.py#L190

Motivation

This is useful for combining a smaller dataset with a very large one: we can make sure certain batches of data make it into the training process frequently enough, without invalidating the epoch when the other datasets are multiple billions of tokens in size.

Pitch

Add a wrap property to StreamingDataset so that, instead of raising a StopIteration, it just keeps looping through the data.
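A minimal sketch of the proposed wrap semantics (names here are illustrative, not the current litdata API): iteration never raises StopIteration, it just cycles over the data indefinitely, which is what `itertools.cycle` does for a plain iterable:

```python
import itertools

# Stand-in for a small StreamingDataset with wrap enabled: the iterator
# restarts instead of raising StopIteration at the end of the data.
data = [10, 20, 30]
wrapped = itertools.cycle(data)

# Pulling more items than the dataset holds wraps around transparently.
first_seven = [next(wrapped) for _ in range(7)]
# first_seven == [10, 20, 30, 10, 20, 30, 10]
```

For a real StreamingDataset the wrap-around would also need to reshuffle per pass and keep the internal epoch/offset bookkeeping consistent, which `itertools.cycle` does not do.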

Alternatives

Add handling of this at the CombinedStreamingDataset level, so that each dataset raises StopIteration when it needs to, but without invalidating the others. In both cases we need to decide what happens to the epoch counter within the dataset.
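This alternative could look roughly like the sketch below (a hypothetical `CycleWrapper`, not litdata code): the small dataset swallows its own StopIteration and restarts, so the combined iteration's epoch is defined by the largest dataset.

```python
class CycleWrapper:
    """Wraps an iterable; restarts it instead of propagating StopIteration."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._it = iter(dataset)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            # Wrap around: restart the underlying dataset silently.
            self._it = iter(self.dataset)
            return next(self._it)

small = CycleWrapper([1, 2, 3])    # stands in for the small StreamingDataset
large = range(7)                   # stands in for the multi-billion-token one
combined = list(zip(large, small)) # epoch length follows the large dataset
# combined == [(0, 1), (1, 2), (2, 3), (3, 1), (4, 2), (5, 3), (6, 1)]
```

The open question from the issue remains: whether the wrapped dataset's own epoch counter should advance on each wrap-around or stay pinned to the outer epoch.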

lantiga avatar Mar 14 '24 21:03 lantiga

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Mar 14 '24 21:03 github-actions[bot]

Can this be supported together with the ability to specify train_weight_factors? Also, what is the status of this request?

hubenjm avatar Aug 23 '24 20:08 hubenjm

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]

Let's keep this issue open; it's worth exploring at some point.

bhimrazy avatar Apr 17 '25 07:04 bhimrazy