Allow Stream's `repeat` option to cycle through entire dataset before repeating, when `shuffle=True`
🚀 Feature Request
I am using the `repeat` option when creating a stream, i.e. `Stream(repeat=2)`, together with random shuffling, i.e. `StreamingDataset(shuffle=True)`. There appears to be no guarantee that every sample is surfaced once before any sample repeats. The ideal behavior for my use case is to go through every sample once, in shuffled order, before seeing any sample a second time. Is there already a way to achieve this behavior, and if not, would it be possible to add it? Thanks!
Motivation
For various reasons I am constructing datasets in which samples should be duplicated a certain number of times, but each sample should be seen once before any sample is seen a second time.
[Optional] Implementation
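The requested ordering could be sketched as follows. This is a plain-Python illustration of the desired sample order only, not Streaming's API or its shuffling implementation; `epochwise_repeat_order` and its parameters are hypothetical names introduced for this example.

```python
import random
from typing import List, Optional


def epochwise_repeat_order(num_samples: int,
                           repeat: int,
                           seed: Optional[int] = None) -> List[int]:
    """Return sample IDs so that every sample appears once per pass.

    Each of the `repeat` passes is an independent shuffle of all sample IDs,
    so no sample is seen a second time until every sample has been seen once.
    """
    rng = random.Random(seed)
    order: List[int] = []
    for _ in range(repeat):
        pass_ids = list(range(num_samples))
        rng.shuffle(pass_ids)  # shuffle within this pass only
        order.extend(pass_ids)
    return order


# Hypothetical usage: 5 samples, each repeated twice.
order = epochwise_repeat_order(num_samples=5, repeat=2, seed=0)
first_pass, second_pass = order[:5], order[5:]
```

With this ordering, the first `num_samples` IDs always cover the whole dataset, which is the constraint described above. A shuffle over the already-duplicated sample list does not give that guarantee.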
Additional context
@m-harmonic Does keeping `repeat=1` and iterating over multiple epochs help in any way? Or are you using multiple streams, where each stream has a repeat >= 1?
@karan6181 Yes, exactly: we have cases with multiple streams, some of which have multiple repeats. Separately, we are also experiencing a problem that forces us to train within a single epoch, so duplicating the data within one epoch is the workaround we're trying to use. Do you think there is a possible fix or an easy solution?
Hey, @m-harmonic, thanks for the clarification. Unfortunately, we don't support that use case at the moment. Why is it important for you that each sample be seen once before any sample is seen a second time? Have you tried our new shuffling algorithms, py1e and py1br, which provide excellent shuffle quality? I doubt that you will see any convergence issues with sample ordering. I recommend using our new streaming simulator to find the correct set of hyperparameters for the best performance.
@m-harmonic Can you also explain why you would want the repeated samples to show up after going through the original dataset? Can you please share your use case and what exactly you are trying to do? Thanks!
@m-harmonic Gentle reminder on the above question.