metaseq
Make epoch mean epoch again (in streaming language modeling)
We currently have some unfortunate naming here: https://github.com/facebookresearch/metaseq/blob/4288451502667dda2be71a0a1a9df5066b583ae8/metaseq/tasks/streaming_language_modeling.py#L271-L290
Our training corpus is chunked into shards, but each shard is referred to as an epoch.
We should fix this confusing naming so it is clear that an epoch consists of shards, and that an epoch ends only when we start repeating / looping over the same dataset again.
Relates to https://github.com/facebookresearch/metaseq/pull/189 and https://github.com/facebookresearch/metaseq/issues/166.
Note: be careful with RNG state here. On restart, we want to make sure the shuffled dataset is in the same order.
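To illustrate the restart concern, here is a minimal sketch (not metaseq's actual code) of one way to make a shuffle reproducible: derive the RNG purely from a base seed and the epoch number, so a restarted job recomputes the same order without restoring any saved RNG state. The function name `shuffled_order` and its parameters are hypothetical.

```python
import random


def shuffled_order(num_items, seed, epoch):
    """Return a permutation of range(num_items) that depends only on (seed, epoch).

    Hypothetical helper for illustration; not part of metaseq.
    """
    # Seeding with a string is deterministic across processes, so the same
    # (seed, epoch) pair always yields the same permutation after a restart.
    rng = random.Random(f"{seed}:{epoch}")
    order = list(range(num_items))
    rng.shuffle(order)
    return order


# Simulated restart: recomputing the order for the same epoch gives the same result.
before_restart = shuffled_order(8, seed=1234, epoch=2)
after_restart = shuffled_order(8, seed=1234, epoch=2)
```

The design choice is to avoid carrying mutable RNG state across checkpoints: if the order is a pure function of the seed and the epoch, the only thing a checkpoint needs to record is how far into that order training had progressed.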
Is the intent of this issue just to rename epoch to something more meaningful and intuitive, like shard_index?
@KUNAL1612 To some degree - you'll have to look at why it was named epoch in the first place (i.e. how epoch is used outside of this class). My cursory understanding is that we also have some kind of "shuffling" / rng state that is tracked / refreshed per "epoch", which gets hijacked here to be applied across "shards".
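As a rough sketch of the naming being discussed, the counter that is currently called "epoch" could be split into a true epoch and a shard index. This assumes a 0-based running shard counter and a fixed number of shards per pass over the corpus; the name `split_shard_counter` and both parameters are hypothetical, not metaseq's API.

```python
def split_shard_counter(global_shard_count, shards_per_epoch):
    """Map a running 0-based shard counter to (epoch, shard_index).

    Hypothetical helper for illustration; not part of metaseq.
    """
    if shards_per_epoch <= 0:
        raise ValueError("shards_per_epoch must be positive")
    if global_shard_count < 0:
        raise ValueError("global_shard_count must be non-negative")
    # The quotient counts full passes over the corpus (the true epoch);
    # the remainder is the position of the shard within the current pass.
    return divmod(global_shard_count, shards_per_epoch)


# With 4 shards per pass, the 6th shard overall (counter 5) is shard 1 of epoch 1.
epoch, shard_index = split_shard_counter(5, shards_per_epoch=4)
```

Any per-"epoch" shuffling or RNG refresh would then need an explicit decision about whether it keys off the true epoch, the shard index, or both, which is the part that needs checking against how epoch is used outside this class.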