Make epoch mean epoch again (in streaming language modeling)

Open suchenzang opened this issue 2 years ago • 2 comments

We currently have the following unfortunate naming: https://github.com/facebookresearch/metaseq/blob/4288451502667dda2be71a0a1a9df5066b583ae8/metaseq/tasks/streaming_language_modeling.py#L271-L290

where our training corpus is chunked up into shards, but each shard gets referenced as an epoch.

We should fix this confusing naming to make it clear that an epoch consists of shards (before repeating / looping over the same dataset again).
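As a rough illustration of the intended meaning, the sketch below splits a single per-shard counter into a true epoch number and a shard index. The function name `split_epoch_counter` is hypothetical and not part of metaseq, and the 1-based counter is an assumption about how the existing "epoch" value is counted.

```python
from typing import Tuple


def split_epoch_counter(counter: int, num_shards: int) -> Tuple[int, int]:
    """Map a per-shard counter to (epoch, shard_index).

    Assumption: `counter` is 1-based and increases by one per shard.
    The returned epoch is 1-based; the shard index is 0-based.
    """
    if counter < 1 or num_shards < 1:
        raise ValueError("counter and num_shards must both be >= 1")
    # One full pass over all shards is one epoch.
    full_passes, shard_index = divmod(counter - 1, num_shards)
    return full_passes + 1, shard_index
```

With 4 shards, counters 1 through 4 all belong to epoch 1, and counter 5 starts epoch 2 at shard 0.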

Relates to https://github.com/facebookresearch/metaseq/pull/189 and https://github.com/facebookresearch/metaseq/issues/166.

Note: be careful with rng state here. On restart, we want to make sure the shuffled dataset is in the same order.
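One possible way to keep the order stable across restarts is to derive the shuffle rng purely from values that are saved in the checkpoint, rather than from rng state carried over in memory. The sketch below assumes a base seed, an epoch number, and a shard index are available; `shuffled_order` is a hypothetical helper and not how metaseq currently does it.

```python
import random
from typing import List


def shuffled_order(num_items: int, seed: int, epoch: int, shard_index: int) -> List[int]:
    """Return a permutation of range(num_items) that depends only on
    (seed, epoch, shard_index), so a restarted job reproduces it."""
    # A string seed is hashed deterministically by random.Random,
    # independent of process state.
    rng = random.Random(f"{seed}-{epoch}-{shard_index}")
    order = list(range(num_items))
    rng.shuffle(order)
    return order
```

Because the rng is rebuilt from `(seed, epoch, shard_index)` each time, renaming or splitting the counter must preserve these inputs if already-running jobs are expected to resume with the same order.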

suchenzang, Jul 03 '22 14:07

Is the intent of this issue just to rename epoch to something more meaningful and intuitive, like shard_index?

KUNAL1612, Jul 12 '22 18:07

@KUNAL1612 To some degree. You'll have to look at why it was named epoch in the first place (i.e. how epoch is used outside of this class). My cursory understanding is that we also have some kind of "shuffling" / rng state that is tracked and refreshed per "epoch", and that mechanism gets hijacked here so it is applied per "shard" instead.

suchenzang, Jul 13 '22 20:07