
Add options to precompute the epoch

Open knighton opened this issue 1 year ago • 0 comments

Add the option to pre-generate the epoch. This should save us a lot of time when there is a lot of work happening between creating the StreamingDataset and iterating it.

Pre-generating can happen concurrently with the last third of init and beyond, by specifying which epoch and sample offset to generate (init_pregen_epoch, init_pregen_sample). Note that this happens before any load_state_dict(), so if training is going to resume from somewhere other than 0:0, we won't know it at that time, although the user might. Also, we can't just yolo all the epochs at once because of RAM/scale concerns. Finally, we need to be given the DataLoader num_workers for this to work, as we otherwise won't know it in a rank process without resorting to the garbage collector trampoline.

Pre-generating can happen on the fly as well, and more easily so, by setting the bool arg pregen_next_epoch, which simply pre-generates epoch + 1:0 in the background once the current epoch has been generated (or loaded from a pre-generated copy).
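A minimal sketch of the on-the-fly behavior described above, assuming a single background thread is enough. The `EpochPregenerator` class and the `generate` callable are hypothetical names for illustration only; they are not part of the streaming library, and the real epoch generation would be whatever StreamingDataset already does.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class EpochPregenerator:
    """Hypothetical helper: returns an epoch's sample order and queues the next one."""

    def __init__(self,
                 generate: Callable[[int, int], List[int]],
                 pregen_next_epoch: bool = True) -> None:
        self._generate = generate  # assumed signature: (epoch, sample offset) -> sample order
        self._pregen_next_epoch = pregen_next_epoch
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Dict[Tuple[int, int], Future] = {}

    def get(self, epoch: int, sample: int = 0) -> List[int]:
        # Use the pre-generated result if one was queued, else generate now.
        future = self._pending.pop((epoch, sample), None)
        order = future.result() if future is not None else self._generate(epoch, sample)

        # When done with the current epoch, start epoch + 1:0 in the background.
        nxt = (epoch + 1, 0)
        if self._pregen_next_epoch and nxt not in self._pending:
            self._pending[nxt] = self._pool.submit(self._generate, *nxt)
        return order

    def close(self) -> None:
        # Wait for any in-flight pre-generation before shutting down.
        self._pool.shutdown(wait=True)
```

Only one future epoch is ever held at a time, which keeps with the RAM/scale concern about not generating all epochs at once.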

Timing is controlled by pregen_epoch_timeout (defaults to 12 min, i.e. 720 seconds) and pregen_epoch_tick (defaults to 0xCAFE / 1337 / 42, or just under a second).
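As a quick check of the two defaults, the arithmetic can be reproduced with the standard library. This sketch uses `math.prod(range(1, 7))` as a stand-in for the `np.arange(1, 7).prod()` expression in the signature, and assumes both values are in seconds, which is what "12 min" implies.

```python
import math

# 1 * 2 * 3 * 4 * 5 * 6 = 720 seconds, i.e. 12 minutes.
pregen_epoch_timeout = float(math.prod(range(1, 7)))

# 0xCAFE is 51966; dividing by 1337 and then 42 gives roughly 0.925 seconds.
pregen_epoch_tick = 0xCAFE / 1337 / 42

print(pregen_epoch_timeout / 60)    # minutes
print(round(pregen_epoch_tick, 3))  # seconds
```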

                 init_pregen_epoch: Optional[int] = None,
                 init_pregen_sample: Optional[int] = None,
                 pregen_next_epoch: bool = True,
                 pregen_epoch_timeout: Optional[float] = float(np.arange(1, 7).prod()),
                 pregen_epoch_tick: float = 0xCAFE / 1337 / 42,
                 num_workers: Optional[int] = None,
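One plausible reading of the timeout and tick arguments is a polling loop: check every tick whether the pre-generated epoch is ready, and give up after the timeout. The sketch below assumes that reading; `wait_for_pregen` and `is_ready` are hypothetical names, and treating a timeout of None as "wait indefinitely" is also an assumption drawn from the Optional type.

```python
import time
from typing import Callable, Optional


def wait_for_pregen(is_ready: Callable[[], bool],
                    timeout: Optional[float] = 720.0,
                    tick: float = 0xCAFE / 1337 / 42) -> bool:
    """Poll is_ready() every `tick` seconds; return False if `timeout` elapses first."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_ready():
        if deadline is not None and time.monotonic() >= deadline:
            return False  # caller can fall back to generating the epoch itself
        time.sleep(tick)
    return True
```

A caller that gets False back could generate the epoch synchronously instead, so a stalled background job never blocks iteration for longer than the timeout.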

knighton, Jan 20 '24 15:01