Add options to precompute the epoch
Add the option to pre-generate the epoch. This should save a lot of time when substantial work happens between creating the StreamingDataset and iterating it.
Pre-generating can happen concurrently with the last third of init and beyond if you provide which epoch and sample offset to generate (`init_pregen_epoch`, `init_pregen_sample`). Note that this happens before any `load_state_dict()`, so if training is going to resume from somewhere other than `0:0`, we won't know it at that time, although the user might. Also, we can't just yolo all the epochs at once because of RAM/scale concerns. Finally, we need to be given the DataLoader `num_workers` for this to work, as we won't otherwise know it in a rank process without resorting to the garbage collector trampoline.
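As a rough illustration of why the epoch, the sample offset, and `num_workers` all have to be known up front, here is a minimal sketch of pre-generating one epoch's sample assignment. It is not the library's actual partitioning algorithm: the helper `sketch_pregen_epoch`, its round-robin split, and the `seed + epoch` shuffle are assumptions made for illustration only.

```python
import random
from typing import List


def sketch_pregen_epoch(num_samples: int, num_ranks: int, num_workers: int,
                        epoch: int, sample_offset: int = 0,
                        seed: int = 0) -> List[List[List[int]]]:
    """Hypothetical helper: build the per-rank, per-worker sample lists for one epoch."""
    if num_ranks < 1 or num_workers < 1:
        raise ValueError('num_ranks and num_workers must both be at least 1')
    # Deterministic shuffle per epoch, so every process computes the same ordering.
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)
    # Skip samples already consumed when resuming mid-epoch (the "epoch:sample" offset).
    remaining = order[sample_offset:]
    # Round-robin split across every (rank, worker) pair. The split cannot be
    # computed without knowing num_workers, which is why it must be passed in.
    parts = num_ranks * num_workers
    return [[remaining[rank * num_workers + worker::parts]
             for worker in range(num_workers)]
            for rank in range(num_ranks)]
```

Calling `sketch_pregen_epoch(10, 2, 2, epoch=1, sample_offset=2)` yields a 2 x 2 grid of sample lists covering the 8 samples that remain after the offset.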
Pre-generating can also happen on the fly, and more easily so, by setting the bool arg `pregen_next_epoch`, which simply pre-generates `epoch + 1:0` (the next epoch, starting at sample 0) in the background once the current epoch has been generated (or loaded from a pre-generated copy).
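The on-the-fly behavior can be sketched with a single background thread. This is a simplified illustration under assumptions, not the PR's implementation: the class `EpochPregenerator` and the `generate` callback are hypothetical names, and the real code may coordinate across processes rather than threads.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class EpochPregenerator:
    """Hypothetical wrapper: serve an epoch, then pre-generate the next one in the background."""

    def __init__(self, generate: Callable[[int], List[int]],
                 pregen_next_epoch: bool = True) -> None:
        self._generate = generate
        self._pregen_next_epoch = pregen_next_epoch
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Dict[int, Future] = {}

    def get_epoch(self, epoch: int) -> List[int]:
        # Use the pre-generated result if one exists, else generate now.
        future = self._pending.pop(epoch, None)
        # Forget any other pre-generated epochs so at most one is held in memory.
        self._pending.clear()
        plan = future.result() if future is not None else self._generate(epoch)
        if self._pregen_next_epoch:
            # Start on epoch + 1 only after the current epoch is ready.
            self._pending[epoch + 1] = self._pool.submit(self._generate, epoch + 1)
        return plan

    def close(self) -> None:
        # Wait for any in-flight pre-generation before shutting down.
        self._pool.shutdown(wait=True)
```

Holding only one pending epoch at a time reflects the RAM concern above: the next epoch is prepared, but never all of them.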
Details are managed by `pregen_epoch_timeout` (defaults to 12 min) and `pregen_epoch_tick` (defaults to `0xCAFE / 1337 / 42`, or just under a second).
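To make the two knobs concrete, here is a minimal sketch of a polling wait governed by a timeout and a tick. The function `wait_for_pregen` and the `is_ready` callback are hypothetical, and treating a timeout of `None` as "wait indefinitely" is an assumption. The constants reproduce the stated defaults using `math.prod` in place of NumPy.

```python
import math
import time
from typing import Callable, Optional

# Same values as the defaults above: 1*2*3*4*5*6 = 720 seconds (12 minutes),
# and 0xCAFE / 1337 / 42, which is roughly 0.925 seconds.
PREGEN_EPOCH_TIMEOUT = float(math.prod(range(1, 7)))
PREGEN_EPOCH_TICK = 0xCAFE / 1337 / 42


def wait_for_pregen(is_ready: Callable[[], bool],
                    timeout: Optional[float] = PREGEN_EPOCH_TIMEOUT,
                    tick: float = PREGEN_EPOCH_TICK) -> None:
    """Hypothetical helper: poll is_ready() every `tick` seconds until it returns True."""
    start = time.monotonic()
    while not is_ready():
        # Assumption: timeout=None means wait indefinitely.
        if timeout is not None and time.monotonic() - start >= timeout:
            raise TimeoutError(f'Pre-generated epoch not ready after {timeout} seconds')
        time.sleep(tick)
```

A shorter tick notices a finished epoch sooner at the cost of more frequent polling; the timeout bounds how long a consumer waits on a pre-generation that never completes.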
init_pregen_epoch: Optional[int] = None,
init_pregen_sample: Optional[int] = None,
pregen_next_epoch: bool = True,
pregen_epoch_timeout: Optional[float] = float(np.arange(1, 7).prod()),
pregen_epoch_tick: float = 0xCAFE / 1337 / 42,
num_workers: Optional[int] = None,