Moritz Gunz

Results: 133 comments by Moritz Gunz

The issue was likely introduced in https://github.com/rwth-i6/returnn/pull/1630, which added `_num_shards` and `_shard_index` as `__init__` parameters to `Dataset`. Parameters with a leading underscore are pickled by `Dataset.__reduce__`, which means the...
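To make the mechanism above concrete, here is a minimal sketch of how a `__reduce__` that rebuilds an object from its `__init__` arguments carries underscore-prefixed parameters through a pickle round trip. `ToyDataset` and `_rebuild` are hypothetical names used only for illustration; this is not RETURNN's actual `Dataset` implementation, and it does not try to fill in the consequence that is cut off in the comment.

```python
import pickle


class ToyDataset:
    """Hypothetical stand-in for illustration, not RETURNN's Dataset."""

    def __init__(self, name, _num_shards=1, _shard_index=0):
        self.name = name
        self._num_shards = _num_shards
        self._shard_index = _shard_index

    def __reduce__(self):
        # Tell pickle to rebuild the object by calling __init__ again with
        # the stored values, including the underscore-prefixed parameters.
        kwargs = {
            "name": self.name,
            "_num_shards": self._num_shards,
            "_shard_index": self._shard_index,
        }
        return (_rebuild, (type(self), kwargs))


def _rebuild(cls, kwargs):
    # Module-level helper so that pickle can locate it by name.
    return cls(**kwargs)


original = ToyDataset("train", _num_shards=4, _shard_index=2)
restored = pickle.loads(pickle.dumps(original))
print(restored._num_shards, restored._shard_index)
```

In this toy setup, any process that unpickles the object gets the same sharding values as the process that created it, which is the kind of behavior the comment attributes to `Dataset.__reduce__`.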

Thank you for your comment, that makes sense to me. I was thinking of a scheme where you do parameter averaging after different steps depending on whether you are averaging...

> That's all what is needed, right? Hmm, when would you set up the sub process groups needed for synchronization? E.g. on the first invocation of the function? In the...

> So what you actually say is that using `num_workers > 0` is the workaround for now? `persistent_workers` does not need to be set. Yes, exactly, because we have the...

I filed https://github.com/pytorch/pytorch/issues/129868

Apparently this is a won't-fix on the torch side. Our workaround is the officially recommended solution.

During RETURNN startup, we first initialize the dataset and only afterwards the engine. This makes it quite difficult to use distributed training primitives for regular syncs. Perhaps we...

Hi Otto, judging from here, this looks like transient connection problems on Spotify's, Firebase's, or your own internet connection's side, which Festify then eventually chokes on. If it really is...