Albert Zeyer

972 comments by Albert Zeyer

How would the user specify such a post-processing function per dataset? It could be another argument of the dataset itself, so the user specifies it like: ```python train = { ...,...
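A minimal sketch of what that could look like (the `post_process` argument name and the function signature are just assumptions for illustration here, not an existing RETURNN option):

```python
# Hypothetical sketch; "post_process" is an assumed option name, not existing API.

def my_post_process(tensor_dict, **_other):
    """Gets one sequence (e.g. as a TensorDict / dict of arrays),
    modifies it (e.g. speed perturbation, masking), and returns it."""
    return tensor_dict

train = {
    "class": "LibriSpeechCorpus",  # just some example dataset class
    # ... the usual dataset options ...
    "post_process": my_post_process,  # assumed new per-dataset option
}
```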

The post-processing function is not per task but per dataset. At least that is what I wrote above. Or do you want to have it per task? But I guess...

> in the engine class you know for what the dataset is used and from which name in the config it comes from (I hope)

No, you don't. E.g. we...

One aspect I realized now: Where exactly would this be executed? As this is now outside the dataset, `MultiProcDataset` cannot really make use of this, so it cannot be parallelized...

Another aspect came up (@Judyxujj): We were interested in implementing mixup in this post-processing function. But this is not really possible with the current design. This additionally needs: *...
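For context, mixup mixes the current feature sequence with (parts of) previously seen sequences, so the post-processing function needs state across sequences, e.g. a buffer of earlier features. A rough sketch of the idea, with all names and details being assumptions for illustration:

```python
import numpy

class Mixup:
    """Hypothetical stateful post-processing callable: mixes the current
    feature sequence with a randomly chosen earlier one from a buffer."""

    def __init__(self, *, buffer_size: int = 100, alpha: float = 0.2):
        self.buffer = []  # feature arrays of previously seen sequences
        self.buffer_size = buffer_size
        self.alpha = alpha
        self.rng = numpy.random.default_rng()

    def __call__(self, features: numpy.ndarray) -> numpy.ndarray:
        if self.buffer:
            other = self.buffer[self.rng.integers(len(self.buffer))]
            n = min(len(features), len(other))  # mix only the overlapping part
            lam = self.rng.beta(self.alpha, self.alpha)
            features = features.copy()
            features[:n] = lam * features[:n] + (1.0 - lam) * other[:n]
        if len(self.buffer) < self.buffer_size:
            self.buffer.append(features)
        return features
```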

> That would interact favourably with multiprocessing,

No, not really. Only the `DataLoader` multiproc would apply here, which is usually just a single proc. But we want to have multiple...

> I was under the assumption that in RETURNN+PT the data loader `num_workers` is basically a replacement for `MultiProcDataset`. I.e. in the cases where I want to use more than...
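For reference, this is roughly how a dataset is wrapped in `MultiProcDataset`, which is the kind of parallelism the post-processing would miss if it runs outside the dataset (the exact option names here are from memory and should be checked against the docs):

```python
# Rough sketch: wrapping the actual dataset in MultiProcDataset, so that
# sequence loading (and anything running inside the dataset) is distributed
# over several worker processes.

train = {
    "class": "MultiProcDataset",
    "dataset": {
        "class": "LibriSpeechCorpus",  # the real underlying dataset
        # ... usual dataset options ...
    },
    "num_workers": 4,
    "buffer_size": 10,
}
```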

Btw, after some discussion yesterday with @curufinwe, I think a pragmatic and simple solution for now is really to implement this as a new separate dataset, such as this `PostProcessingDataset`. This directly...
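A sketch of how such a `PostProcessingDataset` could be used in the config (the option names like `dataset` and `map_seq` are assumptions for illustration; the actual interface is what is being decided here):

```python
# Hypothetical usage sketch for the proposed PostProcessingDataset.

from returnn.tensor import TensorDict  # type used for a single sequence

def my_post_process(tensor_dict: TensorDict, **_other) -> TensorDict:
    """Gets one sequence as a TensorDict, returns the (modified) TensorDict."""
    # e.g. apply speed perturbation or masking here
    return tensor_dict

train = {
    "class": "PostProcessingDataset",
    "dataset": {
        "class": "LibriSpeechCorpus",  # the wrapped real dataset
        # ... usual dataset options ...
    },
    "map_seq": my_post_process,  # assumed option name
}
```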

For some other examples of similar processing datasets, see `VariableDataset`, `MetaDataset`, `AnythingDataset`, `ConcatSeqsDataset`. Btw, in the main post, I extended the list of example post-processing functions a bit. One...

> However, this should be implemented in a streaming way, i.e. it gets in a sequence of `TensorDict`, and should output a new sequence of `TensorDict`. The question is a...
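As a rough illustration of that streaming interface (the name and exact signature here are assumptions, not a fixed API):

```python
from typing import Iterator
from returnn.tensor import TensorDict

def my_stream_post_process(stream: Iterator[TensorDict], **_other) -> Iterator[TensorDict]:
    """Hypothetical streaming variant: consumes an iterator over sequences
    and yields new (possibly fewer or more) sequences."""
    for tensor_dict in stream:
        # one could buffer several sequences here, e.g. to mix or concatenate
        # them, and yield zero, one, or multiple output sequences per input
        yield tensor_dict
```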