Albert Zeyer
How would the user specify such a post-processing function per dataset? It could be another argument for the dataset itself, so the user specifies it like:

```python
train = {
    ...,
}
```
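Purely as a sketch of that idea (the `post_process` option name is my assumption here, not an existing RETURNN dataset argument), it could look like this:

```python
def _train_post_process(tensor_dict):
    # purely illustrative: modify/augment a single sequence and return it
    return tensor_dict

train = {
    "class": "HDFDataset",  # any existing dataset class, just as an example
    "files": ["train.hdf"],
    # hypothetical new per-dataset option (name is an assumption, not an existing RETURNN arg):
    "post_process": _train_post_process,
}
```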
The post-processing function is not per task but per dataset. At least that is what I wrote above. Or do you want to have it per task? But I guess...
> in the engine class you know for what the dataset is used and from which name in the config it comes from (I hope)

No, you don't. E.g. we...
One aspect I just realized: where exactly would this be executed? As this is now outside the dataset, `MultiProcDataset` cannot really make use of it, so it cannot be parallelized...
Another aspect came up (@Judyxujj): we were interested in implementing mixup in this post-processing function. But this is not really possible with the current design (see the sketch below). This additionally needs: *...
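To make that concrete, here is a rough sketch (my own, not a design decided in this thread) of mixup as a transformation over a stream of sequences. The point is that it needs a buffer of earlier sequences, i.e. state across sequences, which a purely per-sequence function does not have. Plain dicts of numpy arrays with a `"data"` key stand in for the real data structures here; mixing of targets / loss handling is left out:

```python
from typing import Dict, Iterator, List
import random
import numpy


def mixup_stream(
    stream: Iterator[Dict[str, numpy.ndarray]],
    *,
    buffer_size: int = 100,
    mix_prob: float = 0.5,
    lambda_min: float = 0.1,
    lambda_max: float = 0.4,
) -> Iterator[Dict[str, numpy.ndarray]]:
    """Sketch: mixup as a stream transform, keeping a buffer of earlier feature sequences."""
    buffer: List[numpy.ndarray] = []
    for seq in stream:
        feats = seq["data"]  # [time, feature_dim], assumed float features
        if buffer and random.random() < mix_prob:
            other = random.choice(buffer)
            lam = random.uniform(lambda_min, lambda_max)
            # mix only the overlapping part of the two sequences
            t = min(feats.shape[0], other.shape[0])
            mixed = feats.copy()
            mixed[:t] = (1.0 - lam) * feats[:t] + lam * other[:t]
            seq = dict(seq, data=mixed)
        buffer.append(feats)
        if len(buffer) > buffer_size:
            buffer.pop(0)
        yield seq
```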
> That would interact favourably with multiprocessing,

No, not really. Only the `DataLoader` multiproc would apply here, which is usually just a single proc. But we want to have multiple...
> I was under the assumption that in RETURNN+PT the data loader `num_workers` is basically a replacement for `MultiProcDataset`. I.e. in the cases where I want to use more than...
Btw, after some discussion yesterday with @curufinwe, I think a pragmatic, simple solution for now is really to implement this as a new, separate dataset, such a `PostProcessingDataset`. This directly...
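As a sketch of how such a wrapper dataset could be specified in a config (class and argument names here are assumptions, just following the naming used in this thread and the usual wrapping pattern of e.g. `MetaDataset`):

```python
def _post_process_seq(tensor_dict):
    # per-sequence transformation, e.g. feature normalization or augmentation
    return tensor_dict

train = {
    "class": "PostProcessingDataset",  # as proposed here; not an existing class at this point
    "dataset": {
        # the wrapped inner dataset, any existing dataset definition
        "class": "HDFDataset",
        "files": ["train.hdf"],
    },
    "map_seq": _post_process_seq,  # argument name is an assumption
}
```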
For some other examples of similar processing datasets, see `VariableDataset`, `MetaDataset`, `AnythingDataset`, `ConcatSeqsDataset`. Btw, in the main post, I extended the list of example post-processing functions a bit. One...
> However, this should be implemented in a streaming way, i.e. it gets in a sequence of `TensorDict`, and should output a new sequence of `TensorDict`.

The question is a...
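For illustration, a minimal sketch of that streaming interface as I read the quote (only a sketch; the exact signature is exactly what is being discussed here, and I assume `TensorDict` is importable from `returnn.tensor`):

```python
from typing import Iterator
from returnn.tensor import TensorDict


def map_seq_stream(stream: Iterator[TensorDict]) -> Iterator[TensorDict]:
    """Streaming post-processing sketch: consume a stream of TensorDicts,
    yield a new stream. It can modify, drop, buffer or insert sequences."""
    for tensor_dict in stream:
        # ... arbitrary per-sequence or cross-sequence logic here ...
        yield tensor_dict
```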