Refactor data modules
There are several problems with the current data modules:
- Filtering has to happen at the data source level. This means one can't discard examples in `process`, but instead must supply a `filter_fn` and wrap the source with a `FilterDataSource`. This is problematic in two ways: 1) users don't have access to the source in predefined sources such as `MonoTextData`; 2) filtering cannot depend on processing results. The second point led to a previous change that made `TextLineDataSource` return tokenized text, because it also needs to filter out sentences containing more than a certain number of words. This prevented users from applying it to arbitrary text, such as JSONL files. However, it shouldn't be hard to support discarding examples in `process`. For the interface, we can ask the user to return `None` to indicate discarding. The implementation should be straightforward --- just change the `BatchSampler` (and maybe `DynamicBatchSampler`) code. A sketch of what this interface could look like is given further below.

- There's a very subtle bug that I just realized. When a dataset has the following configurations:
  - Source does not support random access (e.g., `TextLine...`, `Iter...`).
  - Lazy strategy is `"process"`, i.e., data is eagerly loaded but lazily processed.
  - Caching strategy is not `"loaded"`, i.e., either don't cache at all, or cache the processed examples.
  - A dynamic batching strategy is used (e.g., `TokenCountBatchingStrategy`).
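To make this combination concrete, here is a minimal sketch of a setup that hits this path. It assumes the public `texar.torch.data` classes named in this issue (`TextLineDataSource`, `DatasetBase`, `DataIterator`, `TokenCountBatchingStrategy`); the file path, hparam values, and the tiny `collate` are illustrative only, not taken from any real example.

```python
import texar.torch as tx

# A source without random access: reads a text file line by line.
source = tx.data.TextLineDataSource("train.txt")  # hypothetical path

class MyData(tx.data.DatasetBase):
    def process(self, raw_example):
        # Runs lazily because lazy_strategy is "process".
        return raw_example  # TextLineDataSource already yields tokenized lines

    def collate(self, examples):
        return tx.data.Batch(len(examples), examples=examples)

data = MyData(source, hparams={
    "lazy_strategy": "process",     # eagerly load, lazily process
    "cache_strategy": "processed",  # anything other than "loaded"
})

# Dynamic batching by token count.
strategy = tx.data.TokenCountBatchingStrategy(1024)
iterator = tx.data.DataIterator(data, batching_strategy=strategy)

for batch in iterator:  # fails: raw examples were already erased from _CachedDataSource
    ...
```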
What happens under the hood is:
- The source is wrapped in a `_CachedDataSource` to support random access by prefetching.
- All raw examples are read from the source when the dataset is constructed.
- Since loaded raw examples are not cached, raw examples will be deleted from `_CachedDataSource` when they are accessed.
- Iteration starts, and the data iterator calls `reset()` on `_CachedDataSource`. This will result in the wrapped data source being unnecessarily iterated over again, and could be problematic if the source is an `IterDataSource`.
- Dynamic batching strategies require access to the processed examples to determine whether an example should be added to the batch. This access erases raw examples from `_CachedDataSource`.
- The PyTorch data loader then tries to access the processed examples, but they're already deleted. This results in an exception.
This case was not covered in our tests, because there are simply too many combinations when it comes to the data module. The current design requires a lot of checks and special handling of corner cases, and is not extensible to new features.
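Returning to the first problem: below is a rough sketch, not the actual implementation, of how the `None`-returning interface could look on the user side, plus the kind of skip logic the `BatchSampler` (or `DynamicBatchSampler`) would need. The helper `yield_batches` and its arguments are hypothetical.

```python
import texar.torch as tx

class MyData(tx.data.DatasetBase):
    def process(self, raw_example):
        # Proposed interface: returning None discards the example, so filtering
        # can depend on processing results instead of a separate filter_fn.
        if len(raw_example) > 50:
            return None
        return raw_example

def yield_batches(dataset, indices, batch_size):
    # Hypothetical sampler loop: indices whose processed example is None are skipped.
    batch = []
    for index in indices:
        if dataset[index] is None:  # user asked to discard this example
            continue
        batch.append(index)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```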
A possible new design could be:
- For lazy loading and caching loaded raw examples, instead of writing the logic in both the iterator and dataset, isolate them into a third class. This class could be another wrapper over the source.
- We might be able to do the same for lazy processing and caching processed examples.
- Some of the current design was made to work around certain limitations of the PyTorch data loader. We might as well rewrite more parts of the data loader, since we already do so to preserve compatibility with multiple PyTorch versions.
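For the first two points, a sketch of what such a wrapper could look like is below. The class name and its behavior here (always cache loaded raw examples) are made up for illustration; variants that drop examples after access, or that cache processed rather than raw examples, would follow the same shape.

```python
from typing import Generic, Iterable, Iterator, List, TypeVar

RawExample = TypeVar("RawExample")


class CachedSourceWrapper(Generic[RawExample]):
    """Hypothetical wrapper isolating the "lazy load + cache loaded examples" logic.

    The dataset and iterator would only ever call __getitem__; whether the
    underlying source supports random access becomes an internal detail.
    """

    def __init__(self, source: Iterable[RawExample]):
        self._source = source
        self._iterator: Iterator[RawExample] = iter(source)
        self._cache: List[RawExample] = []
        self._exhausted = False

    def __getitem__(self, index: int) -> RawExample:
        # Prefetch from the underlying source until `index` is available.
        while not self._exhausted and index >= len(self._cache):
            try:
                self._cache.append(next(self._iterator))
            except StopIteration:
                self._exhausted = True
        return self._cache[index]

    def reset(self) -> None:
        # Nothing to re-iterate: cached examples stay valid across epochs.
        pass
```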