Refactor data modules

huzecong opened this issue on Oct 7, 2019 · 0 comments

There are several problems with the current data modules:

  1. Filtering has to happen at the data source level. This means one cannot discard examples inside process; instead, one must supply a filter_fn and wrap the source in a FilterDataSource.

    This is problematic in two ways: 1) in predefined datasets such as MonoTextData, users don't have access to the underlying source; 2) filtering cannot depend on processing results. The second point led to a previous change that made TextLineDataSource return tokenized text, because the source also had to filter out sentences containing more than a certain number of words. This prevented users from applying TextLineDataSource to arbitrary text, such as JSONL files.
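
    For reference, the current workaround looks roughly like this (a sketch; the file name and the 50-token threshold are made up):

    ```python
    from texar.torch.data import FilterDataSource, TextLineDataSource

    # Filtering must wrap the source itself, before any processing:
    source = TextLineDataSource("data.txt")
    filtered = FilterDataSource(source, filter_fn=lambda ex: len(ex) <= 50)
    ```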

    However, it shouldn't be hard to support discarding examples inside process. For the interface, we can ask the user to return None to indicate that an example should be discarded. The implementation should be straightforward: just change the BatchSampler (and maybe DynamicBatchSampler) code.
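
    Under the proposed interface, discarding would happen inside the dataset itself. A minimal sketch, assuming the return-None convention (MyData and the 50-token threshold are made up; this is the proposal, not current behavior):

    ```python
    from typing import List, Optional

    from texar.torch.data import DatasetBase

    class MyData(DatasetBase):
        def process(self, raw_example: str) -> Optional[List[str]]:
            tokens = raw_example.split()
            if len(tokens) > 50:   # the condition can now depend on
                return None        # processing results; None = discard
            return tokens
    ```

    The batch sampler would then skip None values when assembling batches, so discarded examples never reach collate.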

  2. There's a very subtle bug that I just realized. It occurs when a dataset has the following combination of configurations (a sketch follows the list):

    • The source does not support random access (e.g., TextLineDataSource, IterDataSource).
    • Lazy strategy is "process", i.e., data is eagerly loaded but lazily processed.
    • Caching strategy is not "loaded", i.e., either don't cache at all, or cache the processed examples.
    • A dynamic batching strategy is used (e.g., TokenCountBatchingStrategy).
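
    Concretely, a configuration along these lines should trigger the bug (a hedged sketch; exact hparams keys, defaults, and required fields may differ):

    ```python
    from texar.torch.data import (
        DataIterator, DatasetBase, TextLineDataSource,
        TokenCountBatchingStrategy)

    source = TextLineDataSource("data.txt")  # no random access
    data = DatasetBase(source, hparams={
        "lazy_strategy": "process",          # eagerly load, lazily process
        "cache_strategy": "none",            # anything but "loaded"
    })
    iterator = DataIterator(
        data,
        batching_strategy=TokenCountBatchingStrategy(max_tokens=1024))
    for batch in iterator:                   # fails as described below
        ...
    ```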

    What happens under the hood (a toy illustration follows these steps):

    • The source is wrapped in a _CachedDataSource to support random access by prefetching.
    • All raw examples are read from the source when the dataset is constructed.
    • Since loaded raw examples are not cached, each raw example is deleted from _CachedDataSource once it is accessed.
    • When iteration starts, the data iterator calls reset() on _CachedDataSource. This results in the wrapped data source being unnecessarily iterated over again, which could be problematic if the source is an IterDataSource that cannot be re-iterated.
    • Dynamic batching strategies require access to the processed examples to decide whether each example should be added to the batch. This access erases the corresponding raw examples from _CachedDataSource.
    • The PyTorch data loader then tries to access the processed examples, but they're already deleted. This results in an exception.
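
    The heart of the failure is the erase-on-access behavior. A toy version (not the actual _CachedDataSource code) shows why two consumers of the same index break:

    ```python
    class EraseOnAccessCache:
        """Toy stand-in for _CachedDataSource with caching disabled."""

        def __init__(self, iterable):
            self._it = iter(iterable)
            self._cache = {}
            self._max_index = -1

        def __getitem__(self, index):
            while self._max_index < index:   # prefetch up to `index`
                self._max_index += 1
                self._cache[self._max_index] = next(self._it)
            return self._cache.pop(index)    # erased on first access

    source = EraseOnAccessCache(["a", "b", "c"])
    source[0]  # the batching strategy peeks at the example
    source[0]  # the data loader fetches it again -> KeyError
    ```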

    This case was not covered in our tests, simply because the data module admits too many configuration combinations. The current design requires many checks and special handling of corner cases, and is not extensible to new features.

    A possible new design could be:

    1. For lazy loading and caching of loaded raw examples, instead of writing the logic in both the iterator and the dataset, isolate it into a third class. This class could be another wrapper over the source (see the sketch after this list).
    2. We might be able to do the same for lazy processing and caching processed examples.
    3. Some of the current design was made to work around certain limitations of the PyTorch data loader. We might as well rewrite more parts of the data loader, since we already do so to preserve compatibility with multiple PyTorch versions.
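
    As an illustration of point 1, the loading-and-caching logic could live in a single wrapper class. Names and details below are hypothetical, not the actual texar-pytorch API:

    ```python
    from typing import Generic, Iterable, Iterator, List, TypeVar

    T = TypeVar("T")

    class CachingSourceWrapper(Generic[T]):
        """Owns prefetching and raw-example caching, so neither the
        dataset nor the iterator special-cases these combinations."""

        def __init__(self, source: Iterable[T], cache_loaded: bool):
            self._source = source
            self._cache_loaded = cache_loaded
            self._examples: List[T] = []
            self._iter: Iterator[T] = iter(source)
            self._exhausted = False

        def __getitem__(self, index: int) -> T:
            # Prefetch sequentially to emulate random access.
            while not self._exhausted and index >= len(self._examples):
                try:
                    self._examples.append(next(self._iter))
                except StopIteration:
                    self._exhausted = True
            return self._examples[index]

        def reset(self) -> None:
            # Re-iterate the underlying source only if loaded examples
            # were not kept; with the logic in one place, this check no
            # longer needs to be duplicated in the dataset and iterator.
            if not self._cache_loaded:
                self._examples.clear()
                self._iter = iter(self._source)
                self._exhausted = False
    ```

    This also sidesteps the unnecessary re-iteration on reset() described in the bug above, since a wrapper that kept its loaded examples can simply serve them again.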
