Parallel vocab construction for torchtext.experimental.datasets.translation
Edit the raw.translation dataset to return a RawTextIterableDataset. When a DataLoader is given an instance of RawTextIterableDataset, the dataset uses worker information to restrict the underlying iterator to a subset, so that the DataLoader does not return duplicate entries.
This requires RawTextIterableDataset to take a function that returns an iterator, rather than an iterator itself, since a new iterator needs to be constructed in each Python process within DataLoader (the code uses the term "lazy" to describe this).
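A minimal sketch of the idea, not the code in this PR: the class LazyShardedLines, its shard method, and read_lines are hypothetical names used only for illustration. It assumes a simple round-robin split, where worker k takes items k, k + n, k + 2n, and so on. Inside a real DataLoader worker, the worker id and worker count would come from torch.utils.data.get_worker_info() rather than being passed in by hand.

```python
from itertools import islice


class LazyShardedLines:
    """Hypothetical stand-in for RawTextIterableDataset (illustration only)."""

    def __init__(self, iterator_fn):
        # Store a factory instead of an iterator: each worker process calls
        # it to build its own fresh iterator ("lazy" construction).
        self.iterator_fn = iterator_fn

    def shard(self, worker_id, num_workers):
        # Assumed round-robin split: worker k yields items k, k + n, k + 2n, ...
        # so the shards are disjoint and together cover every item once.
        return islice(self.iterator_fn(), worker_id, None, num_workers)


def read_lines():
    # Toy data source (assumption); a real one would open the dataset file.
    return iter(["a b", "b c", "c d", "d e", "e f"])


dataset = LazyShardedLines(read_lines)
# Simulate two workers; each one builds its own iterator from the factory.
shards = [list(dataset.shard(k, 2)) for k in range(2)]
```

Passing a single shared iterator instead of a factory would not work here, because each worker needs to start reading from the beginning before skipping to its own items.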
This iterable dataset is then passed into an instance of DataLoader that uses a given collate_fn to tokenize and return a Counter with all token counts for a given batch of data (lines of text).
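The counting step can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: count_tokens and merge_counts are hypothetical names, str.split stands in for the real tokenizer, and the batches are built by hand here where a DataLoader would call the collate_fn on each batch.

```python
from collections import Counter


def count_tokens(batch):
    # Assumed collate_fn: takes a batch (list of text lines) and returns
    # one Counter of token counts. Whitespace split replaces the tokenizer.
    counts = Counter()
    for line in batch:
        counts.update(line.split())
    return counts


def merge_counts(batch_counters):
    # Sum the per-batch Counters into a single Counter for the vocabulary.
    total = Counter()
    for batch_counts in batch_counters:
        total.update(batch_counts)
    return total


lines = ["a b", "b c", "c b"]
# Hand-made batches of size 2; a DataLoader would produce these in workers.
batch_counters = [count_tokens(lines[i:i + 2]) for i in range(0, len(lines), 2)]
vocab_counts = merge_counts(batch_counters)
```

Returning a Counter per batch keeps the tokenization work inside the worker processes, so the main process only has to add up small dictionaries.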
data_construction.py was modified to restrict itself to the translation datasets for this PR.
This branch (40 threads)
$ numactl --membind 0 --cpubind 0 python benchmark/data_construction.py 2> /dev/null
Multi30k construction time 4.32s
IWSLT construction time 15.89s
WMT14 construction time 246.84s
Master
$ numactl --membind 0 --cpubind 0 python benchmark/data_construction.py 2> /dev/null
Multi30k construction time 7.46s
IWSLT construction time 73.21s
WMT14 construction time 2551.39s
spaCy changed its API and dropped support for the shortcut ("en"). Will send a PR to fix it.
Closing, as the changed files don't exist anymore.