Parallel vocab construction for torchtext.experimental.datasets.translation

Open cpuhrsch opened this issue 5 years ago • 1 comments

Edit the raw.translation datasets to return a RawTextIterableDataset. When a DataLoader is given an instance of RawTextIterableDataset, the dataset uses worker information to restrict the underlying iterator to a subset, so that the DataLoader won't return duplicate entries.

This requires RawTextIterableDataset to take a function that returns an iterator, rather than an iterator itself, since a new iterator needs to be constructed in each Python process within DataLoader (the code uses the term "lazy" to describe this).
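A minimal sketch of the idea, not the code in this PR: the class `LazyShardedLines`, its method `iter_shard`, and the helper `make_lines` are hypothetical names, and the worker id and worker count are passed in explicitly so the example runs without a DataLoader. Inside a real DataLoader worker those two values would come from `torch.utils.data.get_worker_info()`.

```python
from itertools import islice
from typing import Callable, Iterator


class LazyShardedLines:
    """Hypothetical stand-in for RawTextIterableDataset (illustration only)."""

    def __init__(self, make_iterator: Callable[[], Iterator[str]]):
        # Store the factory, not an iterator: every worker process has to
        # build its own fresh iterator over the raw data.
        self.make_iterator = make_iterator

    def iter_shard(self, worker_id: int = 0, num_workers: int = 1) -> Iterator[str]:
        # Worker k keeps items k, k + n, k + 2n, ... so the shards are
        # disjoint and together cover the whole dataset.
        return islice(self.make_iterator(), worker_id, None, num_workers)


def make_lines() -> Iterator[str]:
    # Assumed toy data; a real dataset would open and read the corpus here.
    return iter(["a b", "c d", "e f", "g h", "i j"])


dataset = LazyShardedLines(make_lines)
# Simulate two workers, each asking for its own shard.
shards = [list(dataset.iter_shard(w, 2)) for w in range(2)]
```

Because each call to `iter_shard` invokes the factory again, no iterator state is shared between workers, which is the property the "lazy" construction is meant to provide.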

This iterable dataset is then passed into an instance of DataLoader that uses a given collate_fn to tokenize and return a Counter with all token counts for a given batch of data (lines of text).
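The counting step can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `count_tokens` is a hypothetical name for the `collate_fn`, whitespace splitting stands in for the real tokenizer, and the batches are written out by hand instead of coming from a DataLoader.

```python
from collections import Counter
from typing import List


def count_tokens(batch: List[str]) -> Counter:
    """Hypothetical collate_fn: tokenize a batch of lines and count tokens."""
    counter: Counter = Counter()
    for line in batch:
        # Assumption: whitespace tokenization replaces the real tokenizer.
        counter.update(line.split())
    return counter


# Hand-written batches standing in for what a DataLoader would yield.
batches = [["the cat sat", "the dog"], ["a cat"]]

# The main process merges the per-batch counters into one global count,
# from which a vocabulary could then be built.
total: Counter = Counter()
for batch in batches:
    total.update(count_tokens(batch))
```

Returning a Counter per batch keeps the tokenization work inside the worker processes; the main process only has to merge small dictionaries.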

data_construction.py was modified to restrict itself to the translation datasets for this PR.

This branch (40 threads)

$ numactl --membind 0 --cpubind 0 python benchmark/data_construction.py 2> /dev/null
Multi30k construction time 4.32s
IWSLT construction time 15.89s
WMT14 construction time 246.84s

Master

$ numactl --membind 0 --cpubind 0 python benchmark/data_construction.py 2> /dev/null
Multi30k construction time 7.46s
IWSLT construction time 73.21s
WMT14 construction time 2551.39s

cpuhrsch avatar Sep 29 '20 01:09 cpuhrsch

spaCy changed its API and dropped support for the shortcut ("en"). Will send a PR to fix it.

zhangguanheng66 avatar Feb 02 '21 16:02 zhangguanheng66

closing as changed files don't exist anymore

rshraga avatar Mar 14 '23 18:03 rshraga