[WIP] Simplifications and code formatting for experimental text classification datasets
There are five generic functions introduced in the current code:

- `vocab_func` - returns a function that calls `__getitem__` on each entry of a given list using a particular vocab object. This could be replaced with `map(vocab, my_list)` if vocab supported the call operator. Seems like an artifact of legacy code that hopefully won't exist in experimental.
- `totensor` - effectively returns `torch.tensor`. I think this is unnecessary.
- `ngrams_func` - we have `ngrams_iterator` already.
- `build_vocab` - implements `build_vocab_from_iterator(transforms(txt) for (_, txt) in data)`.
- `sequential_transforms` - effectively `functools.reduce`; similar to `nn.Sequential` for `nn.Module`s, and torchvision has a similar composition function as well.

I'm not sure we need to keep any of these, and they don't represent a lot of code, so they were removed. A sketch of what each one effectively does is shown below.
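For reference, here is a minimal sketch of what each of these helpers effectively does. The names come from the PR; the bodies are my approximation and may differ from the actual implementations.

```python
import torch
from torchtext.data.utils import ngrams_iterator
from torchtext.vocab import build_vocab_from_iterator


def vocab_func(vocab):
    # Look up every token in a list; equivalent to map(vocab, tokens)
    # if the vocab object supported the call operator.
    def func(tok_list):
        return [vocab[tok] for tok in tok_list]
    return func


def totensor(dtype):
    # Effectively a curried torch.tensor.
    def func(ids_list):
        return torch.tensor(ids_list).to(dtype)
    return func


def ngrams_func(ngrams):
    # Thin wrapper around the existing ngrams_iterator.
    def func(token_list):
        return list(ngrams_iterator(token_list, ngrams))
    return func


def build_vocab(data, transforms):
    # build_vocab_from_iterator over the transformed text column.
    return build_vocab_from_iterator(transforms(txt) for (_, txt) in data)


def sequential_transforms(*transforms):
    # Compose transforms left to right, analogous to nn.Sequential.
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func
```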
Further, I'm using the following script to track the performance of dataset initialization:
import time

import torch
import torchtext
from torchtext.experimental.datasets import text_classification

for name, fn in text_classification.DATASETS.items():
    t0 = time.monotonic()
    data = fn()
    print(name + " - " + str(time.monotonic() - t0))
~~For that I discovered that we might want to change text_classification.DATASETS to point to the datasets contained in text_classification.py instead of raw/text_classification.py.~~ (Merged in https://github.com/pytorch/text/pull/775)
@zhangguanheng66 - let me know what you think about this and if you want to merge it
Using the small benchmark script I found that this PR gives us a significant speedup. I didn't characterize this precisely (it's running on my laptop), but given the size of the difference I don't think we need higher-resolution instrumentation. This is for dataset creation only.
| Dataset | master (s) | up1 (s) |
| --- | --- | --- |
| AG_NEWS | 6.12 | 6.30 |
| SogouNews | 199.48 | 158.73 |
| DBpedia | 38.39 | 29.76 |
| YelpReviewPolarity | 70.71 | 59.01 |
| YelpReviewFull | 86.49 | 69.66 |
| YahooAnswers | 146.79 | 114.91 |
| AmazonReviewPolarity | 319.45 | 237.31 |
| AmazonReviewFull | 293.20 | 198.39 |
| IMDB | 24.15 | 19.54 |
I suspect (but didn't verify) that this is because the transforms during creation are set up to avoid some materializations (e.g. by consuming ngrams_iterator lazily).
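A toy illustration of the kind of materialization being avoided (this is not the dataset code itself, just a sketch of the eager vs. lazy pattern):

```python
from torchtext.data.utils import ngrams_iterator
from torchtext.vocab import build_vocab_from_iterator

tokens_per_doc = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# Eager: every document's ngrams are expanded into an intermediate list first.
materialized = [list(ngrams_iterator(toks, 2)) for toks in tokens_per_doc]
vocab_eager = build_vocab_from_iterator(materialized)

# Lazy: the ngram generators are consumed directly while counting tokens,
# so no intermediate per-document lists are allocated.
vocab_lazy = build_vocab_from_iterator(
    ngrams_iterator(toks, 2) for toks in tokens_per_doc
)
```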
I assume in both cases you didn't count the time to download/unzip files.
@zhangguanheng66 - no, the underlying archives were already downloaded and extracted. Creating the raw datasets under those conditions is very fast, so this extra time is mostly spent building the vocab etc.
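For completeness, a hypothetical variant of the benchmark script could time only the raw datasets to separate the (fast) raw creation from the vocab-building cost; this assumes the raw module exposes a DATASETS mapping analogous to the non-raw one:

```python
# Hypothetical sketch: time raw dataset creation only, assuming
# torchtext.experimental.datasets.raw.text_classification exposes a
# DATASETS mapping like the non-raw text_classification module.
import time

from torchtext.experimental.datasets.raw import text_classification as raw_text_classification

for name, fn in raw_text_classification.DATASETS.items():
    t0 = time.monotonic()
    raw_data = fn()
    print(name + " (raw) - " + str(time.monotonic() - t0))
```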