[WIP] Simplifications and code formatting for experimental text classification datasets

Open cpuhrsch opened this issue 4 years ago • 4 comments

There are five generic functions introduced in the current code:

- vocab_func - returns a function that calls __getitem__ on each entry of a given list using a particular vocab object. This could be replaced with map(vocab, my_list) if vocab supported the call operator. It seems like an artifact of legacy code that hopefully won't exist in experimental.
- totensor - effectively returns torch.tensor. I think this is unnecessary.
- ngrams_func - we already have ngrams_iterator.
- build_vocab - implements build_vocab_from_iterator(transforms(txt) for (_, txt) in data).
- sequential_transforms - effectively functools.reduce; similar to nn.Sequential for nn.Modules, and torchvision has a similar composition function as well.
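
To make the overlap concrete, here is a minimal, self-contained sketch (my own illustration, not code from this PR) of how each helper reduces to existing primitives; the toy data, the basic_english tokenizer, and the ngram size of 2 are placeholder assumptions:

```python
import functools

import torch
from torchtext.data.utils import get_tokenizer, ngrams_iterator
from torchtext.vocab import build_vocab_from_iterator

data = [(0, "the cat sat"), (1, "the dog barked")]  # toy (label, text) pairs
tokenizer = get_tokenizer("basic_english")

# sequential_transforms(f, g, ...) is just left-to-right function composition.
def sequential_transforms(*transforms):
    return lambda x: functools.reduce(lambda acc, f: f(acc), transforms, x)

# ngrams_func is already covered by ngrams_iterator.
transform = sequential_transforms(
    tokenizer, lambda tokens: list(ngrams_iterator(tokens, 2))
)

# build_vocab(data, transforms) is build_vocab_from_iterator over transformed text.
vocab = build_vocab_from_iterator(transform(txt) for (_, txt) in data)

# vocab_func is per-token __getitem__; totensor is just torch.tensor.
tokens = transform("the cat sat")
ids = torch.tensor([vocab[tok] for tok in tokens], dtype=torch.long)
print(ids)
```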

I don't know if we need to keep any of these; since they don't amount to much code, they were removed.

Further, I'm using the following script to track the performance of dataset initialization:

```python
import time

from torchtext.experimental.datasets import text_classification

# Time construction of each experimental text classification dataset.
for name, fn in text_classification.DATASETS.items():
    t0 = time.monotonic()
    data = fn()
    print(name + " - " + str(time.monotonic() - t0))
```

~~For that I discovered that we might want to change text_classification.DATASETS to point to the datasets contained in text_classification.py instead of raw/text_classification.py.~~ (Merged in https://github.com/pytorch/text/pull/775)

cpuhrsch avatar Apr 22 '20 05:04 cpuhrsch

@zhangguanheng66 - let me know what you think about this and if you want to merge it

cpuhrsch avatar Apr 24 '20 00:04 cpuhrsch

Using the small benchmark script I found that this PR gives us a significant speedup. I didn't characterize this precisely (it's running on my laptop), but given the order of magnitude of the differences I don't think we need higher-resolution instrumentation. This is for dataset creation only.

Timings in seconds:

| Dataset | master | up1 |
| --- | --- | --- |
| AG_NEWS | 6.124168223000001 | 6.304364777 |
| SogouNews | 199.483479047 | 158.734018144 |
| DBpedia | 38.38662358899998 | 29.756586211000013 |
| YelpReviewPolarity | 70.71134889599998 | 59.01243050400001 |
| YelpReviewFull | 86.49273644800002 | 69.66327271199998 |
| YahooAnswers | 146.78631064299992 | 114.90884431500001 |
| AmazonReviewPolarity | 319.45054842900004 | 237.30875152599998 |
| AmazonReviewFull | 293.204584815 | 198.39240907499993 |
| IMDB | 24.152421628999946 | 19.537461151000002 |

I suspect (but didn't verify) that this is because the transforms during creation are set up to avoid some materializations (e.g. ngrams_iterator).
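
To illustrate the kind of materialization I mean, here is a hedged sketch of the hypothesis (not the PR's actual code path; the toy lines and tokenizer are placeholders): feeding a generator to build_vocab_from_iterator lets ngrams_iterator stream n-grams into the counter instead of building per-line lists first.

```python
from torchtext.data.utils import get_tokenizer, ngrams_iterator
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
raw_lines = ["the cat sat", "the dog barked"]  # stand-in for a raw dataset

# Eager: per-line n-gram lists are materialized before the vocab sees them.
materialized = [list(ngrams_iterator(tokenizer(line), 2)) for line in raw_lines]
vocab_eager = build_vocab_from_iterator(materialized)

# Lazy: n-grams are generated on the fly while the vocab counts them.
vocab_lazy = build_vocab_from_iterator(
    ngrams_iterator(tokenizer(line), 2) for line in raw_lines
)
assert len(vocab_eager) == len(vocab_lazy)
```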

cpuhrsch avatar May 15 '20 02:05 cpuhrsch

> Using the small benchmark script I found that this PR gives us a significant speedup. […]

I assume in both cases you didn't count the time to download/unzip files.

zhangguanheng66 avatar May 15 '20 13:05 zhangguanheng66

@zhangguanheng66 - no, the underlying archives were already downloaded and extracted. Creating the raw datasets under those conditions is very fast, so the extra time is mostly spent building the vocab, etc.

cpuhrsch avatar May 15 '20 15:05 cpuhrsch