Guanheng George Zhang

Results 42 comments of Guanheng George Zhang

> This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and have something akin to these special symbols predefined as properties in...

LGTM. Please let me know if you need a final review to merge the PR. Thanks.

cc @bentrevett Please review the PR and let us know if you have any other suggestions.

> @zhangguanheng66 as a note, this PR is able to download files correctly and set up the dataset just fine. But, it takes a very long time to create the dataset...

> @zhangguanheng66 any thoughts on overloading the `__iter__` method for language modeling? Ideally, the `iter` method should be handled by `DataLoader`, rather than torchtext. We want to eventually retire those...

It seems that there is an even larger Wikitext dataset, like this one: https://dl.fbaipublicfiles.com/fairseq/data/wikipedia.en_filtered.gz Any thoughts?

> Reading from the [website](http://www.statmt.org/wmt11/translation-task.html#download), 2009 is the largest dataset.
>
> > * From Europarl (403MB) md5 sha1
> > * From the News Commentary corpus (41MB) md5 sha1...

> Is the idea to have multiple functions for different years' datasets or provide an argument for the year? Correct me if I'm wrong @anmolsjoshi @cpuhrsch, I don't think...

Maybe we could add one more argument (as you did for language) so users can explicitly choose the one they like. And in the docs, we clearly mark the number...

@anmolsjoshi We want to decouple the vocab object from the dataset but are not yet sure about the design. I will work on some cases and pull you guys in for a look.