Guanheng George Zhang comments

Results: 42 comments of Guanheng George Zhang

Remove unk hardcode

> This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and have something akin to these special symbols predefined as properties in...

Remove unk hardcode

LGTM. Please let me know if you need a final review to merge the PR. Thanks.

Remove <unk> token and index from experimental Vocab

cc @bentrevett Please review the PR and let us know if you have any other suggestions.

Added WMT News Crawl Dataset for language modeling

> @zhangguanheng66 as a note, this PR is able to download files correctly and setup the dataset just fine. But, it takes a very long time to create the dataset...

Added WMT News Crawl Dataset for language modeling

> @zhangguanheng66 any thoughts on overloading the `__iter__` method for language modeling? Ideally, the `__iter__` method should be handled by `DataLoader`, rather than torchtext. We want to eventually retire those...

Added WMT News Crawl Dataset for language modeling

It seems that there is an even larger Wikitext-style dataset, like this one: https://dl.fbaipublicfiles.com/fairseq/data/wikipedia.en_filtered.gz Any thoughts?

Added WMT News Crawl Dataset for language modeling

> Reading from the [website](http://www.statmt.org/wmt11/translation-task.html#download), 2009 is the largest dataset.
>
> > * From Europarl (403MB) md5 sha1
> > * From the News Commentary corpus (41MB) md5 sha1...

Added WMT News Crawl Dataset for language modeling

> Is the idea to have multiple functions for different years' datasets or provide an argument for the year? Correct me if I'm wrong @anmolsjoshi @cpuhrsch , I don't think...

Added WMT News Crawl Dataset for language modeling

Maybe we could add one more argument (as you did for language) so users can explicitly choose the one they like. And in the docs, we clearly mark the number...
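To make the suggestion concrete, here is a minimal sketch of a dataset selector that takes an explicit year argument alongside the language. Everything in it is an assumption made for illustration: the function name `select_news_crawl`, the `SUPPORTED_YEARS` and `SUPPORTED_LANGUAGES` tuples, the defaults, and the returned key format are hypothetical and are not torchtext's API.

```python
# Hypothetical sketch: the names, supported values, and key format below are
# illustrative assumptions, not the actual torchtext interface.
SUPPORTED_YEARS = (2009, 2010, 2011)
SUPPORTED_LANGUAGES = ("en", "de")


def select_news_crawl(year=2010, language="en"):
    """Return an identifier for the requested News Crawl split.

    Raises ValueError when the caller asks for a year or language that
    this sketch does not list, so the choice is always explicit.
    """
    if year not in SUPPORTED_YEARS:
        raise ValueError(
            f"year must be one of {SUPPORTED_YEARS}, got {year!r}"
        )
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"language must be one of {SUPPORTED_LANGUAGES}, got {language!r}"
        )
    # A real implementation would map this key to a download URL and checksum.
    return f"news-crawl-{year}-{language}"
```

A single function with a validated `year` argument keeps the public surface small compared with one function per year, at the cost of having to document the size of each year's data in one place.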

Added WMT News Crawl Dataset for language modeling

@anmolsjoshi We want to de-couple the vocab object from the dataset but are not yet sure about the design. I will work on some cases and pull you in for a look.
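One possible shape for that de-coupling, offered only as a rough sketch since the design was still open: the vocabulary is built by a standalone function from any token iterable, and the dataset step just applies a lookup it is handed. The helper names `build_vocab` and `numericalize`, the whitespace tokenization, and the choice to skip out-of-vocabulary tokens are all assumptions made for illustration, not the design that was adopted.

```python
from collections import Counter


def build_vocab(tokens, min_freq=1):
    """Build a token-to-index mapping without any reference to a dataset.

    Tokens are ordered by descending frequency, then alphabetically, so the
    resulting indices are deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in ordered if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(kept)}


def numericalize(lines, stoi):
    """Yield one list of token ids per line, using a caller-supplied vocab.

    Assumption for this sketch: tokens missing from the vocab are skipped
    rather than mapped to a special unknown-token index.
    """
    for line in lines:
        yield [stoi[tok] for tok in line.split() if tok in stoi]


# The vocab is built once from raw text, then passed to the dataset step.
raw_lines = ["a b a", "b c a"]
vocab = build_vocab(tok for line in raw_lines for tok in line.split())
ids = list(numericalize(raw_lines, vocab))
```

Because the dataset step only receives a mapping, the same vocabulary could be reused across train and test splits or swapped for a pretrained one without touching the dataset code.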