Anmol Joshi
@zhangguanheng66 @cpuhrsch I have incorporated changes requested in an earlier review and made some additional changes. Here is a summary:
- Removed any code related to extract_archive fix and moved...
According to the [website](http://www.statmt.org/wmt11/translation-task.html#download), the 2009 dataset is the largest.
> - From Europarl (403MB) md5 sha1
> - From the News Commentary corpus (41MB) md5 sha1
> - From the...
Is the idea to have a separate function for each year's dataset, or a single function that takes the year as an argument?
Should I update the current dataset to 2009? Which other datasets would you want to provide?
@zhangguanheng66 thanks for the comments. I've added an option that lets users pass the year, and I've also added a table to the docstrings with details about the news crawl datasets by year....
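For illustration, here is a minimal sketch of what a year argument could look like. This is not the code in the PR: `news_crawl_url`, `SUPPORTED_YEARS`, and `URL_TEMPLATE` are hypothetical names, the set of years is assumed, and the URL is a placeholder rather than the real statmt.org layout.

```python
# Hypothetical sketch only: none of these names come from torchtext or this PR.
SUPPORTED_YEARS = (2007, 2008, 2009, 2010, 2011)  # assumed set of available years

# Placeholder URL pattern, not the actual download location.
URL_TEMPLATE = "https://example.org/news-crawl/{year}.tgz"


def news_crawl_url(year=2010):
    """Return the (placeholder) archive URL for the requested year.

    Raises ValueError when the year is not in SUPPORTED_YEARS, so a typo
    fails early instead of triggering a download of a missing file.
    """
    if year not in SUPPORTED_YEARS:
        raise ValueError(
            "year must be one of {}, got {!r}".format(SUPPORTED_YEARS, year)
        )
    return URL_TEMPLATE.format(year=year)
```

A single function with a validated `year` argument keeps the API surface small compared with one function per year, and the docstring table can list what each year contains.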
@zhangguanheng66 I saw the discussion in #691 and #690 and the code in #696. Is there value in decoupling vocab and LanguageModelingDataset as well?
Thanks! Let me know if any other changes are needed on this PR!