Anmol Joshi comments

Results: 17 comments of Anmol Joshi

Added WMT News Crawl Dataset for language modeling

@zhangguanheng66 @cpuhrsch I have incorporated the changes requested in an earlier review and made some additional changes. Here is a summary:

- Removed any code related to extract_archive fix and moved...

Added WMT News Crawl Dataset for language modeling

Reading from the [website](http://www.statmt.org/wmt11/translation-task.html#download), 2009 is the largest dataset.

> - From Europarl (403MB) md5 sha1
> - From the News Commentary corpus (41MB) md5 sha1
> - From the...

Added WMT News Crawl Dataset for language modeling

Is the idea to have a separate function for each year's dataset, or to provide a single function that takes the year as an argument?

Added WMT News Crawl Dataset for language modeling

Should I update the current dataset to 2009? Which other datasets would you want to provide?

Added WMT News Crawl Dataset for language modeling

@zhangguanheng66 thanks for the comments. I've added an option that lets users pass the year, and a table in the docstrings with details about the news crawl datasets by year....
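
To illustrate the year-argument approach described in this comment, here is a minimal sketch. It is not the code from the PR: `news_crawl_url`, `URL_TEMPLATE`, and `SUPPORTED_YEARS` are hypothetical names, the URL is a placeholder, and the set of years is assumed purely for the example.

```python
# Illustrative sketch only; names and URL are hypothetical, not torchtext's API.
URL_TEMPLATE = "https://example.com/news-crawl/{year}.tar.gz"  # placeholder URL
SUPPORTED_YEARS = (2009, 2010, 2011)  # assumed set of years, for illustration


def news_crawl_url(year: int = 2009) -> str:
    """Return the (placeholder) archive URL for the given News Crawl year."""
    if year not in SUPPORTED_YEARS:
        # Fail early with a clear message instead of attempting a bad download.
        raise ValueError(
            f"Unsupported year {year}; expected one of {SUPPORTED_YEARS}"
        )
    return URL_TEMPLATE.format(year=year)


if __name__ == "__main__":
    print(news_crawl_url(2009))
```

A single function with a validated `year` argument keeps one code path for every dataset, at the cost of needing a lookup table (or docstring table) that tells users which years are available.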

Added WMT News Crawl Dataset for language modeling

@zhangguanheng66 I saw the discussion in #691 and #690 and the code in #696. Is there value in decoupling vocab and LanguageModelingDataset as well?

Added WMT News Crawl Dataset for language modeling

Thanks! Let me know if any other changes are needed on this PR!