meena-chatbot

Where is the source file on the nlpl page exactly?

Open: tgmerritt opened this issue 2 years ago • 1 comment

The notebook references http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.it.gz as the source. When I visit the linked opus.nlpl.eu page, I see a grid with a bunch of LANG.xml.gz files, but I cannot find a file like the one above for any language other than Italian. Can you link me to the exact page where I can find alternatives to the Italian file, so that I can train the model with a different data source?

tgmerritt, Jul 16 '21 19:07

https://opus.nlpl.eu/OpenSubtitles-v2018.php is the page with all the conversational datasets provided by OpenSubtitles. Look at the first row of the second table, which lists the monolingual plain text files (tokenized).

frankplus, Jul 17 '21 08:07
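
For anyone who would rather script this than click through the table, here is a minimal sketch of building the download URL for another language. It assumes (not confirmed in this thread) that the other monolingual files follow the same URL pattern as the Italian one referenced in the notebook, with only the language code changed. The names `mono_url` and `download_and_extract` are hypothetical helpers, not part of the notebook. Check the code for your language against the table on the OpenSubtitles-v2018 page before relying on it.

```python
import gzip
import re
import shutil
import urllib.request

# Pattern taken from the Italian URL quoted in this issue.
# ASSUMPTION: other languages use the same path with a different code.
BASE_URL = (
    "http://opus.nlpl.eu/download.php"
    "?f=OpenSubtitles/v2018/mono/OpenSubtitles.{lang}.gz"
)


def mono_url(lang: str) -> str:
    """Build the assumed download URL for a language code such as 'it'."""
    # Accept only letters and underscores so nothing else leaks into the URL.
    if not re.fullmatch(r"[A-Za-z_]+", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    return BASE_URL.format(lang=lang)


def download_and_extract(lang: str, dest: str) -> None:
    """Stream the gzipped corpus for `lang` and write it decompressed to `dest`."""
    with urllib.request.urlopen(mono_url(lang)) as resp, \
            gzip.GzipFile(fileobj=resp) as gz, \
            open(dest, "wb") as out:
        shutil.copyfileobj(gz, out)


if __name__ == "__main__":
    # Prints the URL only; call download_and_extract() to fetch the file.
    print(mono_url("en"))
```

If the assumed pattern holds, swapping the URL in the notebook for the output of `mono_url("<code>")` should be the only change needed to train on a different language.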