chatbot-rnn
Parse Reddit Corpus in zip format
There are a few folks seeding the entirety of Reddit, but the Reddit Corpus project provides archives of individual subreddits. This gives you the very useful ability to train in a particular domain. Here is a small example: dadjokes2.corpus.zip
The only problem is that these archives are not in the format your reddit_parse.py expects. Each one is a .zip bundle of five JSON files:
- users.json
- conversations.json
- corpus.json
- index.json
- utterances.jsonl
What is the shortest path for converting this to usable training data?
Wes
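One possible starting point, as a rough sketch rather than a tested converter: read utterances.jsonl straight out of the zip and pair each comment with the comment it replies to. The function name `iter_reply_pairs` is hypothetical, and the field names `id`, `text`, and `reply_to` are assumptions about the utterance records; they are passed as parameters so they can be changed after inspecting a real line of the file.

```python
import io
import json
import zipfile

# Placeholder bodies Reddit uses for deleted or moderator-removed comments.
SKIP_TEXTS = {"[deleted]", "[removed]"}


def iter_reply_pairs(zip_path, id_key="id", text_key="text", parent_key="reply_to"):
    """Yield (parent_text, reply_text) pairs from utterances.jsonl in the archive.

    The three *_key defaults are assumptions about the record layout.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Find the utterances file wherever it sits inside the bundle.
        member = next(
            name for name in archive.namelist() if name.endswith("utterances.jsonl")
        )
        with archive.open(member) as raw:
            reader = io.TextIOWrapper(raw, encoding="utf-8")
            rows = [json.loads(line) for line in reader if line.strip()]

    # Index every utterance by id so each reply can look up its parent.
    texts = {row[id_key]: row.get(text_key, "") for row in rows}

    for row in rows:
        parent = texts.get(row.get(parent_key))
        reply = row.get(text_key, "")
        # Drop top-level posts, orphaned replies, and deleted content.
        if not parent or not reply:
            continue
        if parent in SKIP_TEXTS or reply in SKIP_TEXTS:
            continue
        yield parent, reply
```

Looping over `iter_reply_pairs("dadjokes2.corpus.zip")` would then give prompt/response text pairs. They would still need to be written out in whatever layout the training script reads, which this sketch does not attempt.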