
Parse Reddit Corpus in zip format

Open wmodes opened this issue 4 years ago • 0 comments

There are a few folks seeding the entirety of Reddit, but the Reddit Corpus project provides archives of individual subreddits, which makes it possible to train on a particular domain. Here is a small example: dadjokes2.corpus.zip

The only problem is that they are not in the format that your reddit_parse.py expects. Each archive is a .zip bundle of five JSON files:

  • users.json
  • conversations.json
  • corpus.json
  • index.json
  • utterances.jsonl
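
To make the format concrete, here is a rough sketch of reading utterances.jsonl straight out of the zip and pairing each comment with the comment it replies to. The function names are mine, and the key names `id`, `reply_to` and `text` are assumptions about the utterance records rather than something I have checked against every archive. It stops at (parent, reply) text pairs and does not produce whatever reddit_parse.py wants as input.

```python
import io
import json
import zipfile


def iter_utterances(zip_path, member_suffix="utterances.jsonl"):
    """Yield one dict per non-empty line of utterances.jsonl inside the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        # The file may sit inside a top-level folder, so match by suffix.
        name = next(n for n in zf.namelist() if n.endswith(member_suffix))
        with zf.open(name) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


def reply_pairs(utterances, id_key="id", parent_key="reply_to", text_key="text"):
    """Return (parent_text, reply_text) pairs.

    The default key names are assumptions; override them if the
    records in a given archive use different ones.
    """
    items = list(utterances)
    by_id = {u[id_key]: u for u in items}
    pairs = []
    for u in items:
        parent = by_id.get(u.get(parent_key))
        if parent is not None:
            pairs.append((parent[text_key], u[text_key]))
    return pairs
```

Usage would be something like `reply_pairs(iter_utterances("dadjokes2.corpus.zip"))`. If the records use a different key for the parent link (for example `reply-to`), pass it as `parent_key`.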

What is the shortest path for converting this to usable training data?

Wes

wmodes Nov 08 '20 22:11