chatbot-rnn Parse Reddit Corpus in zip format

Open wmodes opened this issue 4 years ago • 0 comments

A few people are seeding archives of the whole of Reddit, but the Reddit Corpus project provides archives of individual subreddits, which makes it possible to train on a particular domain. Here is a small example: dadjokes2.corpus.zip

The only problem is that they are not in the format your reddit_parse.py expects. Each archive is a .zip bundle of five files (four JSON, one JSON Lines):

users.json
conversations.json
corpus.json
index.json
utterances.jsonl
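
To make the layout concrete, here is a rough sketch of how the comments could be pulled out of such a bundle without unpacking it first. It assumes utterances.jsonl holds one JSON object per line, and the key names it reads (id, reply_to or reply-to, speaker or user, text) are my guesses at the corpus schema, not something I have verified against every archive. The helper names iter_utterances and to_comment are made up for this example, and the output keys in to_comment are only an assumption about what a Reddit-dump-style parser might want.

```python
import io
import json
import zipfile


def iter_utterances(zip_path, member_suffix="utterances.jsonl"):
    """Yield one dict per non-empty line of the utterances file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # The file may sit at the top level or inside a folder, so match by suffix.
        names = [n for n in archive.namelist() if n.endswith(member_suffix)]
        if not names:
            raise FileNotFoundError(f"no {member_suffix} in {zip_path}")
        with archive.open(names[0]) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


def to_comment(utt):
    """Map an utterance to comment-style keys (all key names are assumptions)."""
    return {
        "id": utt.get("id"),
        # Fall back to the hyphenated spelling if "reply_to" is absent.
        "parent_id": utt.get("reply_to", utt.get("reply-to")),
        "author": utt.get("speaker", utt.get("user")),
        "body": utt.get("text", ""),
    }
```

Something like `[to_comment(u) for u in iter_utterances("dadjokes2.corpus.zip")]` would then give a flat list of comments with parent links, which still leaves the question of how to feed that to the existing parser.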

What is the shortest path for converting this to usable training data?

Wes

Nov 08 '20 22:11 wmodes
