
Reddit

Open taisazero opened this issue 2 years ago • 5 comments

Programming & Computing Sub-Reddits

Dataset URL - awesome list of programming subreddits; Code Pile Spreadsheet; another list of programming subreddits. Thanks to @ncoop57!

Does the dataset exist in a scraped format?

No, we need to format them into a dialogue format.

Description

Obtain data from the Pushshift Reddit dumps for 2009-2022 using wget or plain HTTP requests, then filter for programming-related subreddits.
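A minimal sketch of how the download list could be generated, assuming the dumps are published as one file per month. The base URL, the `RC_` prefix, and the `.zst` extension are assumptions for illustration (older dumps may use a different compression format), and `monthly_dump_urls` is a hypothetical helper, not an existing script in this repo.

```python
# Assumed base URL and naming scheme; verify against the actual dump listing.
BASE_URL = "https://files.pushshift.io/reddit"


def monthly_dump_urls(start_year, end_year, kind="comments", prefix="RC", ext="zst"):
    """Return one URL per month for the inclusive year range."""
    urls = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # e.g. <BASE_URL>/comments/RC_2021-01.zst
            urls.append(f"{BASE_URL}/{kind}/{prefix}_{year}-{month:02d}.{ext}")
    return urls
```

The resulting list can be written to a text file and passed to `wget -i`, or fetched with any HTTP client.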

Procedure

  • [x] Obtain data from Pushshift Reddit for the years 2006-2022. We probably need to write a script that issues wget requests for the data dumps.
  • [x] Store data dump on a GCP Bucket.
  • [ ] Create three tables (authors, submissions, and comments) in BigQuery from the GCP bucket.
  • [ ] Merge posts with reply chains and author metadata (specifically bio)
  • [ ] (Optionally) Filter for long dialogue chains following OPT
  • [x] Process Reddit threads (posts and replies) into a conversational form using this script
  • [x] Filter for programming subreddits in the list of subreddits. Then we process non-programming subreddits and programming subreddits separately.
  • [x] Process into output format {"text": string, "meta": obj}
  • [ ] Run MinHash deduplication
  • [x] Run lm_format script
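The subreddit filter and the `{"text": string, "meta": obj}` output step could look roughly like the sketch below. The thread dictionary keys (`text`, `subreddit`, `thread_id`) and the function names are assumptions for illustration; the actual processing script may structure this differently.

```python
import json


def to_record(text, meta):
    """Wrap a processed thread in the {"text": ..., "meta": ...} output format."""
    return {"text": text, "meta": meta}


def split_and_serialize(threads, programming_subreddits):
    """Split threads into programming and non-programming JSON lines.

    `threads` is assumed to be an iterable of dicts with the keys
    "text", "subreddit", and "thread_id" (hypothetical schema).
    """
    # Case-insensitive membership test against the subreddit list.
    allowed = {name.lower() for name in programming_subreddits}
    programming, other = [], []
    for thread in threads:
        meta = {"subreddit": thread["subreddit"], "thread_id": thread["thread_id"]}
        line = json.dumps(to_record(thread["text"], meta))
        if thread["subreddit"].lower() in allowed:
            programming.append(line)
        else:
            other.append(line)
    return programming, other
```

Keeping the two groups in separate lists mirrors the plan to process programming and non-programming subreddits separately.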

Final Data Format inside text

[Context]:
	"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
	using deep learning with SGD to design the learning algorithms of another deep network   *

Extra Contexts:
	[context/2]:
		Could someone there post a summary of the insightful moments.
	[context/1]:
		Basically L2L is the new deep learning.
	[context/0]:
		What's "L2L" mean?

Other features:
	[context_author]:
		goodside
	[response_author]:
		NetOrBrain
	[subreddit]:
		MachineLearning
	[thread_id]:
		5h6yvl
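For reference, a sketch of how one example could be rendered into the layout above. It assumes `extra_contexts[i]` holds the text for `context/i` and that entries are printed in descending index order, as in the sample; `render_example` is a hypothetical name, not the function used in the actual script.

```python
def render_example(context, response, extra_contexts, features):
    """Render one context/response pair into the tab-indented text layout."""
    lines = [
        "[Context]:",
        f"\t{context}",
        "[Response]:",
        f"\t{response}",
        "",
        "Extra Contexts:",
    ]
    # Print context/N first and context/0 last, matching the sample.
    for idx in range(len(extra_contexts) - 1, -1, -1):
        lines.append(f"\t[context/{idx}]:")
        lines.append(f"\t\t{extra_contexts[idx]}")
    lines += ["", "Other features:"]
    # Features keep the insertion order of the dict passed in.
    for key, value in features.items():
        lines.append(f"\t[{key}]:")
        lines.append(f"\t\t{value}")
    return "\n".join(lines)
```

The returned string would go into the `text` field of the output record.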

taisazero, Sep 15 '22 03:09