Code-Pile
Programming & Computing Sub-Reddits
Dataset URL:
- awesome list of programming subreddits
- Code Pile Spreadsheet
- Another list of programming subreddits

Thanks to @ncoop57!
Does the dataset exist in a scraped format?
No. The raw data still needs to be converted into a dialogue format.
Description
Obtain Reddit data dumps from Pushshift via wget/HTTP requests, covering 2009-2022, and filter for programming-related subreddits.
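As a rough illustration of the download step, the sketch below enumerates one URL per monthly dump so that a script can hand them to wget or an HTTP client. The base URL, the `RC_`/`RS_` file prefixes, the `.zst` extension, and the helper names `month_range` and `dump_urls` are assumptions made for this example and are not confirmed by this document; older dumps may use a different compression format.

```python
# Assumption: base URL and file naming are illustrative placeholders.
BASE_URL = "https://files.pushshift.io/reddit"


def month_range(start_year, end_year):
    """Yield (year, month) for every month from start_year through end_year, inclusive."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield year, month


def dump_urls(kind, start_year, end_year, ext="zst"):
    """Build one URL per monthly dump; kind is 'comments' or 'submissions'."""
    # Assumption: comment dumps are prefixed RC_, submission dumps RS_.
    prefix = {"comments": "RC", "submissions": "RS"}[kind]
    return [
        f"{BASE_URL}/{kind}/{prefix}_{year}-{month:02d}.{ext}"
        for year, month in month_range(start_year, end_year)
    ]


if __name__ == "__main__":
    # Print the first few comment-dump URLs for the 2009-2022 range.
    for url in dump_urls("comments", 2009, 2022)[:3]:
        print(url)
```

The year range is a parameter, so the same helper covers whichever span is finally chosen.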
Procedure
- [x] Obtain data from Pushshift Reddit for the years 2006-2022. This likely requires a script that issues wget requests for the data dumps.
- [x] Store data dump on a GCP Bucket.
- [ ] Create 3 tables (authors, submissions, and comments) in BigQuery from the GCP Bucket.
- [ ] Merge posts with reply chains and author metadata (specifically bio)
- [ ] (Optional) Filter for long dialogue chains, following OPT
- [x] Process Reddit threads (posts and replies) into a conversational form using this script
- [x] Filter for programming subreddits using the list of subreddits, then process programming and non-programming subreddits separately.
- [x] Process into the output format `{"text": string, "meta": obj}`
- [ ] Run dedup Min-Hash
- [x] Run the `lm_format` script
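The subreddit filter and the output-format step above can be sketched as follows. This is a minimal illustration, not the project's actual script: the thread dictionaries, the case-insensitive matching, and the helper names `split_by_subreddit`, `to_record`, and `to_jsonl_line` are assumptions introduced here. Only the `{"text": ..., "meta": ...}` record shape comes from the procedure.

```python
import json


def split_by_subreddit(threads, programming_subreddits):
    """Split threads into (programming, other) by subreddit name.

    Assumption: each thread is a dict with a 'subreddit' key, and names
    are compared case-insensitively.
    """
    allowed = {name.lower() for name in programming_subreddits}
    programming, other = [], []
    for thread in threads:
        if thread["subreddit"].lower() in allowed:
            programming.append(thread)
        else:
            other.append(thread)
    return programming, other


def to_record(text, meta):
    """Wrap a rendered dialogue and its metadata in the output format."""
    return {"text": text, "meta": meta}


def to_jsonl_line(record):
    """Serialize one record as a single JSON line (no trailing newline)."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    threads = [
        {"subreddit": "MachineLearning", "thread_id": "5h6yvl"},
        {"subreddit": "aww", "thread_id": "abc123"},  # made-up example row
    ]
    prog, other = split_by_subreddit(threads, ["machinelearning", "python"])
    print(to_jsonl_line(to_record("[Context]: ...", prog[0])))
```

Keeping the two groups in separate lists mirrors the note that programming and non-programming subreddits are processed separately.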
Final Data Format inside `text`
[Context]:
"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
using deep learning with SGD to design the learning algorithms of another deep network *
Extra Contexts:
[context/2]:
Could someone there post a summary of the insightful moments.
[context/1]:
Basically L2L is the new deep learning.
[context/0]:
What's "L2L" mean?
Other features:
[context_author]:
goodside
[response_author]:
NetOrBrain
[subreddit]:
MachineLearning
[thread_id]:
5h6yvl
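To make the layout above concrete, here is a small sketch that renders one context/response pair into that `text` layout. The function name `render_text`, its parameters, and the choice of a single newline between lines are assumptions for illustration; the real script may differ in whitespace and in how the extra contexts are ordered. The sketch simply emits `[context/i]` from the highest index down to 0, as in the example.

```python
def render_text(context, response, extra_contexts, features):
    """Render one dialogue sample into the bracketed text layout.

    extra_contexts[i] is emitted under the label [context/i], highest
    index first. features is a dict such as {"subreddit": "..."}.
    """
    lines = ["[Context]:", context, "[Response]:", response]
    if extra_contexts:
        lines.append("Extra Contexts:")
        for i in range(len(extra_contexts) - 1, -1, -1):
            lines.append(f"[context/{i}]:")
            lines.append(extra_contexts[i])
    if features:
        lines.append("Other features:")
        for key, value in features.items():
            lines.append(f"[{key}]:")
            lines.append(str(value))
    # Assumption: lines are joined with a single newline.
    return "\n".join(lines)


if __name__ == "__main__":
    text = render_text(
        context="Basically L2L is the new deep learning.",
        response="What's \"L2L\" mean?",
        extra_contexts=[],
        features={"subreddit": "MachineLearning", "thread_id": "5h6yvl"},
    )
    print(text)
```

The returned string is what would go into the `text` field of an output record, with the same features also available for `meta`.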