Open-Assistant
Scraping Reddit dumps
Reddit could provide a good source of training data, especially since the tree-like structure allows for multiple continuations of a conversation, which is amenable to ranking. Probably not every subreddit will be ideal; most will just result in "general conversations", but there might be some that are essentially in instruction-reply or question-answer form (like r/whatisthisthing).
- [ ] come up with an initial list of promising subreddits that would result in good training data for OpenAssistant
- [ ] write a parser that takes in a reddit dump and extracts conversations as trees
From Christoph:
Basically the idea is: we have a graph with one root and many branches and leaves.
- parse the graph from the JSONs
- get the paths from the root to the leaves that have the most upvotes & make plain text from them (we should not take all of them, because then the parts near the root would have high repetition); see the sketch below

Dumps: https://files.pushshift.io/reddit/comments/
Sample: https://files.pushshift.io/reddit/comments/sample_data.json
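A minimal sketch of that path-extraction idea, assuming pushshift-style dicts (the submission has `name`, `title`, `score`; each comment has `id`, `parent_id`, `score`, `body`) and Reddit's `t1_`/`t3_` fullname prefixes. The helper and field choices are illustrative, not the final parser:

```python
from collections import defaultdict


def best_paths(submission: dict, comments: list[dict], top_n: int = 3) -> list[list[str]]:
    """Return the top_n root-to-leaf paths ranked by summed upvotes.

    Assumes pushshift-style dicts: the submission has `name` (fullname,
    e.g. "t3_..."), `title` and `score`; each comment has `id`,
    `parent_id`, `score` and `body`.
    """
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment)

    paths: list[tuple[int, list[str]]] = []

    def walk(node_name: str, path: list[str], score: int) -> None:
        replies = children.get(node_name, [])
        if not replies:  # leaf reached: record the finished path
            paths.append((score, path))
            return
        for reply in replies:
            walk("t1_" + reply["id"], path + [reply["body"]], score + reply["score"])

    walk(submission["name"], [submission["title"]], submission.get("score", 0))
    # keep only the best-scoring paths so text near the root is not repeated too often
    return [path for _, path in sorted(paths, key=lambda sp: sp[0], reverse=True)[:top_n]]
```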
I think r/NoStupidQuestions, r/AskReddit, r/answers, r/ExplainLikeImFive and r/AskScience are really good for collecting this kind of data.
If this issue is not assigned to anyone, I would like to work on it.
I am also available to pick this one up, @SriPrarabdha. We could also work together?
Hey, thanks a lot :) I've assigned both of you, feel free to work separately or together.
Remember, we're mainly interested in the scraping and parsing code and some instructions on how to run it all. We have infrastructure to do the data collection and storage, so not really a need on your side to do that part, it's really more about how to obtain and handle the data.
@Proteusiq that sounds great! How do you want to get started with this?
I have tomorrow. I could start with a prototype and add snippets here and we can see how to go about. What say you?
Yeah for sure👍
Path to getting data: I have tested with Postman; we can use requests or httpx sessions.
GET, e.g.
https://api.pushshift.io/reddit/search/submission?subreddit=whatisthisthing&size=10
Data can be gathered in time buckets with the before and after params. I will upload a code snippet tomorrow.
:exclamation: API params
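A rough sketch of that time-bucket idea, assuming the pushshift `before` parameter accepts epoch seconds and that each returned item carries `created_utc`; the function name and defaults are placeholders:

```python
import time

import httpx

BASE = "https://api.pushshift.io/reddit/search/submission"


def fetch_in_buckets(subreddit: str, pages: int = 5, size: int = 100):
    """Walk backwards in time using the `before` parameter."""
    before = int(time.time())
    for _ in range(pages):
        params = {"subreddit": subreddit, "size": size, "before": before}
        data = httpx.get(BASE, params=params, timeout=60).json().get("data", [])
        if not data:
            break
        yield from data
        # the next page starts just before the oldest submission we got back
        before = min(item["created_utc"] for item in data)


# usage:
# for submission in fetch_in_buckets("whatisthisthing"):
#     print(submission["title"])
```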
can both of you DM me somehow? discord, twitter, all good :) makes coordination easier
Alrighty👍
@SriPrarabdha can you collect an initial list of subreddits?
I've already shared some of the subreddits that we can use and will update if I find some new ones.
These ones:
r/NoStupidQuestions
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience
?
Yeah, these ones.
I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?
upload here or discord.
do you have code for this somewhere in a fork?
I have put together the code and a JSON file in this repo: https://github.com/SriPrarabdha/Reddit-Scrapper. But the main problem is that parsing one post on a subreddit with 15K comments took around 25 minutes, so even scraping one subreddit completely will take a long time.
@SriPrarabdha I think you are onto something. We can always make the scraper faster. An update on https://api.pushshift.io/reddit/comments/:
import pandas as pd
from httpx import Client

HEADERS = {"User-Agent": "Prayson W. Daniel <[email protected]>"}
BASE_URI = "https://api.pushshift.io/reddit"

timeout = 60  # seconds

# query parameters (the params dict was missing in the original snippet;
# assembled here from the variables it defines)
subreddit = "whatisthisthing"
size = 10
score = 20
num_comments = 10  # has no effect
params = {
    "subreddit": subreddit,
    "size": size,
    "score": score,
    "num_comments": num_comments,
}

with Client(base_url=BASE_URI, headers=HEADERS) as request:
    print("Fetching submissions")
    s = request.get(url="/search/submission", params=params, timeout=timeout)

    print("Fetching comments")
    _ids = ",".join(item.get("id") for item in s.json().get("data"))
    params.update({"ids": _ids})
    c = request.get(url="/search/comment", params=params, timeout=timeout)

# Return only the needed columns with `fields`
# merge the submissions to the comments
datac = pd.DataFrame(c.json().get("data"))
datas = pd.DataFrame(s.json().get("data"))
I will try downloading the files instead from https://files.pushshift.io.
They are huge: RC 2022-10 => 23.8 GB and RS => 9.5 GB.
@yk and @SriPrarabdha: Update on the files: it is possible to get the data offline. I downloaded RC and RS files for testing. This is where I am:
import json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path

import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


DATA_DIR = Path("../data")

submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
submission_blobs = map(json.loads, submission_objects)

subreddit = "whatisthisthing"
num_comments = 10

# working on finding a faster or better way to do this
datas_gen = (
    blob
    for blob in submission_blobs
    if blob["subreddit"] == subreddit and blob["num_comments"] >= num_comments
)

data = pd.DataFrame(datas_gen)
The idea is to get the ids and questions from the submissions and their comments from the comments file, merge them, group by id, and order by reply time on the comments.
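A minimal sketch of that merge/group-by step, assuming two pandas DataFrames shaped like the pushshift dumps: `datas` (submissions with `name` and `title`) and `datac` (comments with `link_id`, `body`, `created_utc`). The column choices are assumptions, not the final schema:

```python
import pandas as pd


def to_conversations(datas: pd.DataFrame, datac: pd.DataFrame) -> dict[str, list[str]]:
    """Merge comments onto their submissions and order replies by time.

    Assumes pushshift-style columns: submissions carry `name` (the fullname,
    e.g. "t3_...") and `title`; comments carry `link_id` (the submission
    fullname), `body`, and `created_utc`.
    """
    merged = datac.merge(
        datas[["name", "title"]],
        left_on="link_id",
        right_on="name",
        how="inner",
    )
    merged = merged.sort_values("created_utc")
    # one conversation per submission: title followed by replies in arrival order
    return {
        frame.iloc[0]["title"]: frame["body"].tolist()
        for _, frame in merged.groupby("link_id")
    }
```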
looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
Guys, do you need help speeding up parsing? I can step in and try to help you.
Parsing is not needed since the data is already JSON (a Python dictionary); what takes time is accessing the pieces we need. Have you worked with hyperjson or orjson?
@yk Yes, we can make a beautiful CLI wrapper. What I have now are just prototypes
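For example, a minimal Typer sketch; the flag names, defaults, and the `extract` entry point are placeholders to be wired to the filtering/matching code from the prototypes:

```python
from pathlib import Path

import typer

app = typer.Typer(help="Extract conversation trees from pushshift dumps.")


@app.command()
def extract(
    data_dir: Path = typer.Option(Path("./data"), help="Folder holding the RS_*/RC_* .zst dumps"),
    subreddit: str = typer.Option("whatisthisthing", help="Subreddit to extract"),
    num_comments: int = typer.Option(10, help="Minimum comments per submission"),
    output: Path = typer.Option(Path("conversations.json"), help="Where to write the result"),
):
    """Placeholder command: wire this up to the filtering/matching code."""
    typer.echo(
        f"Reading dumps from {data_dir} for r/{subreddit} "
        f"(>= {num_comments} comments), writing to {output}"
    )
    # ... call the filtering/matching code here ...


if __name__ == "__main__":
    app()
```

With a single command, Typer exposes it directly, e.g. `python reddit_cli.py --subreddit AskScience --data-dir ./data` (the file name is hypothetical).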
Actually, I didn't have a chance to work with these libraries. But it's never too late to learn something new.
Also, what kind of trees do you want to build from the JSON representations?
Something like:
id "ABC", submission: "What happened to Batman?"
In comments, we fetch comments where id = "ABC"
sort the comments by time of reply
id "ABC", submission: "What happened to Batman?" Time 10:30
id "ABC", comment: "Because Catwoman happened" Time 10:45
id "ABC", comment: "No way" Time 10:46
So we have replay as they come in. The tree is from submission -> earliers_comments
Sometimes the comments can branch out to others own comments ...
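To make that shape concrete, a small sketch with toy records mimicking the Batman example above; the field names follow the pushshift dumps, the values are made up:

```python
from collections import defaultdict

# toy records shaped like the pushshift dumps (illustrative values only)
submission = {"name": "t3_ABC", "title": "What happened to Batman?"}
comments = [
    {"id": "c1", "parent_id": "t3_ABC", "body": "Because Catwoman happened", "created_utc": 1045},
    {"id": "c2", "parent_id": "t1_c1", "body": "No way", "created_utc": 1046},
]

# index replies by their parent, keeping them in the order they came in
children = defaultdict(list)
for comment in sorted(comments, key=lambda c: c["created_utc"]):
    children[comment["parent_id"]].append(comment)


def print_tree(node_name: str, depth: int = 0) -> None:
    for reply in children[node_name]:
        print("  " * depth + reply["body"])
        print_tree("t1_" + reply["id"], depth + 1)


print(submission["title"])
print_tree(submission["name"])
```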
Updates: using a generator allows me to keep calling and stopping in Jupyter. Getting submissions is fast, but matching them to comments takes forever.
# instead of json
import orjson as json

...

break_point = 100
datas_list = []
for blob in submission_blobs:
    if break_point < 0:
        break
    if blob["subreddit"] == subreddit and blob["num_comments"] >= num_comments:
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

ids = set(b.get("id") for b in datas_list)
print(f"number of ids: {len(ids)}")

com_objects = smart_open(DATA_DIR / "RC_2022-10.zst")
blobc = map(json.loads, com_objects)

## just to see how long it takes to get 10 matches :(
break_point = 10
datac_list = []
for blob in blobc:
    if blob["subreddit"] != subreddit:
        continue
    if break_point < 0:
        break
    print(".", end="")
    if blob["id"] in ids:
        print("X", end="")
        break_point -= 1
        datac_list.append(blob)

...
Could be I am matching on the wrong things. Maybe in the comments I need parent_id. I will keep on searching.
I can write the multiprocessing version of this, which can speed up matching; just attach the full file with code.
Super! I got it working now. In submissions I needed "name", and in comments "parent_id".
Notes: the prints are just for debugging and need to be removed.
Full code
import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path

import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


DATA_DIR = Path("../data")

submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
comment_objects = smart_open(DATA_DIR / "RC_2022-10.zst")

submission_blobs = map(json.loads, submission_objects)
comment_blobs = map(json.loads, comment_objects)

# params
subreddit = "whatisthisthing"
num_comments = 10

# get 101 submissions with num_comments >= 10
break_point = 100
datas_list = []
for blob in submission_blobs:
    if break_point < 0:
        break
    if blob["subreddit"] == subreddit and blob["num_comments"] >= num_comments:
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

# get the ids
ids = set(b.get("name") for b in datas_list)
print(f"we have {len(ids)} unique ids")

# this takes long just to get 10
break_point = 10
datac_list = []
for blob in comment_blobs:
    if blob["subreddit"] != subreddit:
        continue
    if break_point < 0:
        break
    if blob["parent_id"] in ids:
        print(".", end="")
        break_point -= 1
        datac_list.append(blob)

# merging of data ...
From a previous project of mine I have all the Reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as JSON. The code I have is originally adapted from DialoGPT's reddit extractor; it may be helpful to give it a look. https://github.com/microsoft/DialoGPT
That would be perfect 😍: looks like we are reinventing the wheel https://github.com/microsoft/DialoGPT/blob/master/reddit_extractor/src/reddit.py
import asyncio
from asyncio.events import AbstractEventLoop
from collections.abc import Generator
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from io import TextIOWrapper
from itertools import tee
from pathlib import Path

import orjson as json
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


def filter_submissions(submission_blobs, subreddit, num_comments):
    # get 101 submissions with num_comments >= 10
    break_point = 100
    datas_list = []
    for blob in submission_blobs:
        if break_point < 0:
            break
        if blob["subreddit"] == subreddit and blob["num_comments"] >= num_comments:
            print(".", end="")
            break_point -= 1
            datas_list.append(blob)

    # get the ids
    ids = set(b.get("name") for b in datas_list)
    print(f"we have {len(ids)} unique ids")
    return ids


# this takes long just to get 10
def matching(comments_chunk, ids, subreddit):
    break_point = 10
    datac_list = []
    for blob in comments_chunk:
        if blob["subreddit"] != subreddit:
            continue
        if break_point < 0:
            break
        if blob["parent_id"] in ids:
            print(".", end="")
            break_point -= 1
            datac_list.append(blob)
    return datac_list


def generate_chunk(iterable, chunk_len=100):
    # yield full chunks, plus the final partial chunk
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


async def main(ids, subreddit):
    with ProcessPoolExecutor() as process_pool:
        loop: AbstractEventLoop = asyncio.get_running_loop()
        calls = [
            partial(matching, comment_chunk, ids, subreddit)
            for comment_chunk in generate_chunk(comment_blobs_copy)
        ]
        call_coros = []
        for call in calls:
            call_coros.append(loop.run_in_executor(process_pool, call))

        results = await asyncio.gather(*call_coros)

        merged_result = []
        for chunk_result in results:
            merged_result += chunk_result
        return merged_result


if __name__ == "__main__":
    DATA_DIR = Path("./data")  # Path("../data")

    # NOTE: submissions normally come from the RS_* dumps and comments from RC_*;
    # adjust the file names accordingly
    submission_objects, comment_objects, comment_objects_copy = tee(
        smart_open(DATA_DIR / "RC_2009-04.zst"), 3
    )
    submission_blobs = map(json.loads, submission_objects)
    comment_blobs = map(json.loads, comment_objects)
    comment_blobs_copy = map(json.loads, comment_objects_copy)

    # params
    subreddit = "whatisthisthing"
    num_comments = 10

    ids = filter_submissions(submission_blobs, subreddit, num_comments)
    matched_comments = asyncio.run(main(ids, subreddit))
    print(matched_comments)