
Scraping Reddit dumps

yk opened this issue 2 years ago • 51 comments

Reddit could provide a good source of training data, especially since its tree-like structure allows for multiple continuations of a conversation, which is amenable to ranking. Probably not every subreddit will be ideal; most will just result in "general conversations", but there might be some that are essentially in instruction-reply or question-answer form (like r/whatisthisthing).

  • [ ] come up with an initial list of promising subreddits that would result in good training data for OpenAssistant
  • [ ] write a parser that takes in a reddit dump and extracts conversations as trees

From Christoph:

Basically the idea is: we have a graph with one root and many branches and leaves.

  1. parse the graph from the JSON dumps
  2. get the root-to-leaf paths that have the most upvotes and turn them into plain text (we should not take all paths, because then the parts near the root would be repeated a lot); a minimal sketch follows below

Dumps: https://files.pushshift.io/reddit/comments/ (sample: https://files.pushshift.io/reddit/comments/sample_data.json)
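
A minimal sketch (not from the issue) of steps 1 and 2, assuming the dump has been decompressed to one JSON object per line with the usual pushshift comment fields (id, parent_id, score, body); top-level comments have parent_id equal to the submission fullname (t3_…), and the greedy highest-score walk is just one way to avoid repeating the text near the root:

```python
import json
from collections import defaultdict
from pathlib import Path


def build_children_index(comment_file: Path) -> dict[str, list[dict]]:
    """Index comments by the fullname of their parent (t3_* for top level, t1_* below)."""
    children: dict[str, list[dict]] = defaultdict(list)
    for line in comment_file.open(encoding="utf-8"):
        comment = json.loads(line)
        children[comment["parent_id"]].append(comment)
    return children


def best_paths(children: dict[str, list[dict]], submission_fullname: str) -> list[list[str]]:
    """For each top-level comment, greedily follow the highest-scored reply down to a leaf."""
    paths = []
    for top in children.get(submission_fullname, []):
        node, path = top, [top["body"]]
        while replies := children.get(f"t1_{node['id']}", []):
            node = max(replies, key=lambda c: c.get("score", 0))
            path.append(node["body"])
        paths.append(path)
    return paths


if __name__ == "__main__":
    children = build_children_index(Path("RC_sample.jsonl"))  # hypothetical decompressed dump
    for path in best_paths(children, "t3_abc123"):            # hypothetical submission fullname
        print("\n---\n".join(path))
```

Keeping only one path per top-level comment trades some coverage for less duplication; other policies (e.g. top-k paths per subtree) would work just as well.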

yk avatar Dec 23 '22 08:12 yk

I think r/NoStupidQuestions, r/AskReddit, r/answers, r/ExplainLikeImFive and r/AskScience are really good for collecting this kind of data

SriPrarabdha avatar Dec 26 '22 06:12 SriPrarabdha

If this issue is not assigned to anyone, I would like to work on it

SriPrarabdha avatar Dec 26 '22 06:12 SriPrarabdha

I am also available to pick this one up, @SriPrarabdha. We could also work together?

Proteusiq avatar Dec 26 '22 15:12 Proteusiq

Hey, thanks a lot :) I've assigned both of you, feel free to work separately or together.

Remember, we're mainly interested in the scraping and parsing code and some instructions on how to run it all. We have the infrastructure for data collection and storage, so there's no real need on your side to do that part; it's really more about how to obtain and handle the data.

yk avatar Dec 26 '22 16:12 yk

@Proteusiq that sounds great! How do you want to get started with this?

SriPrarabdha avatar Dec 26 '22 17:12 SriPrarabdha

@Proteusiq that sounds great! How do you want to get started with this?

I have time tomorrow. I could start with a prototype and add snippets here, and we can see how to go about it. What say you?

Proteusiq avatar Dec 26 '22 18:12 Proteusiq

Yeah for sure👍

@Proteusiq that sounds great! How do you want to get started with this?

I have time tomorrow. I could start with a prototype and add snippets here, and we can see how to go about it. What say you?

SriPrarabdha avatar Dec 27 '22 12:12 SriPrarabdha

Path to getting the data: I have tested this with Postman; we can use requests or httpx sessions.

GET e.g.

https://api.pushshift.io/reddit/search/submission?subreddit=whatisthisthing&size=10

Data can be gathered in time buckets with the before and after params. I will upload a code snippet tomorrow.

❗ API params
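
A minimal sketch (mine, not the promised snippet) of that time-bucketed collection, assuming the before/after parameters take epoch seconds and each returned item carries created_utc:

```python
import httpx

BASE_URI = "https://api.pushshift.io/reddit"


def fetch_window(subreddit: str, after: int, before: int, size: int = 100) -> list[dict]:
    """Collect submissions in [after, before) by moving `before` down to the oldest item seen."""
    results: list[dict] = []
    with httpx.Client(base_url=BASE_URI, timeout=60) as client:
        while True:
            resp = client.get(
                "/search/submission",
                params={"subreddit": subreddit, "after": after, "before": before, "size": size},
            )
            data = resp.json().get("data", [])
            if not data:
                break
            results.extend(data)
            oldest = min(item["created_utc"] for item in data)
            if oldest >= before:  # no progress (e.g. identical timestamps), stop to avoid looping
                break
            before = oldest  # next request covers the older remainder of the window
    return results
```

Walking the window newest-to-oldest like this keeps every request independent and needs no offset parameter.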

Proteusiq avatar Dec 27 '22 15:12 Proteusiq

can both of you DM me somehow? discord, twitter, all good :) makes coordination easier

yk avatar Dec 27 '22 15:12 yk

can both of you DM me somehow? discord, twitter, all good :) makes coordination easier

Alrighty👍

SriPrarabdha avatar Dec 27 '22 17:12 SriPrarabdha

@SriPrarabdha can you collect an initial list of subreddits?

Proteusiq avatar Dec 27 '22 18:12 Proteusiq

I've already shared some of the subreddits that we can use and will update if I find new ones

SriPrarabdha avatar Dec 27 '22 19:12 SriPrarabdha

These ones:

r/NoStupidQuestions 
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience

?

Proteusiq avatar Dec 27 '22 19:12 Proteusiq

Yeah, these ones

These ones:

r/NoStupidQuestions 
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience

?

SriPrarabdha avatar Dec 28 '22 15:12 SriPrarabdha

I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?

SriPrarabdha avatar Dec 28 '22 15:12 SriPrarabdha

I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?

Upload it here or on Discord.

Do you have the code for this somewhere in a fork?

yk avatar Dec 28 '22 16:12 yk

I have put together the code and the JSON file in this repo: https://github.com/SriPrarabdha/Reddit-Scrapper. But the main problem is that parsing one post on a subreddit with 15k comments took around 25 minutes, so even scraping one subreddit completely will take a long time.

SriPrarabdha avatar Dec 29 '22 05:12 SriPrarabdha

@SriPrarabdha I think you are onto something. We can always make the scraper faster. Update on https://api.pushshift.io/reddit/comments/:

import pandas as pd
from httpx import Client

HEADERS = {"User-Agent": "Prayson W. Daniel <[email protected]>"}
BASE_URI = "https://api.pushshift.io/reddit"


timeout = 60  # seconds
subreddit = "whatisthisthing"
size = 10
score = 20
num_comments = 10  # has no effect

# query parameters shared by both requests
params = {"subreddit": subreddit, "size": size, "score": score, "num_comments": num_comments}

with Client(base_url=BASE_URI, headers=HEADERS) as request:

    print("Fetching submissions")
    s = request.get(url="/search/submission",
                    params=params,
                    timeout=timeout)

    print("Fetching comments")
    _ids = ",".join(item.get("id") for item in s.json().get("data"))
    params.update({"ids": _ids})
    c = request.get(url="/search/comment",
                    params=params,
                    timeout=timeout)
                    

# Return only needed columns with `fields`
# merge the submission to the comments

datac = pd.DataFrame(c.json().get('data'))
datas = pd.DataFrame(s.json().get('data'))

I will try downloading the files instead from https://files.pushshift.io.

They are huge: RC 2022-10 => 23.8 GB and the RS file => 9.5 GB.

Proteusiq avatar Dec 29 '22 12:12 Proteusiq

@yk and @SriPrarabdha: update on the files: it is possible to get the data offline. I downloaded RC and RS files for testing. This is where I am:

import json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import json
    from pathlib import Path
    
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
            
DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
submission_blobs = map(json.loads, submission_objects)

subreddit = "whatisthisthing"
num_comments = 10 

# working on finding a faster or better way to do this
datas_gen = (
    blob for blob in submission_blobs
    if blob["subreddit"] == subreddit and blob["num_comments"] >= num_comments
)

data = pd.DataFrame(datas_gen)

The idea is to get the ids and questions from the submissions dump and their comments from the comments dump, then merge, group by id, and order by reply time on the comments.
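
A rough sketch of that merge (not from the thread), assuming each comment carries link_id = "t3_" + submission id and both sides carry created_utc; the toy frames stand in for the filtered dumps:

```python
import pandas as pd

# hypothetical filtered submissions and comments
datas = pd.DataFrame([
    {"id": "abc", "title": "What happened to Batman?", "created_utc": 1000},
])
datac = pd.DataFrame([
    {"link_id": "t3_abc", "body": "Because Catwoman happened", "created_utc": 1045},
    {"link_id": "t3_abc", "body": "No way", "created_utc": 1046},
])

# join comments onto their submission and order replies by time within each thread
datac["submission_id"] = datac["link_id"].str[3:]  # strip the "t3_" prefix
threads = (
    datac.merge(datas, left_on="submission_id", right_on="id",
                suffixes=("_comment", "_submission"))
         .sort_values(["submission_id", "created_utc_comment"])
         .groupby("submission_id")
)
for sid, group in threads:
    print(group["title"].iloc[0])
    print(group["body"].tolist())
```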

Proteusiq avatar Dec 29 '22 13:12 Proteusiq

looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
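
A possible typer wrapper (just a sketch; `process` is a hypothetical entry point standing in for the filtering code above), with flags for the data location and the filter parameters:

```python
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def scrape(
    data_dir: Path = typer.Option(Path("./data"), help="Directory holding the RS_*/RC_* dumps"),
    subreddit: str = typer.Option("whatisthisthing", help="Subreddit to extract"),
    num_comments: int = typer.Option(10, help="Minimum number of comments per submission"),
    output: Path = typer.Option(Path("threads.jsonl"), help="Where to write the extracted trees"),
) -> None:
    """Extract conversation trees from local pushshift dumps."""
    typer.echo(f"Reading dumps from {data_dir}, filtering r/{subreddit} (>= {num_comments} comments)")
    # process(data_dir, subreddit, num_comments, output)  # hypothetical: plug in the extraction here


if __name__ == "__main__":
    app()
```

Invoked as e.g. `python scrape.py --data-dir ../data --subreddit AskScience --num-comments 10`.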

yk avatar Dec 29 '22 14:12 yk

Guys, do you need help speeding up parsing? I can step in and try to help you.

doroshroman avatar Dec 29 '22 14:12 doroshroman

Guys, do you need help speeding up parsing? I can step in and try to help you.

Parsing is not needed, since the data is already JSON (a Python dictionary); what we do need is a fast way to access the fields we care about. Have you worked with hyperjson or orjson?

@yk Yes, we can make a beautiful CLI wrapper. What I have now are just prototypes

Proteusiq avatar Dec 29 '22 14:12 Proteusiq

Parsing is not needed as the data is in JSON (python dictionary) but accessing what we need is needed. Have you worked with hyperjson or orjson?

Actually, I haven't had a chance to work with these libraries yet. But it's never too late to learn something new.

doroshroman avatar Dec 29 '22 14:12 doroshroman

Also, what kind of trees do you want to build from the JSON representations?

doroshroman avatar Dec 29 '22 14:12 doroshroman

Also, what kind of trees do you want to build from the JSON representations?

Something like: id "ABC", submission: "What happened to Batman?". In comments, we fetch the comments where id = "ABC" and sort them by time of reply:

 id "ABC", submission: "What happened to Batman?"  Time 10:30
 id "ABC", comment: "Because Catwoman happened" Time 10:45
 id "ABC", comment: "No way" Time 10:46

So we have the replies as they come in. The tree goes from submission -> earliest comments.

Sometimes the comments can branch out into their own sub-threads of comments ...
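
A small editorial sketch of that branching (field names follow the pushshift comment schema; the example data is made up), nesting each comment under its parent via parent_id instead of flattening by time:

```python
from collections import defaultdict


def nest(comments: list[dict], submission_fullname: str) -> list[dict]:
    """Turn a flat comment list into a tree of {"body": ..., "replies": [...]} nodes."""
    by_parent = defaultdict(list)
    for c in comments:
        by_parent[c["parent_id"]].append(c)

    def expand(parent: str) -> list[dict]:
        return [
            {"body": c["body"], "replies": expand(f"t1_{c['id']}")}
            for c in by_parent.get(parent, [])
        ]

    return expand(submission_fullname)


# made-up comments for the Batman example above
comments = [
    {"id": "c1", "parent_id": "t3_ABC", "body": "Because Catwoman happened"},
    {"id": "c2", "parent_id": "t3_ABC", "body": "No way"},
    {"id": "c3", "parent_id": "t1_c1", "body": "Source?"},
]
print(nest(comments, "t3_ABC"))
```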

Update: using generators allows me to keep calling and stopping in Jupyter. Getting submissions is fast, but matching them to comments takes forever.

# instead of json
import orjson as json
...

break_point = 100
datas_list = [] 
for blob in blobs:
    if break_point < 0:
        break
    
    if (blob["subreddit"] == subreddit and 
        blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)
 
ids = set(b.get("id") for b in datas_list)
print(f"number of unique ids: {len(ids)}")

com_objects = smart_open(DATA_DIR / "RC_2022-10.zst")
blobc = map(json.loads, com_objects)

## just to see how long it takes to get 10 matches :(
break_point = 10
datac_list = [] 
for blob in blobc:
    if blob["subreddit"] != subreddit:
        continue
    
    if break_point < 0:
        break
    print(".", end="")
    if blob["id"] in ids:
        print("X", end="")
        break_point -= 1
        datac_list.append(blob)
...

Could be I am matching on the wrong thing. Maybe in the comments I need parent_id. I will keep on searching.

Proteusiq avatar Dec 29 '22 18:12 Proteusiq

I can write a multiprocessing version of this, which could speed up matching; just attach the full file with the code.

doroshroman avatar Dec 29 '22 18:12 doroshroman

I can write a multiprocessing version of this, which could speed up matching; just attach the full file with the code.

Super! I got it working now. In submission, I needed "name", and in comments "parent_id"

Note: the prints are just for debugging and need to be removed.

Full code


import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path
    
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
            
DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
comment_objects = smart_open(DATA_DIR / "RC_2022-10.zst")

submission_blobs = map(json.loads, submission_objects)
comment_blobs = map(json.loads, comment_objects)

# params
subreddit = "whatisthisthing"
num_comments = 10 

# get 101 submissions with num_comments >= 10
break_point = 100
datas_list = [] 
for blob in submission_blobs:
    if break_point < 0:
        break
    
    if (blob["subreddit"] == subreddit and 
        blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

# get the ids
ids = set(b.get("name") for b in datas_list)
print(f"we have {len(ids)} unique ids"}

# this takes long just to get 10
break_point = 10
datac_list = [] 
for blob in comment_blobs:
    if blob["subreddit"] != subreddit:
        continue
    
    if break_point < 0:
        break
    if blob["parent_id"] in ids:
        print(".", end="")
        break_point -= 1
        datac_list.append(blob)

# merging of data ...

Proteusiq avatar Dec 29 '22 18:12 Proteusiq

From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT

danielpwarren avatar Dec 29 '22 19:12 danielpwarren

From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT

That would be perfect 😍: looks like we are reinventing the wheel https://github.com/microsoft/DialoGPT/blob/master/reddit_extractor/src/reddit.py

Proteusiq avatar Dec 29 '22 19:12 Proteusiq

import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen
import asyncio
from asyncio.events import AbstractEventLoop
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path
    
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


def filter_submissions(submission_blobs, subreddit, num_comments):
    # get up to 101 submissions from `subreddit` with at least `num_comments` comments
    break_point = 100
    datas_list = [] 
    for blob in submission_blobs:
        if break_point < 0:
            break
        
        if (blob["subreddit"] == subreddit and 
            blob["num_comments"] >= num_comments):
            print(".", end="")
            break_point -= 1
            datas_list.append(blob)

    # get the ids
    ids = set(b.get("name") for b in datas_list)
    print(f"we have {len(ids)} unique ids")
    
    return ids


#this takes long just to get 10
def matching(comments_chunk, ids, subreddit):
    break_point = 10
    datac_list = [] 
    for blob in comments_chunk:
        if blob["subreddit"] != subreddit:
            continue
        
        if break_point < 0:
            break
        if blob["parent_id"] in ids:
            print(".", end="")
            break_point -= 1
            datac_list.append(blob)
            
    return datac_list


def generate_chunk(iterable, chunk_len=100):
    """Yield lists of up to `chunk_len` items, including the final partial chunk."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


async def main(ids, subreddit):
    with ProcessPoolExecutor() as process_pool:
        loop: AbstractEventLoop = asyncio.get_running_loop()
        calls = [
            partial(matching, comment_chunk, ids, subreddit)
            for comment_chunk in generate_chunk(comment_blobs_copy)
        ]
        call_coros = []
        
        
        for call in calls:
            call_coros.append(loop.run_in_executor(process_pool, call))
            
        results = await asyncio.gather(*call_coros)
        
        merged_result = []
        for chunk_result in results:
            merged_result += chunk_result
            
    return merged_result


if __name__ == '__main__':
    DATA_DIR = Path("./data")  # Path("../data")
    # submissions live in the RS_* dump, comments in the RC_* dump
    # (RS file name assumed to follow the RS_/RC_ naming used earlier in the thread)
    submission_objects = smart_open(DATA_DIR / "RS_2009-04.zst")
    comment_objects = smart_open(DATA_DIR / "RC_2009-04.zst")

    submission_blobs = map(json.loads, submission_objects)
    comment_blobs_copy = map(json.loads, comment_objects)

    # params
    subreddit = "whatisthisthing"
    num_comments = 10
    
    ids = filter_submissions(submission_blobs, subreddit, num_comments)

    matched_comments = asyncio.run(main(ids, subreddit))
    print(matched_comments)
        

doroshroman avatar Dec 29 '22 19:12 doroshroman