Support for retrieving live data from Reddit rather than using the Pushshift data
Hi! Is there a way with snscrape to get the Reddit comment count per submission and the submission/comment score?
Thanks for this tool! It's so useful
No, unfortunately that's currently not possible. The Reddit scraper uses Pushshift because Reddit's own endpoints have ridiculous limitations (for example, there's a hard limit of 1000 submissions on subreddit/user lists, even through the API), and Pushshift can't have accurate counts/scores because it only updates the data periodically (if at all) after the initial fetch shortly after the submission/comment was published. Fetching this information from Reddit at scrape time isn't easy either: without authentication (#270), you'd quickly run into rate limits. So I'm not sure there is a good way to implement this.
I figured that might be the case because I was reading the rate limits and they're pretty tight.
I wonder if it's possible to look at what I've already scraped using snscrape, extract the IDs of specific posts of interest, and then go back to Reddit and programmatically scrape the comment count and score for those submissions/comments at a point in time using the naked API, reporting my results as "this is what it is at this point in time." But I suspect the rate limits would make this a looooong and painful endeavor.
Thank you for making the time to reply. Truly appreciate it.
Yeah, that would certainly be possible and is likely the most reasonable approach. The /api/info endpoint accepts 100 IDs at once, I believe, which should help with the rate limits.
This works, for example, to fetch the current data from Reddit for the 10 most recent submissions/comments on /r/ArchiveTeam:
import itertools
import requests
import snscrape.modules.reddit
scraper = snscrape.modules.reddit.RedditSubredditScraper('ArchiveTeam')
items = list(itertools.islice(scraper.get_items(), 10))
# /api/info expects fullnames: the bare ID prefixed with t1_ (comments) or t3_ (submissions)
ids = [f'{"t1" if isinstance(item, snscrape.modules.reddit.Comment) else "t3"}_{item.id}' for item in items]
# Reddit tends to block the default requests User-Agent, so send a descriptive one
r = requests.get(f'https://old.reddit.com/api/info.json?id={",".join(ids)}', headers = {'User-Agent': 'snscrape-live-data-example'})
print(r.status_code)
print(r.json())
Then there'd have to be some code for merging items with the Reddit data. Plus rate limit handling and some other things. PRAW would of course already handle all of that but requires authentication, I believe.
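That merging step could look something like the sketch below. The `score` and `num_comments` field names come from Reddit's /api/info response; `merge_live_data` and the plain-dict record shape are my own illustration, not part of snscrape:

```python
def merge_live_data(items, info_children):
    """Overlay live Reddit fields onto scraped records.

    `items` maps a fullname (e.g. 't3_abc') to a scraped record (a plain
    dict here for illustration); `info_children` is the `data.children`
    list from a /api/info.json response.
    """
    merged = {}
    for child in info_children:
        data = child['data']
        # Reconstruct the fullname from the listing's kind and bare ID
        fullname = f'{child["kind"]}_{data["id"]}'
        record = dict(items.get(fullname, {}))
        record['score'] = data.get('score')
        if child['kind'] == 't3':  # only submissions carry a comment count
            record['num_comments'] = data.get('num_comments')
        merged[fullname] = record
    return merged
```

Anything Pushshift returned that isn't overwritten here (title, selftext, etc.) survives in the record, which matches the "merge, don't replace" idea.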
I recently thought about this a bit more. The reasonable thing to do here is to use Pushshift for ID retrieval and then collect the data from Reddit, discarding everything from Pushshift. However, this should be optional.
So I'm turning this issue into a feature request for a new group of scrapers which use live rather than Pushshift data; not just for scores but also for anything else of interest that Reddit might return.
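For such a scraper, the batching side would be simple enough; here's a hypothetical `info_urls` helper (the name and shape are mine, not snscrape's) that turns a list of fullnames into the /api/info requests needed, 100 IDs per call as mentioned above:

```python
def info_urls(fullnames, batch_size=100):
    """Build one /api/info.json URL per batch of up to `batch_size` fullnames.

    `fullnames` must already carry the t1_/t3_ prefixes; `batch_size`
    defaults to 100, the per-request cap mentioned earlier in this thread.
    """
    base = 'https://old.reddit.com/api/info.json?id='
    return [
        base + ','.join(fullnames[i:i + batch_size])
        for i in range(0, len(fullnames), batch_size)
    ]
```

The actual scraper would still need rate-limit handling (pausing between batches, retrying on 429s) around the requests themselves.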