
Reddit scraper returns no submissions before 2022-11-03

Open dringel opened this issue 2 years ago • 1 comments

Describe the bug

When I scrape reddits for November 2022 onward, I get 11 fields: ['_type', 'author', 'body', 'date', 'id', 'parentId', 'subreddit', 'url', 'link', 'selftext', 'title']

When I scrape reddits before November 2022 (more than 3 months ago), I only get 8 fields: ['_type', 'author', 'body', 'date', 'id', 'parentId', 'subreddit', 'url']

I am missing the last 3 fields for reddits that date before November 2022: "title", "selftext", and "link"

When I scraped reddits last year (in early December), I got all 11 fields even for reddits from 10 years ago.

How to reproduce

October 2022 In:

import pandas as pd
!snscrape --json --progress reddit-search --before 1667275199 --after 1664596800 Searchterm > October.json
df = pd.read_json("October.json", lines=True)
df.columns.tolist()

Out: ['_type', 'author', 'body', 'date', 'id', 'parentId', 'subreddit', 'url']

Nov 2022 In:

!snscrape --json --progress reddit-search --before 1669870799 --after 1667275200 Searchterm > November.json
df = pd.read_json("November.json", lines=True)
df.columns.tolist()

Out: ['_type', 'author', 'body', 'date', 'id', 'parentId', 'subreddit', 'url', 'link', 'selftext', 'title']
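For anyone reproducing this, the --before and --after values above are Unix timestamps in seconds. The sketch below shows one way to compute bounds for a calendar month. It assumes UTC, which is not what the report used: the values above are offset by a few hours, apparently for a local time zone. The helper name month_bounds_utc is made up for illustration, and whether the bounds are treated as inclusive or exclusive by the scraper is not checked here.

```python
from datetime import datetime, timezone


def month_bounds_utc(year: int, month: int) -> tuple[int, int]:
    """Return (after, before) epoch seconds spanning one calendar month in UTC.

    Hypothetical helper for illustration; not part of snscrape.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # First day of the following month, rolling into the next year after December.
    next_start = datetime(year + month // 12, month % 12 + 1, 1, tzinfo=timezone.utc)
    # "before" is the last second of the requested month.
    return int(start.timestamp()), int(next_start.timestamp()) - 1


after, before = month_bounds_utc(2022, 10)
print(after, before)  # 1664582400 1667260799
```

The two numbers can then be passed as --after and --before respectively.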

Expected behavior

All 11 fields are scraped regardless of reddit date.

Screenshots and recordings

No response

OS / Distro

Ubuntu 18.04.6 LTS

Output from snscrape --version

snscrape 0.5.0.20230113

Scraper

reddit-search

Backtrace

No response

Dump of locals

No response

How are you using snscrape?

CLI

Additional context

Called from python notebook. Same outcome if called directly from command line.

dringel, Jan 27 '23 14:01

The Reddit scrapers return both submissions and comments unless instructed otherwise. The title, selftext, and link fields exist only on submissions, not on comments, so their absence means that the results for that time frame contain no submissions.
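A quick way to confirm this on a given output file is to count records by type. This is a minimal sketch, assuming the _type value in the JSON output ends with the class name (e.g. "snscrape.modules.reddit.Submission" or "snscrape.modules.reddit.Comment"). The function name count_item_types and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_item_types(jsonl_lines):
    """Count records by the last dotted component of their `_type` value."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: `_type` looks like "snscrape.modules.reddit.Submission".
        kind = record.get("_type", "unknown").rsplit(".", 1)[-1]
        counts[kind] += 1
    return counts


# Illustrative records only, not real scraper output.
sample = [
    '{"_type": "snscrape.modules.reddit.Comment", "id": "t1_a", "body": "hi"}',
    '{"_type": "snscrape.modules.reddit.Submission", "id": "t3_b", "title": "t"}',
]
print(count_item_types(sample))
```

To run it on real output, pass an open file handle instead of the sample list, for example the October.json file from the report. If the result shows only comments, the missing columns are expected.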

Pushshift recently migrated to a new server and software stack. The old data apparently still hasn't been loaded into the new system, so submissions from before 3 November 2022 are currently missing. This is mentioned e.g. here. There's nothing snscrape can do about this.

JustAnotherArchivist, Jan 27 '23 20:01