pmaw
Queries that specify `before` and `after` can return a different number of results than reported as available by Pushshift
Test Query:
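from pmaw import PushshiftAPI

api = PushshiftAPI()

# Request every comment in the subreddit between the two epoch timestamps
comments = api.search_comments(
    after=1606262347,
    before=1618581599,
    subreddit="CovidVaccinated",
    fields=["id", "subreddit", "link_id", "parent_id", "is_submitter", "author",
            "author_fullname", "body", "score", "created_utc", "permalink"],
    limit=None,  # no cap: retrieve all available results
)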
Results:
40730 result(s) available in Pushshift
Checkpoint:: Success Rate: 71.00% - Requests: 100 - Batches: 10 - Items Remaining: 33898
Checkpoint:: Success Rate: 79.00% - Requests: 200 - Batches: 20 - Items Remaining: 25661
Checkpoint:: Success Rate: 81.67% - Requests: 300 - Batches: 30 - Items Remaining: 18163
Checkpoint:: Success Rate: 81.75% - Requests: 400 - Batches: 40 - Items Remaining: 11467
Checkpoint:: Success Rate: 82.80% - Requests: 500 - Batches: 50 - Items Remaining: 4262
Checkpoint:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
1 result(s) not found in Pushshift
Discovered in #12
A potential cause could be how the database is queried during time slicing. The oldest item's `utc_timestamp` is used as the `before` field when generating subsequent timeslices. Pushshift queries the database using `gt` and `lt` for the `after` and `before` timestamps.
If multiple items have the exact same `utc_timestamp` but are not all returned in a single query (due to the 100 item limit), the remaining items would be excluded from subsequent timeslices, because the strict `lt` comparison filters out anything at that timestamp.
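The suspected behaviour can be reproduced without touching Pushshift. The following is a minimal sketch, not pmaw's actual implementation: `query` and `fetch_all` are hypothetical helpers that imitate a strict `lt` filter with a per-request item limit, and the data is made up for illustration (a limit of 3 stands in for the real limit of 100).

```python
def query(items, before, limit):
    """Hypothetical stand-in for one Pushshift request:
    strict `lt` on the timestamp, newest first, capped at `limit` items."""
    matching = [it for it in items if it["created_utc"] < before]
    matching.sort(key=lambda it: it["created_utc"], reverse=True)
    return matching[:limit]


def fetch_all(items, before, limit):
    """Imitates time slicing: the oldest timestamp of each page
    becomes the `before` value of the next request."""
    found = []
    while True:
        page = query(items, before, limit)
        if not page:
            break
        found.extend(page)
        before = page[-1]["created_utc"]  # oldest item on this page
    return found


# Made-up data: three items share the timestamp 100
items = [
    {"id": "a", "created_utc": 102},
    {"id": "b", "created_utc": 101},
    {"id": "c", "created_utc": 100},
    {"id": "d", "created_utc": 100},
    {"id": "e", "created_utc": 100},
    {"id": "f", "created_utc": 99},
]

found = fetch_all(items, before=200, limit=3)
missing = sorted({it["id"] for it in items} - {it["id"] for it in found})
print(missing)  # ['d', 'e']
```

The first page ends at item `c` with timestamp 100, so the next request asks for items strictly older than 100 and never sees `d` and `e`. If this is what happens in practice, it would explain a small number of results being reported as not found.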