Deleted posts

Open ghost opened this issue 8 years ago • 5 comments

There currently seems to be no way to search for or exclude deleted posts. Example:

http://elasticsearch.pushshift.io?sort=created_utc:desc&q=(subreddit:stationalpha+AND+created_utc:%3C1455300000)

This excludes some but not all:

http://elasticsearch.pushshift.io?sort=created_utc:desc&q=(NOT+selftext:deleted+AND+subreddit:stationalpha+AND+created_utc:%3C1455300000)
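
For anyone who wants to reproduce the two URLs above from a script, here is a minimal Python sketch. `build_search_url` is a hypothetical helper written for illustration, not part of Pushshift; it only assembles the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "http://elasticsearch.pushshift.io"  # endpoint taken from the URLs above


def build_search_url(subreddit, before_utc, exclude_deleted_selftext=False):
    """Compose the query-string search URL used in this issue (hypothetical helper)."""
    clauses = []
    if exclude_deleted_selftext:
        clauses.append("NOT selftext:deleted")
    clauses.append(f"subreddit:{subreddit}")
    clauses.append(f"created_utc:<{before_utc}")
    query = "(" + " AND ".join(clauses) + ")"
    # Keep ':' and parentheses literal so the result matches the hand-written
    # URLs; spaces become '+' and '<' becomes %3C.
    params = urlencode({"sort": "created_utc:desc", "q": query}, safe=":()")
    return f"{BASE_URL}?{params}"


print(build_search_url("stationalpha", 1455300000, exclude_deleted_selftext=True))
```

With `exclude_deleted_selftext=False` the helper yields the first URL; with `True` it yields the second one, which only filters on the `selftext` field.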

ghost avatar Sep 24 '17 06:09 ghost

There are a few different scenarios that we would need to be able to handle when it comes to deleted posts and comments.

  1. A comment / post can be deleted by the original author.
  2. A comment / post can be removed by a moderator of a subreddit.
  3. A comment / post can be automatically removed by a bot like automod.

The Pushshift API is constantly ingesting data from the Reddit /api/info endpoint by asking for one hundred objects at a time. Because the volume is so high, I currently ingest new comments and posts in near real-time. I also re-ingest posts a few hours later to update submission scores. At present, however, I'm not able to re-ingest comments after a few hours due to API limits. I am working on a plan to do this and hope to implement it in the near future.
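
To illustrate the "one hundred objects at a time" batching, here is a rough Python sketch. It is not the actual ingest code; the helper names are made up, and it assumes /api/info accepts a comma-separated `id` parameter of fullnames (such as `t3_6xvgc8`). It only prepares the request parameters and sends nothing.

```python
from itertools import islice

BATCH_SIZE = 100  # "one hundred objects at a time", per the comment above


def batched(fullnames, size=BATCH_SIZE):
    """Yield successive lists of at most `size` fullnames."""
    iterator = iter(fullnames)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def info_query_params(fullnames):
    """Yield one parameter dict per /api/info request (assumed `id` parameter)."""
    for chunk in batched(fullnames):
        yield {"id": ",".join(chunk)}


# Made-up fullnames, purely for illustration.
example_ids = [f"t3_{n}" for n in range(250)]
for params in info_query_params(example_ids):
    print(len(params["id"].split(",")), "ids in this request")
```

Re-ingesting means running the same batches again later, so the number of requests grows with every extra pass, which is where the API limits bite.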

That being said, I do agree that it would be beneficial to create a boolean flag within Elasticsearch to mark whether a comment or post was removed at some point. I need to check the Reddit API to test all of the various ways something can be deleted.
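
If such a flag existed, a search could exclude flagged documents with an Elasticsearch bool query. The sketch below only builds the request body as a Python dict. The field name `is_removed` is hypothetical (no such flag exists yet), and the `term` clause assumes `subreddit` is indexed for exact matching.

```python
import json


def build_visible_posts_query(subreddit, before_utc, flag_field="is_removed"):
    """Build a bool query that skips documents whose removal flag is true.

    `flag_field` is a hypothetical field name used for illustration only.
    """
    return {
        "sort": [{"created_utc": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"subreddit": subreddit}},
                    {"range": {"created_utc": {"lt": before_utc}}},
                ],
                # Exclude anything marked as removed at some point.
                "must_not": [{"term": {flag_field: True}}],
            }
        },
    }


print(json.dumps(build_visible_posts_query("stationalpha", 1455300000), indent=2))
```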

If a moderator removes a comment, does the message body become [deleted] or [removed]? Searching the comment body for those may be a temporary solution.
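
As a sketch of that temporary solution, the check below looks for either marker in a comment body. Which marker corresponds to which of the three scenarios is exactly the open question, so the function only reports the marker it found; `removal_marker` is a hypothetical helper name.

```python
MARKERS = ("[deleted]", "[removed]")


def removal_marker(body):
    """Return the marker if the whole body is one of MARKERS, else None."""
    text = (body or "").strip()
    return text if text in MARKERS else None


# Made-up bodies, purely for illustration.
for body in ["[deleted]", "[removed]", "a normal comment", None]:
    print(repr(body), "->", removal_marker(body))
```

Matching the whole body rather than a substring avoids flagging comments that merely quote the word "deleted".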

Please feel free to add anything that I may have missed -- but what I will end up doing is creating a new comment and then removing it as a moderator to see what JSON parameters change when that happens.

pushshift avatar Sep 24 '17 08:09 pushshift

Some might be, but I am not concerned with deleted comments. Here is my use case. Reddit allows you to see a subreddit history like so:

  • http://reddit.com/r/stationalpha
  • http://reddit.com/r/stationalpha?count=25&after=t3_6xvgc8
  • http://reddit.com/r/stationalpha?count=50&after=t3_6uzt6b
  • etc

With this method, deleted posts are automatically hidden; in fact, I don't think you can unhide them. However, this method only returns the first 800 or so posts, after which you have no recourse but to use Pushshift to see the older posts.

It would be nice if deleted posts were hidden in Pushshift results as well, either by default or with an option, so that the Pushshift experience is closer to the Reddit experience.

ghost avatar Sep 24 '17 12:09 ghost

I see what you are saying. So basically adding a parameter like "show_deletes=false" or something to that effect? I think that is possible if there is a definitive parameter in the submission object that shows if it was, in fact, removed from the Reddit listing.

pushshift avatar Oct 03 '17 06:10 pushshift

Here are a few things that change when you delete a post - I can't seem to use any of them as a filter currently:

  • thumbnail changes to default
  • selftext changes to [deleted]
  • is_crosspostable changes to false
  • author changes to [deleted]
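
Until a server-side filter exists, the fields above could be checked client-side on already-fetched results. This is a heuristic sketch, not a definitive test: it uses only `author` and `selftext`, and leaves out `thumbnail` and `is_crosspostable` on the assumption that live posts can plausibly carry those values too. The function names and sample data are made up.

```python
DELETED_MARKER = "[deleted]"


def looks_deleted(post):
    """Heuristic: treat a post as deleted if author or selftext is the marker."""
    return (
        post.get("author") == DELETED_MARKER
        or post.get("selftext") == DELETED_MARKER
    )


def visible_posts(posts):
    """Keep only the posts that do not look deleted."""
    return [post for post in posts if not looks_deleted(post)]


# Made-up sample records, purely for illustration.
sample = [
    {"id": "a", "author": "[deleted]", "selftext": "[deleted]"},
    {"id": "b", "author": "someone", "selftext": "hello"},
    {"id": "c", "author": "someone"},  # link post without selftext
]
print([post["id"] for post in visible_posts(sample)])
```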

ghost avatar Oct 03 '17 11:10 ghost

I think it might be worth raising this with the Reddit admins to get a new field added that specifies whether a post is visible on Reddit. The logic above could work, but I'm wondering if it will cover all the different scenarios.

pushshift avatar Oct 03 '17 12:10 pushshift

I have reported the issues to Reddit:

  • http://reddit.com/r/changelog/comments/694o34/-/dnspta2
  • http://reddit.com/r/help/comments/6rc3i1/-/dnspl8t

ghost avatar Nov 26 '17 01:11 ghost

Hey Steven. What pushshift endpoints are you using?

On Nov 25, 2017 8:04 PM, "Steven Penny" [email protected] wrote:

To whom it may concern - I will be moving my posts here:

http://headcycle.com/user/svnpenn

It is similar to Reddit, but appears not to have the database issues.

pushshift avatar Nov 26 '17 15:11 pushshift

When I complete the reindexing for ES 6.0 that will be possible.

On Nov 26, 2017 10:16 AM, "svnpenn2" [email protected] wrote:

@pushshift I am using the links shown in my original post. If my query is malformed, or there is some other issue, please let me know, but currently there seems to be no incantation via web browser to exclude deleted posts.

pushshift avatar Nov 26 '17 15:11 pushshift

I made my own site as a workaround - I plan to use this until Reddit gets its act together:

http://svnpenn.github.io/mauve

ghost avatar Apr 16 '18 05:04 ghost