Deleted posts
Seems to be no way currently to search for or exclude deleted posts. Example:
http://elasticsearch.pushshift.io?sort=created_utc:desc&q=(subreddit:stationalpha+AND+created_utc:%3C1455300000)
This excludes some but not all:
http://elasticsearch.pushshift.io?sort=created_utc:desc&q=(NOT+selftext:deleted+AND+subreddit:stationalpha+AND+created_utc:%3C1455300000)
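For anyone who wants to reproduce the two links above from a script, here is a minimal sketch that only assembles the URL. The function name and its arguments are made up for illustration; the endpoint, the sort order and the query clauses are copied from the links in this post.

```python
from urllib.parse import urlencode

BASE_URL = "http://elasticsearch.pushshift.io"  # endpoint taken from the links above


def build_search_url(subreddit, before_utc, exclude_deleted_selftext=False):
    """Assemble a search URL shaped like the ones in this post (illustrative helper)."""
    clauses = []
    if exclude_deleted_selftext:
        # As noted above, this clause excludes some but not all deleted posts.
        clauses.append("NOT selftext:deleted")
    clauses.append(f"subreddit:{subreddit}")
    clauses.append(f"created_utc:<{before_utc}")
    query = "(" + " AND ".join(clauses) + ")"
    params = {"sort": "created_utc:desc", "q": query}
    # Leave "(", ")" and ":" unescaped; spaces become "+" and "<" becomes "%3C".
    return BASE_URL + "?" + urlencode(params, safe="():")


print(build_search_url("stationalpha", 1455300000))
print(build_search_url("stationalpha", 1455300000, exclude_deleted_selftext=True))
```

This does not fix the underlying problem; it just makes it easier to try other clauses in the query.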
There are a few different scenarios that we would need to be able to handle when it comes to deleted posts and comments.
- A comment / post can be deleted by the original author.
- A comment / post can be removed by a moderator of a subreddit.
- A comment / post can be automatically removed by a bot like automod.
The Pushshift API is constantly ingesting data from the Reddit /api/info endpoint by asking for one hundred objects at a time. Because the volume is so high, I currently ingest new comments and posts in near real-time. I also re-ingest posts a few hours later to update submission scores. However, at present I'm not able to re-ingest comments after a few hours due to API limits. I am working on a plan to do this and hope to implement it in the near future.
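To illustrate the "one hundred objects at a time" part, here is a rough sketch of batching fullnames into /api/info requests. This is not the actual Pushshift ingest code: the helper names, the placeholder ids and the exact URL form are assumptions, and no request is sent.

```python
from itertools import islice

INFO_URL = "https://www.reddit.com/api/info.json"  # assumed URL form for /api/info
BATCH_SIZE = 100  # "one hundred objects at a time", per the comment above


def batched(fullnames, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` fullnames."""
    it = iter(fullnames)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def info_request_url(chunk):
    """Build one request URL; the ids go in a comma-separated `id` parameter."""
    return INFO_URL + "?id=" + ",".join(chunk)


# Placeholder comment fullnames, only to show the batching.
fullnames = [f"t1_{n:x}" for n in range(250)]
urls = [info_request_url(chunk) for chunk in batched(fullnames)]
print(len(urls), "requests needed for", len(fullnames), "objects")
```

Re-ingesting comments later would mean running the same batches a second time, which is where the API limits mentioned above come in.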
That being said, I do agree with you in that perhaps it would be beneficial to create a boolean flag within Elasticsearch to mark if a comment or post was removed at some point. I need to check the Reddit API to test all of the various combinations on how something can be deleted.
If a moderator removes a comment, does the message body become [deleted] or [removed]? Searching the comment body for those may be a temporary solution.
Please feel free to add anything that I may have missed -- but what I will end up doing is creating a new comment and then removing it as a moderator to see what JSON parameters change when that happens.
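As a stopgap along the lines of searching the comment body, something like the following could be run client-side on returned comments. It assumes the body becomes exactly `[deleted]` or `[removed]`, which is the open question above, and the function name is hypothetical.

```python
# Placeholder bodies assumed to mark a deleted or removed comment.
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def body_looks_gone(comment):
    """Return True if the comment body is one of the assumed placeholder strings."""
    body = comment.get("body") or ""
    return body.strip() in PLACEHOLDER_BODIES


comments = [
    {"id": "a1", "body": "still here"},
    {"id": "a2", "body": "[deleted]"},
    {"id": "a3", "body": "[removed]"},
    {"id": "a4", "body": "I said [deleted] in a sentence"},
]
visible = [c["id"] for c in comments if not body_looks_gone(c)]
print(visible)
```

The check matches the whole body rather than a substring, so a comment that merely quotes the word is kept.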
Some might be, but I am not concerned with deleted comments. Here is my use case. Reddit allows you to see a subreddit history like so:
- http://reddit.com/r/stationalpha
- http://reddit.com/r/stationalpha?count=25&after=t3_6xvgc8
- http://reddit.com/r/stationalpha?count=50&after=t3_6uzt6b
- etc
Using this method, deleted posts are automatically hidden; in fact, I don't think you can unhide them with this method. However, this method only returns the first 800 or so posts, after which you have no recourse but to use Pushshift to see the older posts.
It would be nice if deleted posts were hidden there as well, either by default or with an option, so that the Pushshift experience is closer to the Reddit experience.
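The subreddit history links above follow a simple pattern, sketched here for reference. The helper name is made up; the `count` and `after` parameters and the example values are taken from the links in this comment.

```python
from urllib.parse import urlencode


def listing_page_url(subreddit, count=0, after=None):
    """Build a subreddit listing URL; `after` is the fullname of the last post seen."""
    url = f"http://reddit.com/r/{subreddit}"
    if after is None:
        return url  # first page needs no parameters
    return url + "?" + urlencode({"count": count, "after": after})


print(listing_page_url("stationalpha"))
print(listing_page_url("stationalpha", count=25, after="t3_6xvgc8"))
```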
I see what you are saying. So basically adding a parameter like "show_deletes=false" or something to that effect? I think that is possible if there is a definitive parameter in the submission object that shows if it was, in fact, removed from the Reddit listing.
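Roughly, the behavior of such a parameter could look like the sketch below. Everything here is hypothetical: neither `show_deletes` nor a `removed` boolean field exists yet, and the real filtering would happen inside Elasticsearch rather than in Python.

```python
def apply_show_deletes(submissions, show_deletes=True):
    """Drop submissions flagged as removed when show_deletes is False.

    Assumes each submission dict may carry a hypothetical boolean `removed`
    field; submissions without the field are treated as still visible.
    """
    if show_deletes:
        return list(submissions)
    return [s for s in submissions if not s.get("removed", False)]


submissions = [
    {"id": "p1", "removed": False},
    {"id": "p2", "removed": True},
    {"id": "p3"},  # no flag recorded
]
print([s["id"] for s in apply_show_deletes(submissions, show_deletes=False)])
```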
Here are a few changes when you delete a post. I can't seem to use any of them as a filter currently:
- thumbnail changes to default
- selftext changes to [deleted]
- is_crosspostable changes to false
- author changes to [deleted]
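If the submission JSON is available client-side, the field changes listed above could be combined into a rough heuristic like this one. It is only a sketch: the function names are invented, and treating `author` and `selftext` as the strong signals (with the other two as weak hints, on the assumption that they can also occur on live posts) is my own guess, not something confirmed for every deletion scenario.

```python
def deleted_signals(submission):
    """Return the names of fields whose values match the changes listed above."""
    signals = []
    if submission.get("author") == "[deleted]":
        signals.append("author")
    if submission.get("selftext") == "[deleted]":
        signals.append("selftext")
    if submission.get("thumbnail") == "default":
        signals.append("thumbnail")
    if submission.get("is_crosspostable") is False:
        signals.append("is_crosspostable")
    return signals


def looks_deleted(submission):
    """Heuristic: require at least one of the two strong signals."""
    signals = deleted_signals(submission)
    return "author" in signals or "selftext" in signals


post = {"author": "[deleted]", "selftext": "[deleted]",
        "thumbnail": "default", "is_crosspostable": False}
print(deleted_signals(post), looks_deleted(post))
```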
I think it might be worth raising this with the Reddit admins to get a new field added to specify whether it is visible or not on Reddit. The logic above could work, but I'm wondering if it will cover all the different scenarios.
I have reported the reddit issues
- http://reddit.com/r/changelog/comments/694o34/-/dnspta2
- http://reddit.com/r/help/comments/6rc3i1/-/dnspl8t
Hey Steven. What pushshift endpoints are you using?
On Nov 25, 2017 8:04 PM, "Steven Penny" wrote:
To whom it may concern - I will be moving my posts here:
http://headcycle.com/user/svnpenn
similar to reddit, but appears not to have the database issues
When I complete the reindexing for ES 6.0 that will be possible.
On Nov 26, 2017 10:16 AM, "svnpenn2" wrote:
@pushshift I am using the links shown in my original post. If my query is malformed, or there is some other issue, please let me know, but currently there seems to be no incantation via web browser to exclude deleted posts.
I made my own site as a workaround - I plan to use this until Reddit gets its act together:
http://svnpenn.github.io/mauve