Does your data collection rescan and overwrite?

Open Zoher15 opened this issue 2 months ago • 3 comments

Hey @ArthurHeitmann ,

Thank you so much for putting together and sharing all of this Reddit data. I really appreciate the effort that goes into maintaining it.

I’m currently working on a research project about Reddit moderation. Up until now, I was using the Pushshift dumps from 2005-06 through 2023-02, focusing on moderator comments and their corresponding parent comments (the moderated comments).

However, I ran into a significant issue. Because of rescanning and overwriting, many controversial comments ended up being replaced with “[removed]” or “[deleted].” It seems that even though Pushshift was designed to capture data quickly, often within seconds, the later rescans about 24 hours afterward would overwrite the original content and erase valuable moderated discussions.

Before I invest time in working with your data dumps, could you clarify whether they follow a similar rescanning or overwriting process? Understanding this would help me determine whether your data dumps avoid the same issue.

Thanks again for your time and for making this resource available.

Zoher15 avatar Nov 02 '25 00:11 Zoher15

I can't say much about the data from Pushshift, which was collected up until 2023. I don't know their exact approach.

Data that came afterwards, from Arctic Shift, is closer to what you're looking for. Posts and comments are archived within a few seconds of their creation. After 1.5 days, the upvotes and some other data are updated. This approach preserves the original texts.

The only exception is content that was removed by AutoModerator immediately after posting. It's not possible to view those comments unless you're a moderator of that subreddit.

ArthurHeitmann avatar Nov 02 '25 14:11 ArthurHeitmann
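
To check how much of a given dump is affected by the overwriting discussed above, one option is to count comments whose text is just a placeholder. This is a minimal sketch, not the maintainer's tooling: it assumes the dump has been decompressed to newline-delimited JSON with one comment object per line and a `body` field, which this thread does not confirm. The function name `count_placeholder_bodies` and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

# Placeholder texts mentioned in the thread for comments whose content is gone.
PLACEHOLDERS = {"[removed]", "[deleted]"}


def count_placeholder_bodies(lines: Iterable[str]) -> Counter:
    """Count comments by whether their body is a placeholder or intact.

    Assumes each non-empty line is one JSON object with a `body` field
    (an assumption about the dump layout, not confirmed in this thread).
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        body = comment.get("body")
        if body in PLACEHOLDERS:
            counts[body] += 1
        else:
            counts["intact"] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    '{"id": "a1", "body": "original text"}',
    '{"id": "a2", "body": "[removed]"}',
    '{"id": "a3", "body": "[deleted]"}',
]
result = count_placeholder_bodies(sample)
```

In practice, `lines` could be an open file handle over a decompressed dump, so the whole file never has to fit in memory.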

@ArthurHeitmann Thanks, that sounds promising!

Could you clarify how compliant this data collection is with Reddit’s policies? I’d like to be sure before using it for research, and it would help others who may want to reproduce my work.

Zoher15 avatar Nov 02 '25 18:11 Zoher15

If you wanted to be 100% compliant with Reddit's policies, you'd have to collect all the data yourself. However, using third-party Reddit datasets is pretty common in academic work.

ArthurHeitmann avatar Nov 02 '25 18:11 ArthurHeitmann