
Archive links from the Discord server

cormacrelf opened this issue · 1 comment

Giancarlo spoke to the Bellingcat Community Discord yesterday and mentioned that Bellingcat has an auto-archiver that works with links dropped manually into a Google Sheet. It seems like this could easily be extended to auto-archive any link posted in specific channels on the Discord server. It might also turn out to be a faster way to use the auto-archiver for any researchers who already work in Discord. The idea is that it gobbles up any URL posted anywhere in a message in a channel, as long as the URL matches one of the configured archivers.

If you want this and it wouldn't cost too much to run on specific channels, I'm happy to build it. There are a few implementation options I could use some guidance on.

Bot vs batch

I've started creating a bot. I figure that's probably better for staying under the Discord API rate limits than frequently polling with a "get everything on this channel" request, but depending on how you guys like to run these archivers, there may be a disadvantage in that it has to stay running all the time. I suppose there'd be no harm in doing both: a bot that on startup reads the channel histories up to a maximum number of messages, then waits quietly for new ones. Where do you stand on that?
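For concreteness, here's a minimal sketch of that hybrid approach using discord.py; the channel IDs, the backfill limit, and the archive_url stub are placeholders, not anything that exists in this repo:

```python
# Minimal sketch with discord.py: backfill recent history on startup,
# then archive URLs from new messages as they arrive.
# WATCHED_CHANNELS, BACKFILL_LIMIT, and archive_url are placeholders.
import os
import re

import discord

URL_RE = re.compile(r"https?://\S+")
WATCHED_CHANNELS = {1234567890}  # channel IDs to watch (placeholder)
BACKFILL_LIMIT = 500             # max messages to read per channel on startup

intents = discord.Intents.default()
intents.message_content = True   # required to read message text
client = discord.Client(intents=intents)

def archive_url(url: str) -> None:
    print(f"would archive: {url}")  # stand-in for the real archiving call

@client.event
async def on_ready():
    # One-off backfill, bounded so we stay well under the API rate limits.
    for channel_id in WATCHED_CHANNELS:
        channel = client.get_channel(channel_id)
        async for message in channel.history(limit=BACKFILL_LIMIT):
            for url in URL_RE.findall(message.content):
                archive_url(url)

@client.event
async def on_message(message):
    # Then sit quietly and handle new messages as they come in.
    if message.author.bot or message.channel.id not in WATCHED_CHANNELS:
        return
    for url in URL_RE.findall(message.content):
        archive_url(url)

client.run(os.environ["DISCORD_TOKEN"])
```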

How to trigger the actual archiving

You could:

  • Add it to a Google Sheet, and that's it. Let the existing scheduled archiver take care of any added URLs. A fair bit simpler.
  • When a message with a link arrives, add it to the sheet and schedule an archive in the same discord_archive.py program. This is cooler because the bot could then whack a little reaction emoji on messages to indicate archive status (see the sketch after this list).
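A rough sketch of that second option, assuming gspread for the sheet access; the spreadsheet name, column layout, and do_archive stub are all illustrative:

```python
# Sketch of option 2: append to the sheet, archive inline, and use
# reaction emoji as a status indicator. Spreadsheet name, column layout,
# and do_archive are placeholders.
import gspread

gc = gspread.service_account()           # uses the default credentials file
sheet = gc.open("Archive Links").sheet1  # placeholder spreadsheet

def do_archive(url: str) -> None:
    raise NotImplementedError  # stand-in for the real archiver call

async def handle_link(message, url: str):
    sheet.append_row([url, str(message.created_at), message.jump_url])
    await message.add_reaction("⏳")       # archiving in progress
    try:
        do_archive(url)
    except Exception:
        await message.add_reaction("❌")   # archive failed
    else:
        await message.add_reaction("✅")   # archived successfully
```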

Deduplication

Considering this will add a bunch of new links to the archives, I'm worried about whether it will clobber previously archived pages in the S3 backend. Detecting that is something the archivers themselves are meant to do, right? I don't think the Twitter one does. Does DigitalOcean Spaces (its S3-compatible storage) support object versioning, just in case? And the archivers don't overwrite anything if they hit a 404, right?
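One way to guard against clobbering, sketched with boto3 against an S3-compatible endpoint; the bucket name, region, and key scheme below are made up:

```python
# Sketch: check whether a key already exists before uploading, using boto3
# against DigitalOcean Spaces' S3-compatible API. Bucket, region, and key
# naming are placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://nyc3.digitaloceanspaces.com",  # example region
)

def already_archived(bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True                      # object exists; skip re-upload
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False                 # not archived yet
        raise                            # some other error; surface it

if not already_archived("my-archive-bucket", "archives/some-url-hash"):
    pass  # safe to upload the new archive here
```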


https://user-images.githubusercontent.com/378760/160227247-cb770ca4-702c-4641-997d-b439f651bc80.mov

cormacrelf · Mar 26 '22 06:03

Update:

[screenshot: aabot 2]

cormacrelf · Mar 28 '22 16:03

I'm closing this issue given how long it's been open.

It should be said that, after some recent refactoring, this is now possible with a new Feeder (see the one we use for Google Sheets), which is the component that fetches the links that need to be archived.
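A rough sketch of what such a feeder could look like; the class and method names below are illustrative, not the auto-archiver's actual Feeder interface (check the Google Sheets feeder for the real one):

```python
# Illustrative sketch of a Discord-backed feeder: iterate over message
# texts and yield every URL found. Names here are made up; the real
# auto-archiver Feeder base class may differ.
import re
from typing import Iterable, Iterator

URL_RE = re.compile(r"https?://\S+")

class DiscordFeeder:
    """Yields URLs extracted from raw Discord message texts."""

    def __init__(self, messages: Iterable[str]):
        self.messages = messages

    def __iter__(self) -> Iterator[str]:
        for text in self.messages:
            yield from URL_RE.findall(text)

# Usage: feed it message bodies fetched however you like.
for url in DiscordFeeder(["check https://example.com/post", "no links here"]):
    print(url)
```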

msramalho · Aug 17 '23 17:08