Cache JSON response data for faster updating.

Open Dracovian opened this issue 4 years ago • 4 comments

Currently the script does the following steps:

  1. Read from the config data to gather the guild and channel IDs that are to be scraped.
  2. Form a simple search request to Discord's server (undocumented API endpoint).
  3. Retrieve the JSON response from Discord's server.
  4. Sift through the JSON response to find all of the attachments and embedded content.
  5. Retrieve the binary contents of the attachments and embedded content and save to disk.
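Step 4 can be sketched roughly as follows. This is only an illustration, not the script's actual code: `iter_media_urls` is a hypothetical helper, and it assumes the search response has already been flattened into a plain list of message objects (the search endpoint is undocumented, so its exact wrapper shape is not assumed here). The `attachments`, `embeds`, and `url` field names follow Discord's documented message object.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_media_urls(messages: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield attachment and embed media URLs from message objects."""
    for message in messages:
        # Attachments carry a direct URL to the uploaded file.
        for attachment in message.get("attachments", []):
            if "url" in attachment:
                yield attachment["url"]
        # Embeds may carry media under several optional keys.
        for embed in message.get("embeds", []):
            for key in ("image", "video", "thumbnail"):
                media = embed.get(key)
                if media and "url" in media:
                    yield media["url"]
```

Each yielded URL would then be downloaded in step 5, skipping files that already exist on disk.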

This is not a problem if you only run the script once, but it becomes an issue across multiple runs, because every run has to request the same JSON data again.

I do attempt to alleviate this by starting from the most recent date (today) and then traversing further back in time until I reach the Discord epoch (January 1, 2015).
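A minimal sketch of that backwards walk, assuming the standard snowflake layout (milliseconds since the Discord epoch stored in bits 22 and up). The names `snowflake_from_datetime` and `day_windows` are hypothetical, and treating each window as one UTC day with the next day's start as the upper bound is an assumption rather than what the script necessarily does.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple

# Discord epoch: 2015-01-01T00:00:00Z
DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def snowflake_from_datetime(moment: datetime) -> int:
    """Smallest snowflake whose embedded timestamp equals `moment`."""
    offset_ms = (moment - DISCORD_EPOCH) // timedelta(milliseconds=1)
    return offset_ms << 22


def day_windows(start: datetime) -> Iterator[Tuple[int, int]]:
    """Yield (min_id, max_id) per UTC day, newest first, stopping at the epoch."""
    day = start.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    while day >= DISCORD_EPOCH:
        next_day = day + timedelta(days=1)
        yield snowflake_from_datetime(day), snowflake_from_datetime(next_day)
        day -= timedelta(days=1)
```

Because the loop condition compares against the epoch in UTC, no window with a negative ID is ever produced.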

Even then this could be much faster and here's how I plan to accomplish this:

  1. Read from the config data to gather the guild and channel IDs that are to be scraped.
  2. Determine if we already grabbed the JSON contents from a previous run.
     • If we have already grabbed the JSON contents then skip to the next day.
     • If we haven't grabbed the JSON contents, then proceed to step 3.
  3. Retrieve the JSON response from Discord's server.
  4. Calculate a checksum of the JSON contents and write both in a cache directory.
     • Continue to do this until we've gathered all of the JSON data available in the channel.
     • Once we have all of the JSON data, we can now proceed to step 5.
  5. Load each of the cached JSON data in order from newest to oldest.
  6. Sift through the loaded JSON data to find all of the attachments and embedded content.
  7. Retrieve the binary contents of the attachments and embedded content and save to disk.
  8. Mark the JSON files as "exhausted" to avoid having the script load it in the future when updating.
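The caching part of this plan could look something like the sketch below. Everything here is an assumption for illustration: the per-channel directory layout, the `.sha256` and `.exhausted` sidecar files, the choice of SHA-256 as the checksum, and the function names. The `fetch` callable stands in for whatever actually performs the request to Discord.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Tuple


def cache_paths(cache_dir: Path, channel_id: str, day: str) -> Tuple[Path, Path, Path]:
    """Return the JSON, checksum, and exhausted-marker paths for one day."""
    base = cache_dir / channel_id
    return base / f"{day}.json", base / f"{day}.sha256", base / f"{day}.exhausted"


def fetch_or_load(cache_dir: Path, channel_id: str, day: str,
                  fetch: Callable[[], Any]) -> Any:
    """Return cached JSON if its checksum matches, otherwise fetch and cache it."""
    json_path, sum_path, _ = cache_paths(cache_dir, channel_id, day)
    if json_path.exists() and sum_path.exists():
        raw = json_path.read_bytes()
        if hashlib.sha256(raw).hexdigest() == sum_path.read_text().strip():
            return json.loads(raw)  # cache hit: no request needed
    data = fetch()  # cache miss or corrupt entry: ask the server
    raw = json.dumps(data, sort_keys=True).encode("utf-8")
    json_path.parent.mkdir(parents=True, exist_ok=True)
    json_path.write_bytes(raw)
    sum_path.write_text(hashlib.sha256(raw).hexdigest())
    return data


def mark_exhausted(cache_dir: Path, channel_id: str, day: str) -> None:
    """Record that every file referenced by this day's JSON has been saved."""
    cache_paths(cache_dir, channel_id, day)[2].touch()


def is_exhausted(cache_dir: Path, channel_id: str, day: str) -> bool:
    """True if this day's JSON can be skipped on later runs."""
    return cache_paths(cache_dir, channel_id, day)[2].exists()
```

Storing the checksum next to the JSON lets a later run detect a truncated or corrupted cache file and fall back to a fresh request instead of trusting it.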

This does seem like more steps, but it's faster to read from a local cache than to make a request for the data every time we run the script. Even though we ensure that we're not overwriting existing files, the script still has to download the JSON contents to determine whether the target file has been grabbed or not.

Dracovian, Apr 23 '20 11:04

> I do attempt to alleviate this by starting from the most recent date (today) and then proceeding to traverse further back in time until I reach the end of the Discord epoch (January 1, 2015).

This might not be the best place to point this out, but this will cause the API to return HTTP 400s, possibly for a long time, once the scraper inevitably goes past the channel's creation date:

(...)
[WARN] HTTP 400 from https://discord.com/api/v8/channels/<snip>/messages/search?min_id=-51489275904000000&max_id=-51126892232704000&include_nsfw=true.
[WARN] HTTP 400 from https://discord.com/api/v8/channels/<snip>/messages/search?min_id=-51851663769600000&max_id=-51489280098304000&include_nsfw=true.
[WARN] HTTP 400 from https://discord.com/api/v8/channels/<snip>/messages/search?min_id=-52214051635200000&max_id=-51851667963904000&include_nsfw=true.
[WARN] HTTP 400 from https://discord.com/api/v8/channels/<snip>/messages/search?min_id=-52576439500800000&max_id=-52214055829504000&include_nsfw=true.
(...)
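Decoding one of the logged IDs shows where the negative values come from. This is a sketch assuming the standard snowflake layout (milliseconds since the Discord epoch in bits 22 and up); `snowflake_to_datetime` is a hypothetical helper, not something from the script.

```python
from datetime import datetime, timedelta, timezone

# Discord epoch: 2015-01-01T00:00:00Z
DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(snowflake: int) -> datetime:
    """Recover the UTC timestamp embedded in a snowflake."""
    offset_ms = snowflake >> 22  # drop the 22 low bits (worker, process, increment)
    return DISCORD_EPOCH + timedelta(milliseconds=offset_ms)


# First min_id from the log above.
logged_min_id = -51489275904000000
print(snowflake_to_datetime(logged_min_id))  # 2014-08-11 22:00:00+00:00
```

The decoded date falls before January 1, 2015, so the ID is negative, and consecutive `min_id` values in the log are exactly one day apart.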

mataha, Feb 12 '21 22:02

> It might not be the best place to point this out, however: this will propel the API to spit 400s for a possibly long time when the scraper inevitably goes past the channel's creation date:

I thought I dealt with this problem at this location in the code.

Can you confirm that you have the most up-to-date files? Otherwise I'll have to figure out a better method of checking that the snowflake being returned doesn't go below the earliest possible snowflake (January 1, 2015) for Discord.

Dracovian, Feb 12 '21 23:02

I am running the latest code from the experimental branch.

Furthermore - that line doesn't check a channel's creation date in any way; it could be created after January 1, 2015.

mataha, Feb 12 '21 23:02

> I am running the latest code from the experimental branch.

Okay, that means my method isn't working then...

> Furthermore - that line doesn't check a channel's creation date in any way; it could be created after January 1, 2015.

Oh no, it's not supposed to check the channel's creation date. Instead, it's supposed to determine whether the checked date is after January 1, 2015, since that is the lowest possible timestamp supported by Discord's snowflake system.

What's going on here is that the script is running the equivalent of a series of channel searches starting at either the current date or the date of the last visible post in the channel (based on the channel permissions as it pertains to your account or any bot account you may be using for this script). From there it runs through every single day backwards and it's supposed to stop at January 1, 2015 since that'll produce the smallest possible snowflake value.

I could have the snowflake calculation function return None if the input timestamp is lower than January 1, 2015 (UTC I imagine despite the search feature relying heavily on the local timezone offset... oh, I think I might have figured out why the problem persists).
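A rough sketch of that idea, with the comparison done in UTC so a local timezone offset cannot push the result below zero. The name `snowflake_or_none` is hypothetical, and treating naive datetimes as UTC is an assumption made only for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Discord epoch: 2015-01-01T00:00:00Z
DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def snowflake_or_none(moment: datetime) -> Optional[int]:
    """Return the snowflake for `moment`, or None if it predates the epoch."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    # Aware datetimes are compared on the UTC timeline regardless of offset.
    offset_ms = (moment - DISCORD_EPOCH) // timedelta(milliseconds=1)
    if offset_ms < 0:
        return None
    return offset_ms << 22
```

With this guard, local midnight on January 1, 2015 in a UTC+2 timezone yields `None`, because that instant is still December 31, 2014 in UTC. The caller can treat `None` as the signal to stop walking backwards.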

The next commit should fix this issue; it's another oversight on my part.

Dracovian, Feb 12 '21 23:02