Experimental branch scraper gets everything but text

Open desperandos101 opened this issue 4 years ago • 4 comments

I'm trying to scrape data from a Discord channel. It scrapes photos and videos, but it doesn't scrape text. I have the option enabled in the config, and the program still creates a file directory when I run it, but it seems to stall indefinitely from that point onward. Nothing appears in the command line either. If it helps, I've also tried to scrape text using the master branch, and that works OK, so it seems to be a problem only with the experimental branch.

desperandos101 avatar Jan 13 '21 20:01 desperandos101

Right now I'm running an extensive test of the JSON caching code against an entire channel that's been around for nearly 5 years (2016-2021).

I have yet to figure out a way to capture CTRL + C input so the script can be stopped without having to kill the Python process manually from the command line or a GUI task manager.
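
For reference, one common way to handle this in Python is to register a `SIGINT` handler that only sets a flag, and have the scraping loop check that flag between requests. The sketch below is an assumption about how that could look, not the code in this repository; `stop_event`, `request_stop`, `install_handler`, and `scrape_all` are hypothetical names.

```python
import signal
import threading

# Shared flag that the scraping loop polls between units of work.
stop_event = threading.Event()


def request_stop(signum, frame):
    # Keep the handler tiny: only raise the flag, do no I/O here.
    stop_event.set()


def install_handler():
    # signal.signal must be called from the main thread,
    # otherwise it raises ValueError.
    signal.signal(signal.SIGINT, request_stop)


def scrape_all(batches, fetch):
    """Call fetch(batch) for each batch until done or a stop is requested."""
    results = []
    for batch in batches:
        if stop_event.is_set():
            break  # CTRL + C was pressed: stop cleanly, keep what we have
        results.append(fetch(batch))
    return results
```

Calling `install_handler()` once at startup would let CTRL + C end the loop at the next batch boundary instead of requiring the process to be killed. If the real script uses worker threads, they would need to check the same flag.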

The current config for the text scraping on the experimental branch is ignored, so I'm looking to make a push to the experimental branch to implement the text scraping and JSON data caching features.

From there I'll have to get the DM scraping portion working, then the file header checking/sanitization, and then the CRC checksum generation method, before making the next push to the experimental branch code.

After that I'll implement compression for text and images, then use the CRC hashes to detect duplicate files and remove them along with their CRC hash files to clean things up. The compression will remove the uncompressed copy once compression is complete. From there I'll make the third push to the experimental branch code.
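
As a rough illustration of the CRC-based cleanup and the compress-then-delete step described above, here is a minimal sketch using only the standard library. It assumes CRC-32 via `zlib.crc32` and gzip compression; the thread doesn't say which CRC variant or compression format is planned. It also skips the separate CRC hash files, and all function names are hypothetical.

```python
import filecmp
import gzip
import os
import shutil
import zlib


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file as 8 hex digits, reading in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08x")


def remove_duplicates(paths):
    """Delete files whose content duplicates an earlier file in paths."""
    seen = {}
    removed = []
    for path in paths:
        checksum = crc32_of_file(path)
        original = seen.get(checksum)
        # CRC-32 can collide, so confirm byte-for-byte before deleting.
        if original and filecmp.cmp(original, path, shallow=False):
            os.remove(path)
            removed.append(path)
        else:
            seen.setdefault(checksum, path)
    return removed


def compress_and_remove(path):
    """Gzip a file, then delete the uncompressed copy."""
    compressed = path + ".gz"
    with open(path, "rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return compressed
```

The byte-for-byte comparison is there because a matching CRC-32 alone does not prove two files are identical.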

After some further bug testing and refinements to the scraping process and config file I'll finalize the experimental branch by merging it into the main branch and thus the whole ordeal will be complete.

I'll send another comment in this issue to let you know when I have made the next push to the experimental branch which should be the fix for this issue.

Dracovian avatar Feb 06 '21 22:02 Dracovian

Okay, I figured out the CTRL + C issue and I have enabled JSON caching for the Discord API response data. That data includes the text that was sent and information on who sent it: their nickname (at the time of scraping), their discriminator (the 4 digits that allow multiple accounts to have the same username), and their user ID (a snowflake, according to Discord's own API reference).

Now I just have to implement the following before the branches can be merged together again:

  • DM Scraping
  • File header checking and sanitization
  • CRC checksum generation and duplicate file sanitization
  • Text and image compression
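
To illustrate the file header checking item in the list above, here is a minimal sketch that compares a file's leading bytes against a few well-known signatures. It covers only the checking half, not sanitization, and the set of types, the `SIGNATURES` table, and the function names are assumptions made for this example.

```python
import os

# Leading "magic" bytes for a few common attachment types.
SIGNATURES = {
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpg": [b"\xff\xd8\xff"],
    "gif": [b"GIF87a", b"GIF89a"],
}
# Extensions that should be treated as another entry in SIGNATURES.
ALIASES = {"jpeg": "jpg"}


def sniff_type(header):
    """Return the detected type for the given leading bytes, or None."""
    for kind, magics in SIGNATURES.items():
        if any(header.startswith(magic) for magic in magics):
            return kind
    return None


def extension_matches(filename, header):
    """True only if the header is a known type that matches the extension."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    ext = ALIASES.get(ext, ext)
    return sniff_type(header) == ext
```

Reading the first 8 bytes of a downloaded file is enough for the signatures listed here. Files whose type is not in the table are reported as non-matching, so a real implementation would need a policy for unknown types.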

As I mentioned, I'll add another comment when I've made the next push to the experimental branch. I've also made sure the JSON cache data is formatted to be readable, as opposed to how it looks straight from Discord's API response.
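
A minimal sketch of what readable JSON caching could look like, assuming each cached message is a dict shaped like Discord's message object (`id`, `timestamp`, `content`, `author`, `attachments`). The exact fields the scraper keeps are not stated in this thread, and `summarize_message` and `format_cache` are hypothetical names.

```python
import json


def summarize_message(message):
    """Keep only the fields needed to show who said what, and when."""
    author = message.get("author") or {}
    return {
        "id": message.get("id"),
        "timestamp": message.get("timestamp"),
        "content": message.get("content", ""),
        "author": {
            "id": author.get("id"),  # the user's snowflake
            "username": author.get("username"),
            "discriminator": author.get("discriminator"),
        },
        "attachments": [a.get("url") for a in message.get("attachments", [])],
    }


def format_cache(messages):
    """Serialize messages as indented JSON instead of one compact line."""
    trimmed = [summarize_message(m) for m in messages]
    # ensure_ascii=False keeps non-ASCII message text human-readable.
    return json.dumps(trimmed, indent=4, ensure_ascii=False)
```

The `indent` argument is what turns the single-line API response into something that can be read in a text editor.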

Dracovian avatar Feb 10 '21 23:02 Dracovian

Same issue. I'm fine without other features as long as it can get text. Does the non-experimental branch scrape text?

OMN1-H1V3 avatar Mar 14 '21 02:03 OMN1-H1V3

The non-experimental (main) branch code hasn't been touched in months.

The only thing that the main branch code does that the experimental branch code doesn't do is scrape DMs.

The experimental branch will scrape JSON data, which will contain the text along with all other details, such as who posted what and any attachments that were grabbed along with it.

The ideal outcome is that one could generate a webpage that reads from the JSON and lets a reader sift through the posts as if they were in that Discord guild at that moment, reading them firsthand.
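
As a rough idea of how such a page could be generated, the sketch below turns a JSON list of messages into a static HTML string. The input shape (`author.username`, `timestamp`, `content`) is an assumption modeled on Discord's message object, and `render_page` is a hypothetical name, not part of the scraper.

```python
import html
import json


def render_page(cache_json):
    """Build a simple static HTML page from a JSON list of messages."""
    messages = json.loads(cache_json)
    rows = []
    for message in messages:
        author = message.get("author") or {}
        # Escape everything that came from users so it cannot inject markup.
        name = html.escape(str(author.get("username", "unknown")))
        when = html.escape(str(message.get("timestamp", "")))
        body = html.escape(str(message.get("content", "")))
        rows.append(f"<p><b>{name}</b> <small>{when}</small><br>{body}</p>")
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(rows) + "\n</body></html>\n"
```

Escaping matters here because message text is untrusted input; without it, a message containing HTML would be rendered as markup in the generated page.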

The experimental branch still needs a few more features before it can be merged into the main branch.

Dracovian avatar Mar 14 '21 21:03 Dracovian