Discord-Scraper

Script only fetches latest posts

Open carado opened this issue 5 years ago • 6 comments

(hi, sorry about pressing enter too fast, I guess I'll make this a real issue now)

I've been trying to run the script on a channel with a few years of history, but it only fetches the last 6 messages (most of them twice) into text.db, and thus only the one image that was posted in them, instead of the expected thousands of messages. (I can confirm that those 6 messages are indeed the last 6 from the channel in question.)

I've tried to run the script on my VPS in case it was an issue with my relatively unreliable internet connection but the result is the same.

Do you have any idea what might be causing this?

carado avatar Nov 27 '19 12:11 carado

I can confirm that the function "get_day" is working:

[Screenshot: Testing out the timestamps]

[Screenshot: Testing out the get_day function]

And I can confirm that the results should be the exact same as the search function from within Discord:

[Screenshot: Printing out the JSON results]
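For readers following along without the screenshots, here is a rough sketch of the kind of conversion a day-based lookup needs: turning a calendar day into a range of Discord IDs. It relies only on the documented snowflake layout (milliseconds since 2015-01-01 UTC stored above the low 22 bits). The function names are hypothetical, and this is not necessarily how the script's own get_day is written.

```python
from datetime import date, datetime, timedelta, timezone

# Discord's documented snowflake epoch: 2015-01-01T00:00:00Z.
DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)
# The low 22 bits of a snowflake hold worker, process and increment fields.
TIMESTAMP_SHIFT = 22


def datetime_to_snowflake(moment):
    """Smallest snowflake ID that could have been generated at `moment`."""
    millis = (moment - DISCORD_EPOCH) // timedelta(milliseconds=1)
    return millis << TIMESTAMP_SHIFT


def snowflake_to_datetime(snowflake):
    """Creation time encoded in a snowflake ID, as an aware UTC datetime."""
    return DISCORD_EPOCH + timedelta(milliseconds=snowflake >> TIMESTAMP_SHIFT)


def day_to_snowflake_range(day):
    """Half-open ID range [start, end) covering one UTC calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return datetime_to_snowflake(start), datetime_to_snowflake(end)
```

Walking backwards one day at a time with ranges like these is one way to reach older history; if only the most recent range is ever requested, only the latest posts would come back.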

So the issue has to rest somewhere with SimpleRequests or with the SQLite insert statement. I know that certain character encodings are bound to cause issues, and those might contribute to the lower number of posts in the SQLite database.
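One way to test the encoding theory in isolation is to push awkward strings through an insert and read them back. The sketch below uses Python's standard sqlite3 module with parameter binding; the table and column names are made up for illustration and are not the script's actual schema.

```python
import sqlite3


def store_and_reload(texts):
    """Insert each string with a bound parameter, then read them back in order."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, used only for this round-trip check.
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
    # "?" placeholders let sqlite3 handle quoting and Unicode, instead of
    # building the SQL statement through string concatenation.
    conn.executemany(
        "INSERT INTO messages (content) VALUES (?)",
        [(text,) for text in texts],
    )
    conn.commit()
    rows = [row[0] for row in conn.execute("SELECT content FROM messages ORDER BY id")]
    conn.close()
    return rows


samples = ["plain ascii", "it's \"quoted\"", "日本語", "emoji 🎉", "zero\u200bwidth"]
```

If the strings survive the round trip unchanged with bound parameters, dropped rows are more likely to come from how the insert statement is built or from an exception being swallowed than from SQLite itself.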

The duplicates issue is a strange one, but we're at the mercy of the Cloudflare CDNs, as any issues with them will result in issues with the scraped output. I'm going to make some heavier revisions to the code, and we'll see whether the next commit fixes these issues.
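Whatever the cause of the duplicates turns out to be, they can be filtered before anything is written. A minimal sketch, assuming each scraped message is a dict carrying the "id" field that Discord message objects have; the helper name is hypothetical.

```python
def dedupe_by_id(messages):
    """Keep the first occurrence of each message ID, preserving order."""
    seen = set()
    unique = []
    for message in messages:
        message_id = message["id"]
        if message_id in seen:
            continue  # the same message came back from an earlier response
        seen.add(message_id)
        unique.append(message)
    return unique
```

This only hides the symptom; it does not explain why the same message is returned twice.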

Dracovian avatar Nov 27 '19 22:11 Dracovian

It definitely doesn't scrape all channel content since the beginning, just the most recent activity.

ch40s avatar Jan 02 '20 04:01 ch40s

That's most likely a limitation on the Discord side (API or CDN). If we had output to text or JSON, we could easily confirm whether SQLite has anything to do with it.
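A check along those lines could look like the sketch below: dump whatever the scraper received to a JSON file, then count the distinct message IDs in it. The helper names are hypothetical, and it assumes the messages are available as a list of dicts with an "id" field.

```python
import json


def dump_messages(messages, path):
    """Write the raw message dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(messages, handle, ensure_ascii=False, indent=2)


def count_unique_ids(path):
    """Number of distinct message IDs stored in a dump written above."""
    with open(path, encoding="utf-8") as handle:
        return len({message["id"] for message in json.load(handle)})
```

If the JSON dump also contains only a handful of messages, the database layer is off the hook and the problem sits earlier, in the requests.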

ch40s avatar Jan 02 '20 21:01 ch40s

@ch40s SQLite might have something to do with incomplete text grabbing, but nothing to do with incomplete image grabbing.

I was able to confirm that the individual functions used to retrieve server contents via the undocumented Discord search API endpoint are functioning as expected in those screenshots.

So the issue likely lies with either the downloading function (which also had issues with embedded content due to missing request headers) or the file streaming function that I added to SimpleRequests.
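For comparison, this is roughly what a streamed download with explicit request headers looks like using only the Python standard library. It is a sketch, not the SimpleRequests code: the function name and the User-Agent value are placeholders, and which headers the CDN actually expects is an assumption this thread does not settle.

```python
import shutil
import urllib.request


def download_to_file(url, dest_path, user_agent="example-scraper/0.1", chunk_size=64 * 1024):
    """Stream `url` to `dest_path` in chunks instead of buffering it in memory."""
    # The header value is a placeholder; adjust it to whatever the server expects.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path
```

Comparing the size of the file on disk with the size reported for the attachment is a quick way to tell a truncated stream from a request that was rejected outright.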

Dracovian avatar Jan 03 '20 06:01 Dracovian

@carado I have added the experimental branch but I have yet to implement any decent configuration tutorial. Even then I figure that the configuration process should still be somewhat comparable to the one found in the master branch.

@ch40s The experimental branch currently defaults to writing to a JSON file for storing text data. Let me know if it's working for you or if it's not what you had in mind. I'm still going to implement the SQLite database method for the ease of traversal (at least for those who know their way around SQLite databases).

Also, if the experimental branch code is broken too, open up a new issue with the "experimental" tag so that I can easily differentiate the issues between the master branch code and the experimental branch code. Thank you all for your patience; hopefully I can figure out a solution to these issues that will last longer than a few months.

Dracovian avatar Jan 22 '20 11:01 Dracovian

Let's see if I have fixed this issue in the latest branch. This time around I'm making use of third-party modules in lieu of my own custom modules.

Dracovian avatar Apr 23 '20 10:04 Dracovian