Long Scan Times from Additional HTTP Requests
Describe the bug I noticed that TumblThree scan times are much higher than expected for blogs with duplicates, so I looked into it.
The TumblThree app seems to send an HTTP request to ".media.tumblr.com/" for each duplicate found, which creates a large number of additional HTTP requests. The initial JSON response ("/api/read/json?debug=1&num=...") seems to contain a unique file reference ID that could be pulled from "regular-body". Using it would greatly reduce the number of requests needed to complete the scan and reduce the server load. You can replicate this by enabling "force rescan" and using any HTTP logger of your choice. This issue affects rescans, reblogs, duplicates, etc., and I think a fix would be useful for a lot of users. Sadly, I don't have the coding background to fix this myself, which is why I am raising this issue.
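To illustrate the idea, here is a rough Python sketch (not TumblThree code, which is written in C#). It assumes a post's "regular-body" is an HTML string with double-quoted `src` attributes, and that the file name at the end of a media URL can serve as the file reference ID. Both are assumptions on my part, and `file_key`, `urls_needing_request` and `known_keys` are hypothetical names.

```python
import re
from urllib.parse import urlparse

# Matches double-quoted src attributes pointing at a *.media.tumblr.com host.
MEDIA_SRC = re.compile(r'src="(https?://[^"/]*\.media\.tumblr\.com/[^"]+)"')


def media_urls(regular_body: str) -> list[str]:
    """Return all media URLs embedded in a post's HTML body."""
    return MEDIA_SRC.findall(regular_body)


def file_key(url: str) -> str:
    """Hypothetical dedup key: the last path segment (file name) of the URL."""
    return urlparse(url).path.rsplit("/", 1)[-1]


def urls_needing_request(post: dict, known_keys: set[str]) -> list[str]:
    """Keep only the URLs whose key is not already in the local index."""
    body = post.get("regular-body", "")
    return [u for u in media_urls(body) if file_key(u) not in known_keys]


if __name__ == "__main__":
    # Made-up post data, for illustration only.
    post = {
        "regular-body": (
            '<p><img src="https://64.media.tumblr.com/abc123/tumblr_x_540.jpg"/>'
            '<img src="https://64.media.tumblr.com/def456/tumblr_y_540.jpg"/></p>'
        )
    }
    already_downloaded = {"tumblr_x_540.jpg"}
    for url in urls_needing_request(post, already_downloaded):
        print("would request:", url)
```

With a check like this, only files missing from the index would trigger a request to ".media.tumblr.com/". Whether the file name is stable enough to act as a unique ID is something I cannot confirm.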
To Reproduce Steps to reproduce the behavior:
- Set up HTTP monitoring or a debug trace for TumblThree.
- Start TumblThree with the deduplication setting enabled and rescan an existing site that was already processed.
- See the additional ".media.tumblr.com/" requests for files already in the index cache.
Expected behavior Fast scan times, with only the JSON request needed when the content consists of duplicates.
Desktop (please complete the following information):
- TumblThree version: v2.13
- OS: Windows 10 Home
- Browser: Chrome
- Version 125
Are you downloading normal or 'hidden' blogs? What are your settings? Any other relevant information?
Well, the missing information was that the already downloaded files were downloaded for another blog and not for the scanned one. And the affected posts are those with embedded images, so the JSON structure isn't that helpful.
We'll change it to check not only the current blog but also all other blogs for duplicates in this case.
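The difference between checking only the current blog and checking all blogs can be sketched as follows. This is an illustrative Python sketch, not the actual TumblThree change; `DuplicateIndex` and its methods are hypothetical names, and the file keys are made up.

```python
class DuplicateIndex:
    """Tracks which file keys have been downloaded, grouped by blog."""

    def __init__(self) -> None:
        self._files_by_blog: dict[str, set[str]] = {}

    def add(self, blog: str, file_key: str) -> None:
        """Record that file_key was downloaded for the given blog."""
        self._files_by_blog.setdefault(blog, set()).add(file_key)

    def in_blog(self, blog: str, file_key: str) -> bool:
        """Per-blog check: only looks at the blog being scanned."""
        return file_key in self._files_by_blog.get(blog, set())

    def in_any_blog(self, file_key: str) -> bool:
        """Cross-blog check: looks at every blog in the index."""
        return any(file_key in files for files in self._files_by_blog.values())


if __name__ == "__main__":
    index = DuplicateIndex()
    index.add("blog-a", "tumblr_x_540.jpg")

    # The file was downloaded for blog-a, and blog-b is being scanned now.
    print(index.in_blog("blog-b", "tumblr_x_540.jpg"))  # per-blog check misses it
    print(index.in_any_blog("tumblr_x_540.jpg"))        # cross-blog check finds it
```

In this sketch the per-blog lookup misses a file that was downloaded for another blog, which matches the situation described above, while the cross-blog lookup recognizes it as a duplicate.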
The issue has been fixed and closed. You can still comment. Feel free to ask for reopening the issue if needed.