bulk-downloader-for-reddit
[BUG] Imgur 404 error but link works in browser
- [x] I am reporting a bug.
- [x] I am running the latest version of BDfR
- [x] I have read the Opening an issue
Description
Imgur links keep giving a 404 error even though they work in my browser. An Imgur link such as https://i.imgur.com/xxxxxx.gifv opens in my browser, but https://i.imgur.com/xxxxxx WITHOUT the .gifv extension loads a 404 page. The two links that return 404 in the log I provided work fine in my browser when I use the i.imgur.com link that ends with the .gifv extension.
Command
python3 -m bdfr download L:\bdfr --subreddit thatsthespot --no-dupes
Environment
- OS: Windows 10
- Python version: 3.10.11
Logs
[2023-05-31 09:50:26,348 - bdfr.connector - DEBUG] - Setting maximum download wait time to 120 seconds
[2023-05-31 09:50:26,348 - bdfr.connector - DEBUG] - Setting datetime format string to ISO
[2023-05-31 09:50:26,349 - bdfr.connector - DEBUG] - Disabling the following modules:
[2023-05-31 09:50:26,349 - bdfr.connector - Level 9] - Created download filter
[2023-05-31 09:50:26,349 - bdfr.connector - Level 9] - Created time filter
[2023-05-31 09:50:26,349 - bdfr.connector - Level 9] - Created sort filter
[2023-05-31 09:50:26,350 - bdfr.connector - Level 9] - Create file name formatter
[2023-05-31 09:50:26,350 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2023-05-31 09:50:26,351 - bdfr.connector - Level 9] - Created site authenticator
[2023-05-31 09:50:26,802 - bdfr.connector - DEBUG] - Added submissions from subreddit thatsthespot
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved subreddits
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved multireddits
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved user data
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2023-05-31 09:50:38,557 - bdfr.downloader - DEBUG] - Attempting to download submission 13w4i73
[2023-05-31 09:50:38,558 - bdfr.downloader - DEBUG] - Using Imgur with url https://i.imgur.com/DnZYrnB.gifv
[2023-05-31 09:50:38,750 - bdfr.downloader - ERROR] - Site Imgur failed to download submission 13w4i73: Server responded with 404 to https://imgur.com/DnZYrnB
[2023-05-31 09:50:38,751 - bdfr.downloader - DEBUG] - Attempting to download submission 13vrada
[2023-05-31 09:50:38,751 - bdfr.downloader - DEBUG] - Using Redgifs with url https://redgifs.com/watch/parchedvalidhog
[2023-05-31 09:50:38,939 - bdfr.downloader - DEBUG] - File L:\bdfr\thatsthespot\twitchrule_She is really really cute when she want that cum_13vrada.mp4 from submission 13vrada already exists, continuing
[2023-05-31 09:50:38,939 - bdfr.downloader - INFO] - Downloaded submission 13vrada from thatsthespot
[2023-05-31 09:50:38,940 - bdfr.downloader - DEBUG] - Attempting to download submission 13vc2l6
[2023-05-31 09:50:38,940 - bdfr.downloader - DEBUG] - Using Redgifs with url https://www.redgifs.com/watch/pointlesscanineibizanhound#rel=user%3Aariacolexo;order=new
[2023-05-31 09:50:39,105 - bdfr.downloader - DEBUG] - File L:\bdfr\thatsthespot\ariacole___I am an expert at finding just the right spot.._13vc2l6.mp4 from submission 13vc2l6 already exists, continuing
[2023-05-31 09:50:39,106 - bdfr.downloader - INFO] - Downloaded submission 13vc2l6 from thatsthespot
[2023-05-31 09:50:39,106 - bdfr.downloader - DEBUG] - Attempting to download submission 13vnb2c
[2023-05-31 09:50:39,106 - bdfr.downloader - DEBUG] - Using Imgur with url https://i.imgur.com/SScXYtM.gifv
[2023-05-31 09:50:39,306 - bdfr.downloader - ERROR] - Site Imgur failed to download submission 13vnb2c: Server responded with 404 to https://imgur.com/SScXYtM
I also cannot download anything via Imgur, regardless of file type, despite the links working as intended in the browser.
In my case, though, the error for every download is:
Site Imgur failed to download submission xxxxxx: server responded with 404 to https://api.imgur.com/3/image/yyyyyyy
Can confirm as well; I can't download anything from Imgur either.
Yes, and navigating to the link in a browser reveals the reason: error | "Authentication required". It would seem the API has been gated. The error reported by BDFR is a 404, even though the actual error is a 401. This might be a bug in the code, unrelated to this issue.
The reason you're getting a 401 from that link is the same one I mention in #828: you're missing the auth headers needed to access that API link.
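To illustrate the missing-header point, here is a minimal sketch of how a request to the API URL from the error message could carry an `Authorization: Client-ID` header. The function name and the placeholder client ID are hypothetical, this is not BDFR's own code, and you would need a client ID registered with Imgur for the request to succeed.

```python
import urllib.request


def build_imgur_api_request(image_id: str, client_id: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for one Imgur image."""
    url = f"https://api.imgur.com/3/image/{image_id}"
    # The client ID travels in the Authorization header, not in the URL.
    headers = {"Authorization": f"Client-ID {client_id}"}
    return urllib.request.Request(url, headers=headers)


# Hypothetical values, for illustration only.
request = build_imgur_api_request("yyyyyyy", "YOUR_CLIENT_ID")
print(request.full_url)
print(request.get_header("Authorization"))
```

Opening the same URL in a browser sends no such header, which would be consistent with the "Authentication required" response described above.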
As for the rest of the issue at hand, a lot of things are being removed from Imgur right now. It seems they're removed from the API first, and the direct file links will sometimes keep working for a bit afterwards. You can work around this for direct links with an edit to download_factory, but I would not advise it long term: any dead link will just pick up the "removed" image and be treated as a successful download. Also, any malformed link provided by the Reddit API can end up downloading the HTML of the 404 page, because the downloader will not see the redirect and will think it's getting the right file. That is the main reason the change to the API was made in the first place.
If you are willing to run with those caveats or are willing to double-check them all here is the patch:
change this:
if re.match(r"(i\.|m\.|o\.)?imgur", sanitised_url):
    return Imgur
elif re.match(r"(i\.|thumbs\d{1,2}\.|v\d\.)?(redgifs|gifdeliverynetwork)", sanitised_url):
    return Redgifs
elif re.match(r"(thumbs\.|giant\.)?gfycat\.", sanitised_url):
    return Gfycat
elif re.match(r".*/.*\.[a-zA-Z34]{3,4}(\?[\w;&=]*)?$", sanitised_url) and not DownloadFactory.is_web_resource(
    sanitised_url
):
    return Direct
to this:
if re.match(r"(i\.|thumbs\d{1,2}\.|v\d\.)?(redgifs|gifdeliverynetwork)", sanitised_url):
    return Redgifs
elif re.match(r"(thumbs\.|giant\.)?gfycat\.", sanitised_url):
    return Gfycat
elif re.match(r".*/.*\.[a-zA-Z34]{3,4}(\?[\w;&=]*)?$", sanitised_url) and not DownloadFactory.is_web_resource(
    sanitised_url
):
    return Direct
elif re.match(r"(i\.|m\.|o\.)?imgur", sanitised_url):
    return Imgur
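The effect of moving the Imgur check last can be seen in a small standalone sketch. It reuses the regular expressions from the patch but returns plain strings instead of downloader classes. The `is_web_resource` helper, its extension list, and the "Unknown" fallback are hypothetical stand-ins, and it assumes `sanitised_url` has already had the scheme and any `www.` prefix stripped.

```python
import re

WEB_EXTENSIONS = ("htm", "html", "php", "js", "css")  # hypothetical stand-in list


def is_web_resource(url: str) -> bool:
    """Hypothetical stand-in for DownloadFactory.is_web_resource."""
    match = re.search(r"\.([a-zA-Z0-9]+)(\?.*)?$", url)
    return bool(match) and match.group(1).lower() in WEB_EXTENSIONS


def pick_downloader(sanitised_url: str) -> str:
    """Mirror the patched order: the direct-file check runs before the Imgur check."""
    if re.match(r"(i\.|thumbs\d{1,2}\.|v\d\.)?(redgifs|gifdeliverynetwork)", sanitised_url):
        return "Redgifs"
    elif re.match(r"(thumbs\.|giant\.)?gfycat\.", sanitised_url):
        return "Gfycat"
    elif re.match(r".*/.*\.[a-zA-Z34]{3,4}(\?[\w;&=]*)?$", sanitised_url) and not is_web_resource(sanitised_url):
        return "Direct"
    elif re.match(r"(i\.|m\.|o\.)?imgur", sanitised_url):
        return "Imgur"
    return "Unknown"  # fallback added for this sketch only


for url in ("i.imgur.com/DnZYrnB.gifv", "imgur.com/DnZYrnB", "redgifs.com/watch/parchedvalidhog"):
    print(url, "->", pick_downloader(url))
```

In this sketch, an Imgur URL that ends in a file extension is routed to the direct downloader, while an extensionless Imgur URL still falls through to the Imgur (API) downloader.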
With that patch, any .gifv links will be downloaded as .gifv files. If you would like them downloaded as mp4 instead, insert the two new lines (the if check and the replace) into downloader at line 96:
try:
    if submission.url.endswith(".gifv"):
        submission.url = submission.url.replace(".gifv", ".mp4")
    downloader_class = DownloadFactory.pull_lever(submission.url)
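As a standalone illustration of the same rewrite, the sketch below swaps only a trailing .gifv extension. The function name is hypothetical, and it assumes, as the snippet above does, that the same path is served with an .mp4 extension.

```python
def gifv_to_mp4(url: str) -> str:
    """Swap a trailing .gifv extension for .mp4, leaving other URLs unchanged."""
    if url.endswith(".gifv"):
        # removesuffix (Python 3.9+) touches only the end of the string,
        # whereas str.replace would rewrite every ".gifv" occurrence.
        return url.removesuffix(".gifv") + ".mp4"
    return url


print(gifv_to_mp4("https://i.imgur.com/DnZYrnB.gifv"))
print(gifv_to_mp4("https://i.imgur.com/DnZYrnB.jpg"))
```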
These edits are provided as-is and I won't be providing additional support for them.
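For anyone who does want to double-check downloads made with the patch, the two caveats (a dead link returning the "removed" image, or the HTML of a 404 page being saved) could be screened with a heuristic like the one below. This is only a sketch: the function name is hypothetical, and the assumption that removed images end up at a file named `removed.png` should be verified against what Imgur currently serves.

```python
from urllib.parse import urlparse


def looks_like_dead_download(final_url: str, content_type: str) -> bool:
    """Heuristic: True if a 'successful' download is probably not the real file."""
    # An HTML body means a web page (e.g. a 404 page) was served, not media.
    if content_type.split(";")[0].strip().lower() == "text/html":
        return True
    # Assumption: removed images redirect to a file named removed.png.
    if urlparse(final_url).path.rsplit("/", 1)[-1] == "removed.png":
        return True
    return False


print(looks_like_dead_download("https://i.imgur.com/removed.png", "image/png"))
print(looks_like_dead_download("https://i.imgur.com/DnZYrnB.mp4", "video/mp4"))
```

Here `final_url` is meant to be the URL after any redirects and `content_type` the value of the response's Content-Type header.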
Oh, I understand now. Some of the submissions were very recent, so I hadn't considered that they could already have been removed.
or are willing to double-check them all here is
@OMEGARAZER
Is there a way to figure out which files need to be double checked? Then a way to save the corresponding file to the right location, named and all?
@AlexTu2
bdfr has the --no-dupes option that promises to avoid downloading the same image/video twice by comparing hashes. Since the 'removed' image is the same every time, that option catches it. You'll just get one of them and bdfr will skip all other posts that were removed by imgur.
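The idea behind hash-based duplicate skipping can be sketched as follows. This is not BDFR's implementation: the function name is hypothetical, and the use of MD5 is an assumption made for the example.

```python
import hashlib


def drop_duplicate_resources(resources):
    """Yield (name, content) pairs, skipping content whose hash was already seen."""
    seen_hashes = set()
    for name, content in resources:
        digest = hashlib.md5(content).hexdigest()
        if digest in seen_hashes:
            continue  # same bytes as an earlier download, e.g. the placeholder
        seen_hashes.add(digest)
        yield name, content


# Made-up byte strings standing in for downloaded files.
downloads = [
    ("a.png", b"removed-placeholder"),
    ("b.png", b"real image"),
    ("c.png", b"removed-placeholder"),
]
kept = [name for name, _ in drop_duplicate_resources(downloads)]
print(kept)
```

Because every copy of the "removed" image has identical bytes, only the first one survives this kind of filter.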
I'm currently re-downloading my saved posts with this fix and the --no-dupes option, the log displays "Resource hash d835884373f4d6c8f24742ceabe74946 from submission
Plus the images are all exactly the same (absurdly low) size. It's easy to use a tool like find to get them all.
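As an alternative to a command-line tool, a short script can group files by size so that the identically sized placeholder images stand out. The function name is hypothetical, and files sharing a size are only candidates, so they should still be checked by eye or by hash.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_size(root: Path) -> dict[int, list[Path]]:
    """Map each file size (in bytes) to the files under root that share it."""
    groups = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[path.stat().st_size].append(path)
    # Keep only sizes shared by more than one file.
    return {size: sorted(paths) for size, paths in groups.items() if len(paths) > 1}
```

Calling `group_files_by_size` on the download directory returns only the sizes that occur more than once, with the matching paths sorted for easier review.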