Internet Archive API seems to be broken as of Jul 10, 2020: All URLs getting marked as down
Hi everyone, I wonder who else is still using Amber!?
We've still got it on globalvoices.org, where our installs cumulatively have millions of URLs being tracked by Amber.
We use it with Internet Archive as our backend, which has worked well for years now.
Unfortunately, as of July 10, 2020, Amber has gone haywire: it is marking every URL as down and showing the JS popup box on each link regardless of its actual up/down status. Obviously this is disastrous, as all our brand new posts are having their links falsely marked as down.
I think there are two key problems here, one with the actual IA API, and one with how Amber handles API failures:
The Internet Archive API seems to no longer be returning the header we need.
After much spelunking in the amber_wordpress code, I found the place where Amber's attempts to fetch a cache from IA are failing.
In Amber->fetch_item() it uses $fetcher->fetch($item), which in turn calls InternetArchiveFetcher->fetch().
At that point it goes to the endpoint for the fetched URL:
$api_endpoint = join("", array(
    $this->archiveUrl,
    "/save/",
    $url));
$ia_result = AmberNetworkUtils::open_single_url($api_endpoint, array(), FALSE);
This gives a URL like:
http://web.archive.org/save/http://google.com
It then takes the result, $ia_result, and analyzes its "info" and "headers".
['headers']['Content-Location'] is missing every time
For me the issue comes up when it looks for $ia_result['headers']['Content-Location']:
if (!isset($ia_result['headers']['Content-Location'])) {
    throw new RuntimeException("Internet Archive response did not include archive location");
}
For me, every single URL being requested is getting this error.
"Internet Archive response did not include archive location" is being saved in the message field of amber_check.
So what do we do about that problem? Seems to me that there's something wrong with the endpoint:
http://web.archive.org/save/http://google.com
When I visit it I don't get a ['Content-Location'] header, or anything that really looks like it. Maybe this changed?
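To make the failure concrete, here's the check isolated into a standalone helper. This is a sketch for illustration, and extract_archive_location() is my own name, not an Amber function; it takes a headers array shaped like the one AmberNetworkUtils::open_single_url() returns:

```php
<?php
// Hypothetical helper isolating the failing check from
// InternetArchiveFetcher->fetch(): look up the archive location the way
// Amber expects to find it in the response headers.
function extract_archive_location(array $headers) {
    if (!isset($headers['Content-Location'])) {
        return null; // the branch every URL hits since ~July 10, 2020
    }
    return $headers['Content-Location'];
}

// Shape the old endpoint's headers appear to have had (illustrative values):
$old_headers = array('Content-Location' => '/web/20200701000000/http://google.com');
// Shape of what comes back now: the header is simply absent.
$new_headers = array('content-type' => 'text/html; charset=utf-8');
```

With $new_headers the helper returns null, which is exactly the condition under which Amber throws its RuntimeException.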
Okay, so that leads me to the other part of this problem, which I think is a genuine bug in Amber that just happens to be triggered by this particular "outage" of the IA API.
When the IA API fails, Amber completely stomps on the amber_check record for the URL, making it seem down when it's not
It makes sense that when Amber->fetch_item() fails to generate a cache, it makes a note of that failure in the database. But what currently happens is the most extreme possible version of that, so extreme that I suspect it was unintentional.
In Amber->fetch_item(), after running $fetcher->fetch() (in this case, ultimately calling InternetArchiveFetcher->fetch()), it checks for errors, and if there are any, it re-saves the amber_check db record with $status->save_check():
} catch (RuntimeException $re) {
    $update['message'] = $re->getMessage();
    $update['url'] = $item;
    $status->save_check($update);
    return false;
}
The problem with this is that earlier in the execution we already generated and saved a save_check() record based on our direct checking of the URL.
The earlier save happens in Amber->cache_link() just after running $checker->check():
if (($update = $checker->check(empty($last_check) ? array('url' => $item) : $last_check, $force)) !== false) {
    $status->save_check($update);
When a URL is working, that code will save accurate information like this to the db:
id: c7b920f57e553df2bb68272f61570210
url: http://google.com
status: 1
last_checked: 1595885887
next_check: 1596058687
message:
This is what we want, of course. If a URL is working then we want its status to be 1 so that the frontend of our sites won't throw up the "this is probably down" popup when people click.
We also want to keep the last_checked and next_check values.
But an API failure deletes all that info!
The issue is that by re-saving the amber_check value later, during Amber->fetch_item(), we end up obliterating all that info, and replacing it with a nearly empty record (just message and url, because that's what we saved in the code above):
id: c7b920f57e553df2bb68272f61570210
url: http://google.com
status: 0
last_checked: 0
next_check: 0
message: Internet Archive response did not include archive location
It's possible this is intentional, but if so, it seems like a really bad idea.
The only new info we really have is the message, so updating only that is what makes the most sense to me.
Proposed code fix
The following update to Amber->fetch_item() fixes the problem for me, by first loading the full array of amber_check info for the URL (which was just generated a second ago!), then updating only the message before resaving it:
} catch (RuntimeException $re) {
    // $update['message'] = $re->getMessage();
    // $update['url'] = $item;
    // $status->save_check($update);
    $new_check_record = $status->get_check($item);
    $new_check_record['message'] = $re->getMessage();
    $status->save_check($new_check_record);
If we fix that part of the code then even when the API fails to satisfy our caching code, we at least still have an accurate picture of whether the URL is up or down, so we're not ending up with JS popups when people click links that should have just worked.
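As a standalone illustration of what the patch is going for (merge_check_failure() is a made-up name, not Amber code): keep every field of the existing amber_check record and overwrite only the message.

```php
<?php
// Sketch of the desired failure behavior: preserve the freshly-saved
// up/down status and timestamps, and record only the new error message.
// merge_check_failure() is illustrative, not part of Amber.
function merge_check_failure(array $existing_check, $error_message) {
    $existing_check['message'] = $error_message;
    return $existing_check;
}

// Record as saved moments earlier by $checker->check() for a working URL:
$check = array(
    'url'          => 'http://google.com',
    'status'       => 1,          // the URL itself is up
    'last_checked' => 1595885887,
    'next_check'   => 1596058687,
    'message'      => '',
);
$updated = merge_check_failure($check, 'Internet Archive response did not include archive location');
// $updated still has status 1, so the frontend popup stays off.
```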
Conclusion
There are other bugs and bits of weird behavior I've discovered while trying to figure this out, but these two are the ones that are totally ruining my ability to keep Amber active on our site.
If anyone is still working on this plugin, please consider updating the save_check() logic for the sake of everyone.
If anyone at all knows of a change I can make to InternetArchiveFetcher->fetch() that will make the actual caching start working again, I would love to know!
Thanks to anyone who took the time to read through this.
@jlicht Not sure if you're still working on this, or who might know the answer, but I'd appreciate it if you had time to take a look at this issue, if only to tell me whether I'm crazy in thinking that the Internet Archive API endpoint is totally broken or not.
🙏🏻
Hi Jer, I used to work on Amber but am no longer at BKC. I'm going to include @jsdiaz here to make sure someone internal to the Organization sees this. I'm sure they'll appreciate the detailed issue report!
Thanks Ryan!
If anyone sees this and just knows where I can find docs for the web.archive.org/save/ endpoint from InternetArchiveFetcher that would help a lot.
Some more research shows lots of people expecting that Content-Location header to be there (a Stack Overflow question, plus researchers posting about it with examples as recent as July 8, but not after July 10).
But my attempts to replicate their example code on the command line aren't going well. I'm not getting Content-Location, but I'm also not getting any 200 either, just a variety of 50x errors after very long waits.
Maybe this comes down to some big problems over at IA. I tried tweeting them but didn't get anything back.
@jerclarke I don't see any reason to doubt your analysis of the problem, either with regard to IA's behavior or the bug you've identified. Thanks for the detailed investigation!
Alright, an update on the API question.
This post from Oct 2019 seems very relevant: The Wayback Machine’s Save Page Now is New and Improved
It doesn't mention the "endpoint" style usage that Amber and many bookmarklets made of the old web.archive.org/save/ service, but it talks about a big change to the overall "Save Page Now" feature, and this comment implies that the GET-based web.archive.org/save/ service has been broken to some degree since the Oct 2019 changes:
I’ve been using a Save Page Now bookmarklet that doesn’t work anymore since this feature was launched. It simply appends the URL:
javascript:(function(){location.href='http://web.archive.org/save/'+(location.href);})();
I looked at the new source and the problem is the form now requires a fancy POST method. Why break what has worked?
Unfortunately I can't find any links anywhere to indicate what the POST method might be, and I'm not sure where the "source" mentioned in the comment could be found.
Another lead is this gist that uses the old endpoint, whose author also seems to think there is a new POST system (not sure if they are basing it on the same comment I found or not).
I tried just looking at dev tools when using the website version of /save/ and it seems like the POST request is super simple, just url=$url.
When I run that request through PHP (WordPress HTTP API) it seems to work based on the content that's returned, but there's still no Content-Location header:
$result = wp_remote_post('http://web.archive.org/save/', array( 'body' => array('url'=>'http://google.com')));
Headers:
[server] => nginx/1.15.8
[date] => Thu, 30 Jul 2020 22:48:12 GMT
[content-type] => text/html; charset=utf-8
[cache-control] => no-cache
[x-app-server] => wwwb-app102
[x-ts] => 200
[x-location] => /save/
[x-cache-key] => httpweb.archive.org/save/MX
[content-encoding] => gzip
Relevant section of body:
Saving page http://google.com
The capture is estimated to start in 0 minutes.
Save also in my web archive.
Done!
In the normal "browser" version of the save page, it first shows a progress message like "Saving..." then eventually you get this message:
A snapshot was captured. Visit page: /web/20200730225500/https://www.google.com/
It's possible there's no API way to get the content-location without waiting for that page anymore...
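One possible workaround, assuming the save page's final HTML reliably contains that "Visit page: /web/<timestamp>/<url>" link (a guess based on the output above, not a documented API): poll the page until it finishes and pull the snapshot path out with a regex. A minimal sketch of the extraction step:

```php
<?php
// Speculative workaround: scrape the snapshot path out of the save page's
// final HTML. The /web/<14-digit timestamp>/<url> shape is inferred from
// the "Visit page" message above, not from any documentation.
function extract_snapshot_path($html) {
    if (preg_match('#(/web/\d{14}/\S+)#', $html, $matches)) {
        return $matches[1];
    }
    return null;
}

// Body text as observed in the browser flow:
$body = 'A snapshot was captured. Visit page: /web/20200730225500/https://www.google.com/';
// extract_snapshot_path($body) returns '/web/20200730225500/https://www.google.com/'
```

That path could then be prefixed with http://web.archive.org and stored the way the Content-Location value used to be, though it's fragile since it depends on IA's page wording.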
Alright, I think that was an appropriate amount of effort in the name of archiving the open web.
Hopefully someone someday finds this and gives me a hint at how I can programmatically save a URL to the Wayback Machine and get the path to the archive to save to the amber_cache db.
Until then, sadly, Amber will have to go dormant on Global Voices. RIP sweet Amber, you were a pain in the butt as well as a worthy moonshot.
Making some notes about things I couldn't find earlier using Google:
"api.archivelab.org" seems to have an API that is relevant, and which was noted as down then "fixed" over the last few days:
https://github.com/ArchiveLabs/api.archivelab.org/issues/21
When I tried it, it was still broken though, so 🤷🏻‍♀️
These docs seem to offer an API that will work at least in theory for Amber:
https://archive.readme.io/docs/creating-a-snapshot
The documentation site links to this JSON API for Wayback, which I had seen before, but it doesn't seem to offer an option to save, and thus doesn't seem like it will help Amber, though maybe I'm missing something. I'm not super hot at JSON API usage.