amp2html
Fails to remove trailing amp URL markup
Here are two examples which redirect to the wrong page:
https://www.google.com/amp/s/www.etonline.com/the-bachelorette-season-16-episode-10-Tayshia-Adams-men-tell-all-2020-12-14-live-updates%3famp
- I just want to say for posterity's sake that I have no interest in the bachelorette, but some of my friends do 😂
https://www.google.com/amp/s/www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/%3foutputType=amp
The issue is the trailing part of the URL. After the redirect, the first example's URL has a %3famp at the end which, when removed, gives the correct page,
https://www.etonline.com/the-bachelorette-season-16-episode-10-Tayshia-Adams-men-tell-all-2020-12-14-live-updates
The second example's URL has an extra ?outputType=amp/ which, when removed, gives the correct page,
https://www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/
Some other similar examples such as
https://www.google.com/amp/s/nypost.com/2020/12/07/queens-boy-steals-family-car-drives-away-with-sister-to-nj/amp/
https://www.google.com/amp/s/m.jpost.com/omg/former-israeli-space-security-chief-says-aliens-exist-humanity-not-ready-651405/amp
enjoy server-side redirects to the content of 'interest' (again: no real interest, just an example).
This may be a server-side problem---but if half of servers have a wrong implementation, it's something this extension could help with.
https://www.google.com/amp/s/www.etonline.com/the-bachelorette-season-16-episode-10-Tayshia-Adams-men-tell-all-2020-12-14-live-updates%3famp https://www.google.com/amp/s/www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/%3foutputType=amp
Okay, so this is a bug in the extension. It should URL-decode the URL to get “?amp” instead of “%3famp”. Once that is fixed, it should find the right pages and redirect to the right location.
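A minimal sketch of the decoding step described above, not the extension's actual code. The function name `decodeCacheUrl` is hypothetical, and it assumes the whole cache URL can be decoded in one pass.

```javascript
// Sketch only: decode a percent-encoded AMP Cache URL before further processing.
// decodeURIComponent turns "%3f" into "?"; decodeURI would leave it encoded,
// because "?" is a reserved URI character.
function decodeCacheUrl(url) {
  return decodeURIComponent(url);
}

const encoded =
  "https://www.google.com/amp/s/www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/%3foutputType=amp";

console.log(decodeCacheUrl(encoded));
// https://www.google.com/amp/s/www.washingtonpost.com/nation/2020/12/09/idaho-coronavirus-protest-homes/?outputType=amp
```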
https://www.google.com/amp/s/nypost.com/2020/12/07/queens-boy-steals-family-car-drives-away-with-sister-to-nj/amp/ https://www.google.com/amp/s/m.jpost.com/omg/former-israeli-space-security-chief-says-aliens-exist-humanity-not-ready-651405/amp
The extension redirects these URLs to the canonical HTML location when the page loads. What problem are you seeing with these ones? It’s not the same issue as with the others.
The latter ones I didn't have a problem with---I just noticed that they had amp-related suffixes.
When I load the target of the redirect (which succeeds), the server rewrites the URL again, removing the /amp or /amp/ suffix (among other rewrites). So I wasn't sure if, according to the AMP protocol, handling the suffix is something the extension should be doing or if it's the server's responsibility. So those latter examples aren't problems, they're just contrasts for the problematic ones---URLs where an amp-related suffix is handled correctly (even though it is ignored by the extension).
Hm. So the AMP Cache URLs shouldn’t have been urlencoded, and decoding them when they’re not encoded can lead to unexpected results. They’re not urlencoded on the publisher’s AMP page, nor when served by Google or Bing. Where did these URLs come from? Did someone send them to you? Using what app? Or did you get them from a search result page? If I can find out where they came from, I can figure out where things went wrong and what the underlying problem is.
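To illustrate why decoding a URL that was never encoded is risky, here is a small sketch. The `safeDecode` helper and the example.com URLs are hypothetical and only show two failure modes; they are not taken from the extension.

```javascript
// Hypothetical guard: try to decode, and fall back to the input if decoding fails.
function safeDecode(url) {
  try {
    return decodeURIComponent(url);
  } catch (e) {
    // A literal "%" that is not part of a valid escape makes decodeURIComponent throw URIError.
    return url;
  }
}

// An unencoded URL containing a literal "%" cannot be decoded, so it comes back unchanged.
console.log(safeDecode("https://example.com/sale-50%-off"));
// https://example.com/sale-50%-off

// A deliberately encoded "/" inside a path segment becomes a real path separator,
// which changes what the URL points to.
console.log(safeDecode("https://example.com/a%2Fb"));
// https://example.com/a/b
```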
When I load the target of the redirect (which succeeds), the server rewrites the URL again, removing the /amp or /amp/ suffix (among other rewrites). So I wasn't sure if, according to the amp protocol, handling the suffix is something the extension should be doing or if it's the server's responsibility.
AMP redirection happens in two steps. The first step is to redirect away from an AMP Viewer or AMP Cache (e.g. Google, Bing, Baidu). This is done by rewriting URLs that match a pseudo-standard pattern, which redirects you to the publisher’s AMP page. As you’ve noticed, every publisher applies their own patterns to determine the URL for the AMP versions of their pages. The extension can’t blindly chop off suffixes like “/amp/” because it can’t know whether that refers to the AMP tech or whether, e.g., the page is about an electronic amplifier. The second step is parsing the AMP page to find the non-AMP version of that page. Lastly, you’re redirected to the non-AMP version.
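The two steps can be sketched roughly as follows. This is an illustration under assumptions, not the code in redirector.js: both function names are hypothetical, only the Google viewer pattern seen in this thread is handled, and step two assumes the AMP page points to its non-AMP version through a link rel="canonical" element.

```javascript
// Step 1 (assumed pattern): https://www.google.com/amp/s/<host>/<path> -> https://<host>/<path>
// The "s/" segment is assumed to mark an HTTPS origin; without it, HTTP is assumed.
// Publisher suffixes such as "/amp/" are deliberately left untouched.
function cacheUrlToPublisherUrl(cacheUrl) {
  const match = /^https:\/\/www\.google\.com\/amp\/(s\/)?(.+)$/.exec(cacheUrl);
  if (!match) return null;
  const scheme = match[1] ? "https://" : "http://";
  return scheme + match[2];
}

// Step 2: on the publisher's AMP page, look up the canonical (non-AMP) URL.
// `doc` is any object with a DOM-like querySelector, e.g. `document` in a content script.
function findCanonicalUrl(doc) {
  const link = doc.querySelector('link[rel="canonical"][href]');
  return link ? link.href : null;
}

console.log(cacheUrlToPublisherUrl(
  "https://www.google.com/amp/s/nypost.com/2020/12/07/queens-boy-steals-family-car-drives-away-with-sister-to-nj/amp/"
));
// https://nypost.com/2020/12/07/queens-boy-steals-family-car-drives-away-with-sister-to-nj/amp/
```

In a content script, the result of `findCanonicalUrl(document)` would then be the target of the final redirect, if it is not null.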
It’s a convoluted journey, but it gets there in the end. Complications like this are why many web developers feel strongly against AMP in the first place. There should be only one web page and one URL for any given document.
These URLs came to me via Slack. I do not know how the people on the other end got them to be the way they are: Android vs iOS, copy/paste vs an OS share button, etc. But I can inquire if it'd help.
I'm with you on anti-AMP sentiments, which is why I was delighted to find this extension in the first place!
https://github.com/da2x/amp2html/blob/088d9f7fd72eb5c7c72ab4ea6c984e35b963d677/scripts/redirector.js#L21-L36
I’ve attempted a fix. It’s not great; see the code comments in the lines linked above.
I’ll release it in a few days after some more testing.
You can help with testing.
Disable the current extension. Download and unzip the test version, and load it temporarily in Firefox or Chrome/Edge.