support HTML meta http-equiv=refresh redirects
right now, when we resolve URLs and follow redirects in `util.follow_redirects()`, we support HTTP 301/302 and the HTTP `Refresh` header, but not HTML `<meta http-equiv=refresh ...>`.
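For reference, here's a minimal sketch of what extracting that redirect could look like, assuming BeautifulSoup; `parse_meta_refresh` and its signature are illustrative, not Bridgy's actual API:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_meta_refresh(html, base_url):
    """Return the target of a <meta http-equiv=refresh> tag, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('meta', attrs={
        'http-equiv': lambda v: v and v.lower() == 'refresh'})
    if not tag:
        return None

    # content looks like: 0; url=/blog/2014/08/02/permashortlinks/
    parts = tag.get('content', '').split(';', 1)
    if len(parts) == 2 and parts[1].strip().lower().startswith('url='):
        url = parts[1].strip()[4:].strip().strip('\'"')
        return urljoin(base_url, url)
    return None
```

e.g. `parse_meta_refresh(html, 'http://www.pierre-o.fr/s/9.htm')` would return `http://www.pierre-o.fr/blog/2014/08/02/permashortlinks/` for the page above.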
this is currently biting @pierreozoux (https://www.brid.gy/twitter/pierreozoux). for example, http://www.pierre-o.fr/s/9.htm doesn't advertise a webmention endpoint, but includes an HTML meta refresh redirect to http://www.pierre-o.fr/blog/2014/08/02/permashortlinks/ , which does.
Yes, I didn't want to open an issue since it's more my own issue :) I was looking at the possibilities on GitHub Pages; there are none. I was also looking at your code, and wondering if there's a standard way of doing this?
Like, how would Google parse such a page?
Like, should we follow the `<link rel=canonical href="/blog/2014/08/02/permashortlinks/">`?
Or should we follow `<meta http-equiv=refresh content="0; url=/blog/2014/08/02/permashortlinks/">`?
If you have some ideas, then when I have time, I'll dig into that in your Python code!
thanks for the info! so, browsers basically ignore `link rel=canonical` but follow `meta http-equiv=refresh`. we probably want bridgy to do the same thing. (google probably pays attention to both, but the details there don't really matter as much.)
as for github pages, would the jekyll-redirect-from plugin work? details: https://help.github.com/articles/redirects-on-github-pages , http://joshualande.com/redirect-urls-jekyll-github/
...ah, never mind, i see you're already using jekyll-redirect-from.
hey @pierreozoux! just checking in in case you're still working on this but stuck on something. i'm happy to help if so!
I was wrong, this wouldn't go in `follow_redirects`, since that only makes HEAD requests, which wouldn't see the HTML contents (quick illustration after the quote below). From #1322:
> Great! I think it would probably go here, and new test(s) here.
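To make the HEAD limitation concrete, a quick illustration with requests, using the example URL from earlier in the thread:

```python
import requests

# HEAD follows HTTP 3xx redirects, but a HEAD response has no body,
# so an HTML <meta http-equiv=refresh> tag is invisible to it.
resp = requests.head('http://www.pierre-o.fr/s/9.htm', allow_redirects=True)
print(repr(resp.text))  # '' — nothing to parse

# only a GET downloads the HTML where the meta tag lives
resp = requests.get('http://www.pierre-o.fr/s/9.htm')
print('http-equiv' in resp.text.lower())  # True, if the page still uses the tag
```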
This is what I was thinking so far. I haven't implemented any tests quite yet.
I think the place you directed me to first was correct, because that is the first point that grabs contents.
I see what you are referring to now. I will add the same `parse_http_equiv` function that I added in the webutil package into the `follow_redirect` function there. I'll make `requests_fn` dual: it checks with `requests_head` first, and for non-redirect responses falls back to a `requests_get` that checks for a meta redirect (see the sketch below). As for performance, this will require that every attempt to get the webmention target is now parsed with BeautifulSoup. Not sure how much overhead that will add, but I will make those changes.
Edit: I see that it is cached hourly, so performance probably isn't a problem!
Edit 2: Added clarification about what I'll add.
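Roughly what I have in mind, as a sketch; all of these names are hypothetical, not existing webutil API:

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def follow_redirect(url, **kwargs):
    # hypothetical sketch, not webutil's actual API: try HEAD first, and
    # only fall back to GET when there was no HTTP-level redirect.
    resp = requests.head(url, allow_redirects=True, **kwargs)
    if resp.url != url:
        return resp  # HTTP 3xx redirect, already resolved by requests

    # no HTTP redirect: fetch the body and look for a meta refresh tag
    resp = requests.get(url, **kwargs)
    soup = BeautifulSoup(resp.text, 'html.parser')
    tag = soup.find('meta', attrs={
        'http-equiv': lambda v: v and v.lower() == 'refresh'})
    if tag:
        match = re.search(r'url\s*=\s*[\'"]?([^\'";]+)',
                          tag.get('content', ''), re.I)
        if match:
            return requests.get(urljoin(url, match.group(1).strip()), **kwargs)
    return resp
```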
Thank you for the details! And for noticing the overhead. Webmention endpoints are indeed cached, but per domain, and Bridgy regularly runs `get_webmention_target` on 10-20k+ domains. That's currently ~2qps steady state, so I'm reluctant to switch all of that to GETs and start downloading all of those pages.

How about adding it to `discover` in https://github.com/snarfed/webutil/blob/main/webmention.py instead? That already downloads the full response body.
I've also asked on #indieweb-dev if anyone else has experience or thoughts: https://chat.indieweb.org/dev/2022-10-19#t1666148871932600
> How about adding it to `discover` in https://github.com/snarfed/webutil/blob/main/webmention.py instead? That already downloads the full response body.
Sure! I can do that. It will still require doubling the GET requests, though, since an additional search through the content will have to be made.
If this is not feasible due to performance, I completely understand and am willing to scrap my idea.
Hmm! `discover` already makes a GET request, and parses the HTML contents if there's no webmention endpoint in the headers, so it shouldn't need any additional HTTP requests or expensive parsing, right?
Right, but after finding that http-equiv in the parsed document's head, another request must be made to get the webmention endpoint at the new URL. For example:
`discover(url='https://slyduda.com/p/29835')` → `'https://slyduda.com/post/0238975ur290834572038945'` (http-equiv)
`discover(url='https://slyduda.com/post/0238975ur290834572038945')` → `'https://webmention.io/slyduda.com/webmention'` (webmention meta tag)
And now I realize what you're saying: it wouldn't necessarily impact performance much, because we 1) try to find the webmention URL, and only if it doesn't exist do we 2) look for an http-equiv! Got it!!! :)
Right!
One other thing: this should probably go behind a new boolean kwarg in `discover`, maybe `follow_meta_refresh`, that defaults to `False`.
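Loosely, the shape I'm imagining (a sketch only; the real `discover` in webutil differs in its details):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def discover(url, follow_meta_refresh=False, **kwargs):
    """Sketch of the proposed shape, not webutil's actual implementation."""
    resp = requests.get(url, **kwargs)

    # 1) webmention endpoint in the HTTP Link header?
    if 'webmention' in resp.links:
        return resp.links['webmention']['url'], resp

    # 2) endpoint in a <link> or <a> with rel=webmention in the body?
    soup = BeautifulSoup(resp.text, 'html.parser')
    tag = soup.find(['link', 'a'], rel='webmention')
    if tag and tag.get('href'):
        return urljoin(resp.url, tag['href']), resp

    # 3) no endpoint found: optionally follow a client-side meta refresh,
    # reusing the body we already downloaded. recurse with the default
    # follow_meta_refresh=False so we only ever follow one hop.
    if follow_meta_refresh:
        refresh = soup.find('meta', attrs={
            'http-equiv': lambda v: v and v.lower() == 'refresh'})
        content = refresh.get('content', '') if refresh else ''
        if 'url=' in content.lower():
            target = content.split('=', 1)[1].strip().strip('\'"')
            return discover(urljoin(resp.url, target), **kwargs)

    return None, resp
```

Defaulting to `False` keeps existing callers' behavior and request volume unchanged unless they opt in.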
Confirmed working for me after refreshing the cached response!
Woo, awesome, glad to hear it! Thank you again!
Noticed that webmentions are not being sent to the correct target. This is okay-ish for my use case, but definitely not ideal. I would like to add a third return value to the `discover` function in the webutil package that gets populated with the last client-side redirect, if any. In Bridgy's `discover` we can replace `target`'s value with the `corrected_target` value if it is returned by the webutil `discover` function. This is a little less elegant than I had hoped for, but would be the easiest way to fix this last little issue.
For added clarification: the target in the original post is slyduda.com/p/4thf0y, and the endpoint is found correctly during discover. But since the target is not changed to the client-side redirected value, the webmention goes to the original target instead of the updated target https://slyduda.com/post/6c9255e6-832e-4632-b7a2-978f30a9046b.
Apologies for not catching this earlier. I got really happy when I saw "Sent!" and I forgot to check the webmention results.
Good follow-up! This can definitely be an awkward surprise, but it's actually intended webmention behavior. As part of receiving a webmention, the receiver fetches the source page and checks that it contains a link to the target. If we change the target, the receiver (webmention.io here) will look for a link that doesn't exist in the source page, so verification will fail. It's up to the receiver to handle shortlinks and other redirecting targets.
The spec discusses this briefly:
> The receiver SHOULD check that target is a valid resource for which it can accept Webmentions. This check SHOULD happen synchronously to reject invalid Webmentions before more in-depth verification begins. What a "valid resource" means is up to the receiver. For example, some receivers may accept Webmentions for multiple domains, others may accept Webmentions for only the same domain the endpoint is on.
For webmention.io, I expect it would return these wms with the shortlink targets, and you'd have to merge them on your end. Some background in https://github.com/aaronpk/webmention.io/issues/92.
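To make the verification step concrete, here's roughly what a receiver does, as a sketch (not webmention.io's actual code):

```python
import requests
from bs4 import BeautifulSoup


def verify(source, target):
    """Receiver-side check, per the spec: does the source page link to target?"""
    resp = requests.get(source)
    soup = BeautifulSoup(resp.text, 'html.parser')
    return any(a.get('href') == target for a in soup.find_all('a'))


# if Bridgy rewrote the target to the post-redirect URL, this check fails:
# the source page links to the shortlink, not the final URL.
```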
(Oh btw, with your client-side redirect change, the `requests.Response` that `discover` returns will now contain the final redirected URL, so users can get it if they do need it.)
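Using the return shape from the sketch above, that would look something like:

```python
endpoint, resp = discover('https://slyduda.com/p/29835', follow_meta_refresh=True)
print(resp.url)  # final URL after the meta refresh hop
```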
> For webmention.io, I expect it would return these wms with the shortlink targets, and you'd have to merge them on your end. Some background in https://github.com/aaronpk/webmention.io/issues/92.
Thank you for the background. This makes sense. From what I'm gathering, in order for what I described to work, the whole spec would have to account for client-side redirects, which doesn't seem feasible. Though it works well enough now, and I can definitely provide both values in the webmention URL fetch from webmention.io!
Thank you so much for the help and guidance! I really appreciate it.
You're welcome! Glad this is still working out ok. Feel free to jump into chat or https://github.com/w3c/webmention/issues to discuss more, eg https://github.com/w3c/webmention/issues/103 or others in https://github.com/w3c/webmention/issues?q=is%3Aissue+redirect !