
support HTML meta http-equiv=refresh redirects

Open snarfed opened this issue 10 years ago • 4 comments

right now, when we resolve URLs and follow redirects in util.follow_redirects(), we support HTTP 301/2 and the HTTP Refresh header, but not HTML <meta http-equiv=refresh ...>.

this is currently biting @pierreozoux (https://www.brid.gy/twitter/pierreozoux). for example, http://www.pierre-o.fr/s/9.htm doesn't advertise a webmention endpoint, but includes an HTML meta refresh redirect to http://www.pierre-o.fr/blog/2014/08/02/permashortlinks/ , which does.

snarfed • Aug 04 '14 17:08

Yes, I didn't want to open an issue as it is more my issue :) I was looking at possibilities on GitHub Pages. There are none. I was also looking at your code and wondering whether there is a standard way of doing this.

For example, how would Google parse such a page? Should we follow the <link rel=canonical href="/blog/2014/08/02/permashortlinks/">, or should we follow <meta http-equiv=refresh content="0; url=/blog/2014/08/02/permashortlinks/">?

If you have some ideas, then when I have time, I'll dig into that with your Python code!

pierreozoux • Aug 05 '14 08:08

thanks for the info! so, browsers basically ignore link rel=canonical but follow meta http-equiv=refresh. we probably want bridgy to do the same thing. (google probably pays attention to both, but the details there don't really matter as much.)
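To make the browser behavior above concrete, here is a minimal sketch of extracting a meta refresh target from a page while ignoring link rel=canonical. It uses only the Python standard library; `MetaRefreshParser` and `find_meta_refresh` are hypothetical names for illustration, not functions from Bridgy or webutil.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Records the URL of the first <meta http-equiv=refresh> tag, if any."""

    def __init__(self):
        super().__init__()
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.refresh_url is not None:
            return
        attrs = dict(attrs)
        if (attrs.get('http-equiv') or '').lower() != 'refresh':
            return
        # The content attribute looks like "0; url=/some/path": delay, then URL.
        match = re.match(r'\s*[\d.]+\s*[;,]\s*url\s*=\s*(.+)',
                         attrs.get('content') or '', re.I)
        if match:
            self.refresh_url = match.group(1).strip().strip('\'"')


def find_meta_refresh(html, base_url):
    """Returns the absolute meta refresh URL in html, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.refresh_url:
        # Resolve relative targets against the page's own URL.
        return urljoin(base_url, parser.refresh_url)
    return None


page = ('<link rel=canonical href="/blog/2014/08/02/permashortlinks/">'
        '<meta http-equiv=refresh content="0; url=/blog/2014/08/02/permashortlinks/">')
print(find_meta_refresh(page, 'http://www.pierre-o.fr/s/9.htm'))
```

A page that only has link rel=canonical yields `None` here, which matches the "follow meta refresh, ignore canonical" behavior described above.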

as for github pages, would the jekyll-redirect-from plugin work? details: https://help.github.com/articles/redirects-on-github-pages , http://joshualande.com/redirect-urls-jekyll-github/

snarfed • Aug 05 '14 18:08

...ah, never mind, i see you're already using jekyll-redirect-from.

snarfed • Aug 05 '14 18:08

hey @pierreozoux! just checking in in case you're still working on this but stuck on something. i'm happy to help if so!

snarfed • Oct 18 '14 20:10

I was wrong, this wouldn't go in follow_redirects, since that only makes HEAD requests, which wouldn't see the HTML contents. From #1322:

> Great! I think it would probably go here, and new test(s) here.

snarfed • Oct 17 '22 13:10

This is what I was thinking so far. I haven't implemented any tests quite yet.

I think the place you directed me first was correct because that is the first point that grabs contents.

slyduda • Oct 18 '22 23:10

I see what you are referring to now here. I will add the same parse_http_equiv function that I added in the webutil package to the follow_redirects function there. I will use two request functions: requests_head first, and then, for non-redirect responses, a requests_get that checks for a meta redirect. As for performance, this means that every attempt to get the webmention target will now have to be parsed with BeautifulSoup. Not sure how much overhead that will add, but I will make those changes.

Edit: I see that it is cached hourly so performance probably isn't a problem!

Edit 2: Added clarification to what I'll add.

slyduda • Oct 19 '22 01:10

Thank you for the details! And for noticing the overhead. Webmention endpoints are indeed cached, but per domain, and Bridgy regularly runs get_webmention_target on 10-20k+ domains. That's currently ~2 qps steady state, so I'm reluctant to switch all of that to GETs and start downloading all of those pages.

How about adding it to discover in https://github.com/snarfed/webutil/blob/main/webmention.py instead? That already downloads the full response body.

snarfed • Oct 19 '22 03:10

I've also asked on #indieweb-dev if anyone else has experience or thoughts: https://chat.indieweb.org/dev/2022-10-19#t1666148871932600

snarfed • Oct 19 '22 03:10

> How about adding it to discover in https://github.com/snarfed/webutil/blob/main/webmention.py instead? That already downloads the full response body.

Sure! I can do that. It will still double the GET requests, though, since an additional search through the content will have to be made.

If this is not feasible due to performance though, I completely understand and am willing to scrap my idea.

slyduda • Oct 19 '22 03:10

Hmm! discover already makes a GET request, and parses the HTML contents if there's no webmention endpoint in the headers, so it shouldn't need any additional HTTP requests or expensive parsing, right?

snarfed • Oct 19 '22 04:10

Right, but after getting that http-equiv from the document head in the parsed response, another request must be made to get the webmention endpoint at the new URL.

For example:

discover(url='https://slyduda.com/p/29835') -> 'https://slyduda.com/post/0238975ur290834572038945' (http-equiv)

discover(url='https://slyduda.com/post/0238975ur290834572038945') -> 'https://webmention.io/slyduda.com/webmention' (webmention meta tag)

And now I am realizing that you're saying that it wouldn't necessarily impact performance much because we 1) try to find the webmention url and if it doesn't exist we 2) look for a http-equiv! Got it!!! :)

slyduda • Oct 19 '22 04:10

Right!

One other thing, this should probably go behind a new boolean kwarg in discover, maybe follow_meta_refresh, that defaults to False.
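The flow agreed on above (look for a webmention endpoint first, fall back to a meta refresh only if none is found and only when opted in) could be sketched roughly as follows. This is not webutil's actual `discover`. `discover_sketch`, its injected `fetch` callable, and `max_hops` are hypothetical, and the sketch skips the HTTP `Link` header check that real discovery also performs.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageScanner(HTMLParser):
    """Collects the first webmention endpoint and first meta refresh URL."""

    def __init__(self):
        super().__init__()
        self.endpoint = None
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get('rel') or '').lower().split()
        if tag in ('link', 'a') and 'webmention' in rels:
            if self.endpoint is None:
                self.endpoint = attrs.get('href')
        elif tag == 'meta' and (attrs.get('http-equiv') or '').lower() == 'refresh':
            match = re.match(r'\s*[\d.]+\s*[;,]\s*url\s*=\s*(.+)',
                             attrs.get('content') or '', re.I)
            if match and self.refresh is None:
                self.refresh = match.group(1).strip().strip('\'"')


def discover_sketch(url, fetch, follow_meta_refresh=False, max_hops=3):
    """Returns (final_url, endpoint); endpoint is None if none was found.

    fetch is a hypothetical callable mapping a URL to its HTML body.
    """
    for _ in range(max_hops):
        scanner = _PageScanner()
        scanner.feed(fetch(url))
        if scanner.endpoint is not None:
            return url, urljoin(url, scanner.endpoint)
        # No endpoint on this page: only now consider a meta refresh.
        if not (follow_meta_refresh and scanner.refresh):
            break
        url = urljoin(url, scanner.refresh)
    return url, None
```

Because the meta refresh is only consulted when no endpoint is found, pages that already advertise an endpoint cost no extra request, and callers who leave `follow_meta_refresh` at its default of `False` see no change in behavior.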

snarfed • Oct 19 '22 05:10

Confirmed working for me after refreshing past response!

slyduda • Oct 24 '22 17:10

Woo, awesome, glad to hear it! Thank you again!

snarfed • Oct 24 '22 17:10

Noticed that webmentions are not being sent to the correct target. This is okay-ish for my use case, but definitely not ideal. I would like to add a third return value to the discover function in the webutil package that gets populated with the last client-side redirect, if any. In Bridgy's discover we can replace target's value with the corrected_target value if it is returned by the discover function. This is a little less elegant than I had hoped for, but would be the easiest way to fix this last little issue.

For added clarification: the target in the original post is slyduda.com/p/4thf0y and the endpoint is found correctly during discover. But since the target is not being changed to the client-side redirected value, the webmention goes to the original target instead of the updated target https://slyduda.com/post/6c9255e6-832e-4632-b7a2-978f30a9046b.

Apologies for not catching this earlier. I got really happy when I saw "Sent!" and I forgot to check the webmention results.

slyduda • Oct 25 '22 03:10

Good follow-up! This can definitely be an awkward surprise, but it's actually intended webmention behavior. As part of receiving a webmention, the receiver fetches the source page and checks that it contains a link to the target. If we change the target, the receiver (webmention.io here) will look for a link that doesn't exist in the source page, so verification will fail. It's up to the receiver to handle shortlinks and other redirecting targets.
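As a rough illustration of why rewriting the target breaks verification, here is a minimal sketch of the receiver-side check, assuming the receiver looks for an exact match on an `href` in the source page. `source_links_to` is a hypothetical helper, not webmention.io's code, and the `https://` scheme on the shortlink is an assumption.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collects every non-empty href attribute value in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'href' and value:
                self.hrefs.add(value)


def source_links_to(source_html, target):
    """Returns True if source_html contains a link whose href is exactly target."""
    collector = _HrefCollector()
    collector.feed(source_html)
    return target in collector.hrefs


# The source post links to the shortlink, not to the redirected URL.
source = '<p>Replying to <a href="https://slyduda.com/p/4thf0y">this post</a></p>'
shortlink = 'https://slyduda.com/p/4thf0y'
redirected = 'https://slyduda.com/post/6c9255e6-832e-4632-b7a2-978f30a9046b'

print(source_links_to(source, shortlink))   # the target as originally sent
print(source_links_to(source, redirected))  # the target after rewriting
```

With the original shortlink as target the check passes. With the rewritten target it fails, because the source page never mentions that URL.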

The spec discusses this briefly:

> The receiver SHOULD check that target is a valid resource for which it can accept Webmentions. This check SHOULD happen synchronously to reject invalid Webmentions before more in-depth verification begins. What a "valid resource" means is up to the receiver. For example, some receivers may accept Webmentions for multiple domains, others may accept Webmentions for only the same domain the endpoint is on.

For webmention.io, I expect it would return these wms with the shortlink targets, and you'd have to merge them on your end. Some background in https://github.com/aaronpk/webmention.io/issues/92.

snarfed • Oct 25 '22 15:10

(Oh btw, with your client side redirect change, the requests.Response that discover returns will now contain the final redirected URL, so users can get it if they do need it.)

snarfed • Oct 25 '22 15:10

> For webmention.io, I expect it would return these wms with the shortlink targets, and you'd have to merge them on your end. Some background in https://github.com/aaronpk/webmention.io/issues/92.

Thank you for the background. This makes sense. From what I'm gathering, for what I described to work, the whole spec would have to account for client-side redirects, which doesn't seem feasible. It works well enough now, though, and I can easily provide two target values when I fetch the webmentions from webmention.io!
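Providing two target values in one fetch might look like the sketch below. It assumes webmention.io's mentions API accepts a repeated `target[]` query parameter and that `https://webmention.io/api/mentions.jf2` is the right endpoint. Both should be checked against the webmention.io README before relying on them. `mentions_url` is a hypothetical helper, and the `https://` scheme on the shortlink is assumed.

```python
from urllib.parse import urlencode


def mentions_url(targets, base='https://webmention.io/api/mentions.jf2'):
    """Builds a query URL asking for mentions of every URL in targets.

    Assumption: the API accepts one target[] parameter per URL.
    """
    query = urlencode([('target[]', target) for target in targets])
    return f'{base}?{query}'


url = mentions_url([
    'https://slyduda.com/p/4thf0y',  # shortlink target
    'https://slyduda.com/post/6c9255e6-832e-4632-b7a2-978f30a9046b',  # full post URL
])
print(url)
```

Querying both URLs together keeps the merge on the site owner's side, as suggested above, without changing what target Bridgy sends.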

Thank you so much for the help and guidance! I really appreciate it.

slyduda • Oct 25 '22 16:10

You're welcome! Glad this is still working out ok. Feel free to jump into chat or https://github.com/w3c/webmention/issues to discuss more, eg https://github.com/w3c/webmention/issues/103 or others in https://github.com/w3c/webmention/issues?q=is%3Aissue+redirect !

snarfed • Oct 25 '22 16:10