jekyll-webmention_io
Bad URLs cached too early
I found yet another bug while creating issue #128. When I tried to give you a URL to Bridgy that does not have a webmention target defined (such as the homepage or the about page), it got cached as a bad URL, and from then on all syndication URLs to Bridgy were ignored.
So I had to replace every dot “.” in the URLs with an underscore to make an invalid domain, so that the syndication URLs would not be ignored.
Maybe the cache should be written only after all webmentions in the batch have been sent, not during sending?
Or maybe this is the only URL that causes the issue, in which case it might be easier to whitelist it in code?
(Originally published at: https://www.pawelmadej.com/issue/jekyll-webmentions_io-new-issue/)
Can you provide the URLs that failed the check and one that would pass? The whole "bad URLs" approach is intended to speed up the Jekyll build cycle, so I want to find the right balance.
Make a post with the content below, replacing the underscores “_” in the URLs with dots “.”:
---
title: examples
---
This url will trigger bad url cache [bridgy](https://brid_gy)
This url will hit bad url cache and will be ignored [](https://brid_gy/publish/github)
(Originally published at: https://www.pawelmadej.com/issue/jekyll-webmentions_io-129-comment-1/)
So I have a couple of thoughts on this one, as it's bitten me a couple of times as well.
There are a few ways I can think of that a webmention can fail:
- The target doesn't publish a webmention endpoint. Unless the site operator implements webmention, future attempts will always fail.
- The site publishes a webmention endpoint, but it returned a 400-class error. For example, brid.gy will return an error if a duplicate webmention is sent.
- The site publishes a webmention endpoint, but there was a network failure or 500-class error.
The policy for dealing with these probably needs to be different!
In the first case, I would argue it makes sense to permanently mark the target site as bad. If, in the future, the site enables webmention, the user could remove the flag manually.
In the second case, I would argue the URL should not be marked bad at all. Instead, the user needs to resolve the issue on a case-by-case basis (in the brid.gy example, by removing the webmention or renaming the post to generate a new, unique URL).
The third case probably warrants temporary caching over some short interval until the failure is resolved.
Finally, for those cases that fall through the cracks, a whitelist/blacklist mechanism would be very helpful, so that the user can permanently either enable or disable webmentions for sites.
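To make the three cases above concrete, here is a minimal Ruby sketch of how a single delivery attempt might be sorted into them. The method name `classify_failure`, its arguments, and the `:ok` result are hypothetical and are not taken from the plugin; the labels `:unsupported`, `:error`, and `:failure` simply mirror the keys proposed later in this thread.

```ruby
# Hypothetical helper (not part of the plugin): classify one delivery attempt.
#
# endpoint - the discovered webmention endpoint URL, or nil if the target
#            does not advertise one
# status   - the Integer HTTP status returned by the endpoint, or nil if the
#            request never completed (DNS error, timeout, ...)
def classify_failure(endpoint, status)
  return :unsupported if endpoint.nil? # case 1: no endpoint published
  return :failure if status.nil?       # case 3: network failure

  case status
  when 200..299 then :ok      # delivered, nothing to record
  when 400..499 then :error   # case 2: the endpoint rejected this request
  else :failure               # case 3: 500-class (or any other) response
  end
end
```

Keeping the classification separate from the policy means each class can later be mapped to its own handling (ban, ignore, retry) without touching the code that sends the webmention.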
And in particular, in #131, the pull request for syndication, it would make a lot of sense to whitelist those hosts so they're never marked bad.
Thoughts?
I'd certainly be happy to put together a PR that implements this if it seems like it makes sense.
My idea is to do something like a mail service does: a few tries, then fail. This could work with an increasing delay before each next try and, after for example five tries, add the URL to the blacklist.
The main idea is to blacklist the URL, not the whole domain, as happens now for the brid.gy domain. If you mention it somewhere, you can block the whole service in the plugin.
Maybe it could work like this:
1st try, 2nd try after 1h, 3rd after 12h, 4th after 2 days, 5th after 7 days, and then add the URL to the ban list.
This would cover situations 1 and 3.
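A rough Ruby sketch of that schedule, assuming the plugin stored an attempt count and the time of the last attempt per URL (it may not; `RETRY_DELAYS_HOURS` and `next_attempt_at` are hypothetical names used only for illustration):

```ruby
# Delays before the 2nd, 3rd, 4th and 5th try: 1h, 12h, 2 days, 7 days.
RETRY_DELAYS_HOURS = [1, 12, 48, 168].freeze

# Hypothetical helper (not part of the plugin).
#
# attempts     - number of tries already made for this URL (1 or more)
# last_attempt - Time of the most recent try
#
# Returns the Time at which the next try is allowed, or nil once all five
# tries are used up and the URL should go on the ban list.
def next_attempt_at(attempts, last_attempt)
  raise ArgumentError, "attempts must be >= 1" if attempts < 1

  delay_hours = RETRY_DELAYS_HOURS[attempts - 1]
  return nil if delay_hours.nil? # schedule exhausted: ban the URL

  last_attempt + delay_hours * 3600
end
```

A build that runs before the returned time would skip the URL; a build that runs after it would try again and increment the counter.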
For the 2nd case this could also be done, as I have a few posts that are repeatedly submitted to brid.gy and are marked as duplicates. As I deploy my site via CircleCI, I will not see these errors unless I look at the webmention stage of a particular build. I think that failing after some time is an acceptable solution, or a log could be generated and kept as a build artifact in CI to look at.
Then something could also be added like trying to resend the whole blacklist every 1-2 months, because the situation can change and some sites could implement webmention in that period, so they would still receive what we mentioned.
I hope my opinion helps a bit.
Well, the most flexible thing would be, for each of the "kinds" of error (no endpoint, 400-class request error, 500-class server error/network issue), allow the user to specify a policy.
I do think applying the policy to the whole domain makes sense, since that ensures the plugin isn't wasting time testing a bunch of URLs under the same failing domain. But you could use whitelisting with regexes for the "I accidentally banned brid.gy" problem (though, as I mentioned, I'd want that done automatically for syndication targets).
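As an illustration of how a regex whitelist could override a domain-wide ban, here is a small Ruby sketch. The method `skip_uri?` and the shape of its arguments (a list of banned host names and a list of regex source strings) are assumptions made for this example, not the plugin's actual data structures.

```ruby
require "uri"

# Hypothetical helper (not part of the plugin).
#
# uri          - the target URL as a String
# banned_hosts - Array of host names whose URLs are normally skipped
# whitelist    - Array of regex source strings; a match always wins
#
# Returns true if the webmention to this URL should be skipped.
def skip_uri?(uri, banned_hosts, whitelist)
  # Whitelisted URLs are never skipped, even under a banned host.
  return false if whitelist.any? { |pattern| Regexp.new(pattern).match?(uri) }

  banned_hosts.include?(URI.parse(uri).host)
end
```

With `banned_hosts = ["brid.gy"]` and `whitelist = ["https://brid.gy/publish/.*"]`, the Bridgy homepage would still be skipped while the `publish` syndication targets would go through.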
So I'm thinking something like this:
bad_uri_policy:
  unsupported:
    policy: ban
  error:
    policy: ignore
  failure:
    policy: retry
    retry_delay: [ 1, 12, 48, 120 ]
    max_retries: 5
  whitelist:
    - "https://brid.gy/publish/.*"
I could see eliding the "policy" line when it's a simple rule, i.e.:
bad_uri_policy:
  unsupported: ban
  error: ignore
  failure:
    policy: retry
    retry_delay: [ 1, 12, 48, 120 ]
    max_retries: 5
  whitelist:
    - "https://brid.gy/publish/.*"
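Supporting both the long and the short form is cheap if the shorthand is expanded when the config is read. The sketch below is only an illustration of that idea; `normalize_policy` is a hypothetical helper, and the inline YAML repeats the short-form proposal above (without the whitelist).

```ruby
require "yaml"

# Hypothetical helper (not part of the plugin): expand the shorthand
# `unsupported: ban` into the long form `{ "policy" => "ban" }`.
def normalize_policy(entry)
  entry.is_a?(Hash) ? entry : { "policy" => entry.to_s }
end

config = YAML.safe_load(<<~YAML)
  bad_uri_policy:
    unsupported: ban
    error: ignore
    failure:
      policy: retry
      retry_delay: [ 1, 12, 48, 120 ]
      max_retries: 5
YAML

# Every entry now has the same shape, whichever form the user wrote.
policies = config["bad_uri_policy"].transform_values { |v| normalize_policy(v) }
```

After this step the rest of the code only ever has to look at `policies[kind]["policy"]`.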
Maybe with a "default" fallthrough, and default retry settings so that, if not overridden, the current out-of-the-box policy would be equivalent to:
bad_uri_policy:
  default: retry
With the existing "cache_bad_uris_for" setting being a synonym for:
bad_uri_policy:
  default:
    policy: retry
    retry_delay: [ <cache_bad_uris_for * 24> ]
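One way that synonym could be applied when the settings are loaded is sketched below. It assumes `cache_bad_uris_for` is expressed in days and `retry_delay` in hours, as the `* 24` above suggests; `effective_policy` is a hypothetical name, and letting an explicit `default` entry take precedence over the legacy setting is a design choice made for this example.

```ruby
# Hypothetical helper (not part of the plugin): derive the policy hash,
# translating the legacy `cache_bad_uris_for` setting when no explicit
# `default` policy has been configured.
def effective_policy(settings)
  policy = (settings["bad_uri_policy"] || {}).dup

  if !policy.key?("default") && settings.key?("cache_bad_uris_for")
    policy["default"] = {
      "policy" => "retry",
      # Assumption: legacy value is in days, retry_delay is in hours.
      "retry_delay" => [settings["cache_bad_uris_for"] * 24]
    }
  end

  policy
end
```

An existing site that only sets `cache_bad_uris_for` would then get an equivalent retry policy without any config changes.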
And I completely agree, a log for errors, retries, and bans would be incredibly helpful to sort out what's going on. Though that might be better as a separate PR, to keep the work focused.
So we would have 3 policies: ban, ignore, and retry?
The proposed defaults look good to me for a no-config setup while allowing some overrides.
Yep, making this a separate PR would make it easier to follow.
Regardless of the changes, we need to be sure not to break backward compatibility for custom configurations that could be in use. Maybe some deprecation warnings during the build or in the logs saying that the config should be updated?
Alright, I've taken a run at an implementation of this proposal.
I've done a fair bit of testing in a playpen and on my own blog and it seems to be doing what I need.
This is designed in such a way that it's totally backward compatible with an existing installation. The old cache_bad_uris_for setting continues to work, and the bad URI cache entries are handled transparently by assuming the URIs are bad because webmention wasn't supported. It's not perfect, but I suspect it's the right assumption for most entries in the file.
The result is that, post-upgrade, with no changes, the plugin works the same way it always did.
I added a bunch of extra logging as part of the commit, so it's definitely more chatty. It might be desirable to remove that, but for now I find it useful.
Edit: As an aside, I'm fairly inexperienced in Ruby, so apologies if the code or style aren't terribly idiomatic...
BTW, as an aside, @nysander if there are individual posts where you're pushing duplicate webmentions to brid.gy, you can resolve that by just opening up <cache_dir>/webmention_io_outgoing.yml, finding the entries, and changing the value from "false" to "{}" or an empty string. That'll mark the webmention as sent and the plugin will subsequently ignore it.
And now I understand the other problem you have (i.e., wanting Jekyll to give up on a particular webmention after a while). A separate, source/target specific retry limit would also be really nice for that use case and would be pretty easy to cook up...
And I've also built a quick and dirty implementation of a new max_attempts setting that causes the plugin to give up on individual webmentions. And it's very late, so if it's not well thought through... in my defense, I'm not thinking that straight at this point. :)
Well, for me it is noon now 😉 I will check this out. I have very little, near zero, Ruby skills, but maybe I will understand some of the patterns. More comments to come. Thanks for your work.
I read through everything in this thread, but perhaps I missed something. What are the implications of each of the policies in terms of when (if ever) they are checked again, etc.? Could you do a basic breakdown in a list? This will be key for documenting the new configuration options.
Hmm, I tried to give a... passable... overview of the full feature in the doc updates here:
https://github.com/fancypantalons/jekyll-webmention_io/blob/cbeb16c9247850ea84589ca9ced501e405cb4c26/docs/bad_uri_policy.md
I'm more than happy to clarify the material there if it's not sufficiently clear!
Thank you for this! I'll be on vacation through December 6th but can take a look when I return.
I kept having Ruby/Jekyll issues when upgrading my Mac, so I have moved off of Jekyll and will not be working on this project anymore, going forward. I am going to flag this as won’t fix, but leave it open in case someone else wants to pick up the project from here.