Add fault-tolerance for cache system errors
Fixes #511.
This PR adds several features designed to increase fault-tolerance in case the cache infrastructure (Redis, Memcached, etc.) fails. It introduces a few new configs; the intention is to provide a "sensible default" that will work for 99% of real-world Rails apps, while adding customizability for the DIY crowd. A rough configuration sketch follows the feature list below.
- Catch and handle all errors once per request.
- Remove the `rescuing` blocks from the store proxies; rescuing per-method (read, write, increment) is bad because (a) it may result in undefined behavior, and (b) it will trigger repeated connection timeouts if your cache is down, e.g. N * M * timeout latency where N is the number of Rack::Attack metrics and M is the cache requests per metric.
- Add `Rack::Attack.allowed_errors` config. This defaults to `Dalli::DalliError` and `Redis::BaseError`.
- Add `Rack::Attack.failure_cooldown` config. This temporarily disables Rack::Attack after an error occurs (including allowed errors), to prevent cache connection latency. The default is 60 seconds.
- Add `Rack::Attack.error_handler`, which takes a Proc for custom error handling. It's probably not needed, but there may be esoteric use cases for it. You can also use the shortcut symbols `:block`, `:throttle`, and `:allow` to respond to errors using those behaviors.
- Add `Rack::Attack.calling?` method, which uses `Thread.current` (or RequestStore, if available) to indicate that Rack::Attack code is executing. This lets a Rails cache error handler re-raise the error for Rack::Attack to handle (i.e. if the error occurred while Rack::Attack was executing). Refer to the readme.
- Add a "Fault Tolerance & Error Handling" section to the readme which covers all of the above.
- A bit of refactoring of the main `Rack::Attack` class; code is now split into more methods.
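For illustration, here is a rough sketch of how an app might wire these up. The setting names are the ones proposed in this PR (not yet released and subject to change), and the Redis cache store `error_handler` part is my assumption about how `Rack::Attack.calling?` would typically be used:

```ruby
# config/initializers/rack_attack.rb -- sketch only; setting names are the ones
# proposed in this PR and may change before merge.

# Errors Rack::Attack should swallow instead of re-raising. Assumed to be an array;
# per the PR description it defaults to Dalli::DalliError and Redis::BaseError.
Rack::Attack.allowed_errors += [Errno::ECONNREFUSED]

# After any cache error, disable Rack::Attack entirely for this many seconds so a
# downed cache doesn't add a connection timeout to every request (proposed default: 60).
Rack::Attack.failure_cooldown = 60

# Optional custom handling: a Proc, or one of the shortcut symbols :block, :throttle, :allow.
Rack::Attack.error_handler = ->(error) do
  Rails.logger.warn("Rack::Attack cache error: #{error.class}: #{error.message}")
end

# config/environments/production.rb -- assumption about how Rack::Attack.calling?
# would be wired into the Rails RedisCacheStore error handler, so that errors raised
# while Rack::Attack is executing are re-raised for Rack::Attack itself to handle.
Rails.application.configure do
  config.cache_store = :redis_cache_store, {
    url: ENV["REDIS_URL"],
    error_handler: ->(method:, returning:, exception:) do
      raise exception if Rack::Attack.calling?
    end
  }
end
```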
In addition, I have added RSpec Mocks as a development dependency, specifically for tests related to raising/handling errors; I find RSpec Mocks a heck of a lot easier to work with than MiniTest. However, if there are any MiniTest gurus out there who can suggest a pure-MiniTest approach for my tests, I'd be happy to change it. Note that RSpec Mocks is added by itself, without requiring all of RSpec.
Lastly, I'd like to give a special shoutout to Amazon MemoryDB for Redis for causing the failure that inspired me to write this PR.
@grzuy can you please review this PR?
FYI I've been running this in production now for several weeks with no issue.
@grzuy any consideration to merge this?
@timdiggins I've updated the PR in response to your comments. Thanks!
@grzuy any chance to merge this? I've been using it in production for 6 months.
@grzuy please review.
I love the idea of this PR. I just have one concern about it being enabled by default. If an attacker is able to influence a system to enter the bypass mode, they might be able to compromise a critical security feature in some cases.
Consider if someone uses rack-attack to throttle requests to their sign in form. They also use the same Redis cache for other application features. If an attacker is able to DOS the system so that Redis fails and rack-attack enters bypass mode, they may be able to also bypass the throttling on the sign in form.
Personally, I'd classify rack-attack as a security component, and it violates the "principle of least surprise" for a security system to automatically disable itself when under threat. I do think there are use cases where this is useful, but it should probably be something that users need to enable explicitly once they understand what it does.
Of course, I could be wrong, and maybe there are mitigating factors here I don't understand. I'm happy to be corrected. Thanks either way!
@justinhoward I'm fine to disable this behavior by default in this PR. If I do that, can someone please merge?
I am very interested in this too; I'm running into problem cases similar to those described in the original issue -- if your store is misbehaving, it can wind up bringing your whole app down, which is pretty disastrous. The ability to configure rack-attack to, in some cases, just ignore a cache failure and proceed with app processing seems crucial.
Actually, now trying to trace what's going on in my app, I am realizing that when using Memcached/Dalli, similar behavior may actually already be built into Dalli? E.g. `down_retry_delay` reminds me of what I was looking for with `failure_cooldown` here...
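For reference, a minimal sketch of the Dalli client option being referred to (option names are Dalli's; the values here are arbitrary):

```ruby
require "dalli"

# Dalli can mark a memcached server as "down" after failures and wait
# `down_retry_delay` seconds before retrying it, which limits repeated
# connection-timeout latency at the client level -- similar in spirit to
# the failure_cooldown proposed in this PR, but only for memcached.
cache = Dalli::Client.new(
  "localhost:11211",
  socket_timeout:   1,  # seconds to wait on a socket operation
  down_retry_delay: 60  # seconds to wait before retrying a server marked down
)
```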
But I think you really do want to suspend rack-attack at this higher level notwithstanding... I realize I am not sure I understand what's going on.
@grzuy please review.
@grzuy any chance of getting this into master? I just had my production app nearly taken out by a bad memcached!
Hi @johnnyshields, I think that the problem you are solving is legit, thanks for your contribution 🙌
As you said, this PR adds several features, which IMO makes it harder for maintainers to accept and merge. Maybe if we keep this PR as simple as possible, and separate the other features/improvements (like allowed_errors, failure_cooldown, calling?, etc.) into their own separate PRs, it'd be easier to discuss each of them and get them approved. So I'm wondering: what would be the simplest solution that would solve the main issue while also taking into consideration @justinhoward's comment?
What comes to my mind is something like:
- Catch one error per request (to avoid the N*M timeouts)
- Allow devs to handle this exception (by default re-raise it to keep existing behavior)
Maybe this could be done by keeping the existing catch statements and re-raising a new custom tailored exception Rack::Attack::CacheError that bubbles up and is caught in the main method with something like
```ruby
rescue Rack::Attack::CacheError => e
  Rack::Attack.exception_handler.call(e)
  @app.call(env)
end
```
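Under that proposal, app-side usage might look something like the following sketch (the `exception_handler` name comes from the snippet above and is hypothetical; the default would presumably re-raise to keep existing behavior):

```ruby
# Hypothetical: report cache failures but let the request proceed, since the
# rescue above already falls through to @app.call(env).
Rack::Attack.exception_handler = ->(error) do
  Rails.logger.warn("rack-attack cache failure: #{error.message}")
end
```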
My question is: if we do that, will it have solved the issue you experienced and the ones described in #511? What else would be missing? Does the solution have any drawbacks in terms of behavior, security, or any other aspect?
@santib I am happy to try to break this into a series of smaller PRs if I can get your commitment to review and merge in a timely manner.
FWIW I've served billions of requests at TableCheck in the last ~1 year using this code, with no issues.
> I just had my production app nearly taken out by a bad memcached!
@henryaj welcome to the club!
@santib
> Maybe this could be done by keeping the existing catch statements.. and introduce a custom tailored Rack::Attack::CacheError
That would be possible, but since the internal Dalli/RedisProxy classes would now be raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference. TBH I think it is desirable to have the original driver-specific errors and messages, rather than catching and raising a new one. By removing all the silly `rescuing { }` blocks the code is both cleaner and has (slightly) better performance.
Hey, just in case, no, I'm not a maintainer.
> That would be possible, but since the internal Dalli/RedisProxy classes would now be raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference.
Agree, the difference is mostly for the maintenance of this gem, so that each store_proxy continues to encapsulate the specifics of each store.
Leaving here an example of what I'm referring to: https://github.com/rack/rack-attack/compare/main...santib:rack-attack:system-hardening-refactor
Apart from the introduced Rack::Attack::CacheError, the changes are:
- Letting the errors bubble up from the cache stores and catching them in the main flow allows us to avoid the N*M timeouts (just one) and solves the denial-of-service issue.
- The `cache_error_handler` does nothing by default (keeping existing behavior) but allows devs to notify/log about it.
Yep I understand what you are proposing. IMHO that's actually more complex than my PR. My preference is to keep things as simple as possible and just use the native errors--less code, less bloat, no new not-yet-seen error types, etc.
That being said I will defer to the maintainers whether they prefer the bloat. If a maintainer can chime in here I'd appreciate it.
@johnnyshields if you fork rack-attack, I'll use your fork! It does not appear to me that current maintainers of rack-attack are interested in improvements.
@jrochkind it will be the second gem I've forked this week 🤣 OK, give me a few days; I'll do it over the weekend.
> It does not appear to me that current maintainers of rack-attack are interested in improvements
I will note that over on #578 one of the maintainers has been engaging with me, on that PR I submitted last year. He's had some great feedback, and some progress has been made just in this past week.
I'm not sure I'd give up on them just yet!
@jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer....
@santib, if you give detailed feedback and requests for changes that sound like they are coming from a maintainer, and people invest their time in discussing and making those changes thinking that, if they meet your approval, their PR will get merged by you... they may be dismayed when they find out you actually have no influence over that process...
@jamiemccarthy I'm sorry if I left the wrong impression 🙏
@jrochkind Don't worry, I'll make sure to always clarify I'm not a maintainer from now on 👍
What I do think is that if we make PRs simpler they'll eventually get reviewed by maintainers. But of course you don't have to trust me.
Oops! My misunderstanding. Well, I'm not dismayed — @santib I do think your feedback has been helpful!
In any case the repo has had 8 PRs merged in October, so it doesn't seem like it's abandoned.
Thank you @johnnyshields for your work and patience on this PR.
And thank you @timdiggins, @justinhoward, @jrochkind, @henryaj, @santib and @jamiemccarthy for all the helpful comments and feedback.
Agree with @justinhoward's and @santib's comments in that we would want to narrow this down to the smallest changeset possible that addresses your issue in production, as an opt-in setting. Given the considerable number of rack-attack users and the maturity of the gem, I would prefer that mostly nothing changes for existing users who don't opt in.
Thanks again.
@grzuy Appreciate the comment. Let me try to break this into a series of smaller PRs.
One thing that will be inevitable is that we need to raise an error from the Redis/DalliProxy classes up to the parent level. Currently errors are being swallowed inside the proxy classes themselves. Two options here:
- @santib has proposed to introduce a new "wrapper" error class such as Rack::Attack::CacheError
- I am proposing to use the existing Dalli/Redis errors, which may carry more info and are easier to work with IMHO. It makes the code simpler, and we can always go @santib's way later if there is a strong reason to do so (both options are sketched below).
Do you have a strong opinion on which way we should go?
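To make the two options concrete, here's a rough sketch of what each would look like inside a store proxy (method bodies and names are illustrative, not the actual proxy code):

```ruby
# Option 1 (wrapper error): the proxy rescues the driver error and re-raises a
# gem-specific error, so callers only ever see Rack::Attack::CacheError.
def read(key)
  @client.get(key)
rescue Dalli::DalliError => e
  raise Rack::Attack::CacheError, e.message
end

# Option 2 (native errors): the proxy does not rescue at all; the original
# Dalli::DalliError / Redis::BaseError bubbles up and is handled once at the
# top of the middleware (e.g. checked against an allowed_errors list).
def read(key)
  @client.get(key)
end
```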
> @jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer....
Just for what it's worth, he is a maintainer now: https://github.com/rack/rack-attack/discussions/637.