Rate limit triggers
At the moment, triggers make it incredibly easy to DoS your ledger. The two most common causes that I’ve seen are
- A buggy trigger that keeps submitting commands that fail either due to contention or actual failures.
- A trigger that (e.g. because it has been down for a while and only just started up) has a large backlog to catch up with and ends up submitting potentially thousands of commands in the first rule execution.
We have a few different options:
- Limit the number of commands in flight. This should be relatively easy and helps at least with the second cause.
- Back off exponentially on failures, in particular on backpressure but probably not limited to that.
I think we should also revisit the current back pressure handling in triggers. I’m not entirely sure it’s working properly and it’s definitely not tested.
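As a rough sketch (not actual trigger-runner code; all names here are hypothetical), the two options above could be combined into a small piece of state that caps in-flight commands and backs off exponentially on failed submissions:

```scala
import scala.concurrent.duration._

// Hypothetical sketch of the two options above (none of these names exist in
// the actual trigger runner): cap the number of commands in flight and back
// off exponentially when submissions fail.
final case class RateLimiterState(
    inFlight: Int,          // commands submitted but not yet completed
    backoff: FiniteDuration // delay to wait before the next submission
)

object RateLimiter {
  val maxInFlight = 100                           // should be configurable
  val initialBackoff: FiniteDuration = 100.millis
  val maxBackoff: FiniteDuration = 30.seconds

  // Only submit a new command if we are below the in-flight cap.
  def maySubmit(s: RateLimiterState): Boolean =
    s.inFlight < maxInFlight

  def onSubmit(s: RateLimiterState): RateLimiterState =
    s.copy(inFlight = s.inFlight + 1)

  // A failed completion frees its in-flight slot and doubles the delay, up to a cap.
  def onFailure(s: RateLimiterState): RateLimiterState =
    s.copy(inFlight = s.inFlight - 1, backoff = (s.backoff * 2).min(maxBackoff))

  // A successful completion frees its slot and resets the delay.
  def onSuccess(s: RateLimiterState): RateLimiterState =
    s.copy(inFlight = s.inFlight - 1, backoff = initialBackoff)
}
```

Whatever shape this ends up taking, the in-flight cap and the backoff bounds should be configurable rather than hardcoded, which is reflected in the acceptance criteria below.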
Acceptance Criteria
- [ ] a configurable rate limiting mechanism for triggers is implemented and tested
- [ ] users can see when rate limiting kicks in from the logs
- [ ] documentation is provided to clarify:
  - [ ] what may cause rate limiting to kick in (trigger anti-patterns, backlog catch-up, etc.)
  - [ ] how to diagnose it
  - [ ] how to solve it
My current activities: https://github.com/DACH-NY/canton/issues/9832
Exponential backoff on the trigger side would definitely help a lot. I would also consider adding a "slow start", so that triggers start submitting at a low rate (as opposed to starting at the maximum rate).
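To illustrate the "slow start" idea under the same assumptions (hypothetical names, not actual trigger code): the trigger starts submitting at a conservative rate and only ramps up while its submissions keep succeeding, dropping back to the initial rate on failure.

```scala
// Hypothetical "slow start" sketch (none of these names exist in the trigger
// runner): ramp the submission rate up multiplicatively while submissions
// succeed, and drop back to the initial rate on the first failure.
object SlowStart {
  val minRate = 1.0    // commands per second right after startup
  val maxRate = 100.0  // ceiling; should be configurable
  val initial: SlowStart = SlowStart(minRate)
}

final case class SlowStart(ratePerSecond: Double) {
  import SlowStart._
  // A successful submission lets us double the rate, up to the ceiling.
  def onSuccess: SlowStart = copy(ratePerSecond = math.min(ratePerSecond * 2, maxRate))
  // A failure (e.g. backpressure) sends us back to the conservative rate.
  def onFailure: SlowStart = SlowStart(minRate)
}
```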
Canton already has resource limits. I would not call them perfect, but if they are insufficient, I would prefer if we could improve Canton's resource management instead of creating resource management for triggers.
https://docs.daml.com/canton/usermanual/console.html#resources-set-resource-limits
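For reference, setting those limits from the Canton console looks roughly like the snippet below, based on the page linked above. The exact `ResourceLimits` field names and defaults differ between Canton versions, so treat this as an illustration rather than a copy-paste recipe.

```scala
// Canton console (illustrative; check the linked docs for the exact fields
// available in your Canton version):
participant1.resources.set_resource_limits(
  ResourceLimits(
    maxDirtyRequests = Some(100), // cap on requests being processed concurrently
    maxRate = Some(200)           // cap on accepted command submissions per second
  )
)
```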
Yeah I think we should introduce just enough that in combination with Canton’s resource management we don’t DoS the ledger.
@cocreature I see you assigned this to the Language team. Just to confirm, do you think this is something that should/could be handled as part of triggers? Or should it rather be handled as part of the trigger service?
Given the current implementation, this has to be handled within triggers not within the trigger service. So unless you’re planning to significantly change how the two interact, it’s language team.
@cocreature I added the acceptance criteria for this ticket as we talked about yesterday. Does this make sense?
Makes sense, I’ve extended it by two points:
- The rate limiting should be configurable, i.e., don’t hardcode 100 or whatever for some limit somewhere. This is going to come back to bite us later.
- It should be logged when rate limiting kicks in. Arguably this is part of the "how to diagnose" documentation point, but it seems useful to spell it out explicitly.
Totally agree with both points. I left logging implicit precisely because I can't see how you could document diagnosing this without some form of observability. I also don't want to dictate exactly how observability is implemented, but if you want to make the requirement explicit, I'll leave it up to you and @remyhaemmerle-da to add it to the acceptance criteria.
FYI - This issue is important. Within the last few days, two clients ran into it, leading to quite some debugging / support effort: DBAG triggers killed Canton (resource limits were turned off), and Xpansiv triggers killed Canton in Hub.
The fix has been reverted by #15700