More gracefully handle situations with many events & running into rate limits
Problem Statement
👋 Hi and first of all thank you for maintaining a wonderful client library for Elixir 💜
TLDR;
- Many exceptions to report
- run over quota --> reporting fails
- Sentry keeps using ~8 gigabytes of memory trying to report them over days
(I decided to call it an improvement, depending on PoV some may call it bug 😅 )
We ran into an incident a few days ago that generated a lot of exception events, which pushed us over our Sentry quota (despite spike protection being enabled, but that's also fair!).
From what I'm seeing, all the events were retained in memory and the SDK continuously tried to submit them to Sentry. This kept failing as we were over the quota, so we also had a bunch of these log messages:
[warning] Failed to send Sentry event. Sentry failed to report event: Sentry responded with status 429 - Too Many Requests and the SDK exhausted the configured retries
The symptom on the system was a bunch of Sentry.Transport.Sender processes with very high reduction counts, significant memory consumption, and big message queues (as reported by :observer_cli):
|1 |<0.113277224| 111.0986 MiB |Elixir.IEx.Evaluator:init/5 | 9821477| 0 |Elixir.IEx.Evaluator:loop/1 |
|2 |<0.3783.0> | 20.5259 MiB |'Elixir.Sentry.Transport.Sender' | 117164527| 558 |Elixir.Process:sleep/1 |
|3 |<0.3809.0> | 18.3961 MiB |'Elixir.Sentry.Transport.Sender' | 125430590| 1810 |Elixir.Process:sleep/1 |
|4 |<0.3827.0> | 18.3959 MiB |'Elixir.Sentry.Transport.Sender' | 102370638| 1821 |Elixir.Process:sleep/1 |
|5 |<0.3793.0> | 18.3955 MiB |'Elixir.Sentry.Transport.Sender' | 103264516| 1809 |Elixir.Process:sleep/1 |
|6 |<0.3772.0> | 18.3935 MiB |'Elixir.Sentry.Transport.Sender' | 102460005| 1776 |Elixir.Process:sleep/1 |
|7 |<0.3825.0> | 18.3913 MiB |'Elixir.Sentry.Transport.Sender' | 120544828| 1763 |Elixir.Process:sleep/1 |
|8 |<0.3800.0> | 18.3869 MiB |'Elixir.Sentry.Transport.Sender' | 105581947| 1702 |Elixir.Process:sleep/1 |
|9 |<0.3789.0> | 18.3837 MiB |'Elixir.Sentry.Transport.Sender' | 123818720| 1676 |Elixir.Process:sleep/1 |
|10 |<0.3778.0> | 18.3829 MiB |'Elixir.Sentry.Transport.Sender' | 118361900| 1655 |Elixir.Process:sleep/1 |
|11 |<0.3817.0> | 18.3754 MiB |'Elixir.Sentry.Transport.Sender' | 121602601| 1559 |Elixir.Process:sleep/1 |
|12 |<0.3769.0> | 18.3735 MiB |'Elixir.Sentry.Transport.Sender' | 112488249| 1551 |Elixir.Process:sleep/1 |
|13 |<0.3794.0> | 15.7664 MiB |'Elixir.Sentry.Transport.Sender' | 110978452| 1880 |Elixir.Process:sleep/1 |
|14 |<0.3802.0> | 15.7609 MiB |'Elixir.Sentry.Transport.Sender' | 116827056| 1824 |Elixir.Process:sleep/1 |
|15 |<0.3804.0> | 15.7569 MiB |'Elixir.Sentry.Transport.Sender' | 112633395| 1767 |Elixir.Process:sleep/1 |
|16 |<0.3767.0> | 15.7569 MiB |'Elixir.Sentry.Transport.Sender' | 108640217| 1778 |Elixir.Process:sleep/1 |
|17 |<0.3826.0> | 15.7494 MiB |'Elixir.Sentry.Transport.Sender' | 119471892| 1682 |Elixir.Process:sleep/1 |
|18 |<0.3812.0> | 15.7478 MiB |'Elixir.Sentry.Transport.Sender' | 110992238| 1657 |Elixir.Process:sleep/1 |
|19 |<0.3818.0> | 15.7443 MiB |'Elixir.Sentry.Transport.Sender' | 111055539| 1629 |Elixir.Process:sleep/1 |
|20 |<0.3797.0> | 15.7374 MiB |'Elixir.Sentry.Transport.Sender' | 105904350| 1527 |Elixir.Process:sleep/1 |
|21 |<0.3801.0> | 15.7369 MiB |'Elixir.Sentry.Transport.Sender' | 120051518| 1541 |Elixir.Process:sleep/1 |
|22 |<0.3819.0> | 15.7360 MiB |'Elixir.Sentry.Transport.Sender' | 107081411| 1530 |Elixir.Process:sleep/1 |
(This is only an excerpt from one of the less affected pods.)
At this point the incident was ~2.5 days in the past, but all these processes never seemed to succeed in submitting the events and hence kept them around. You can clearly see the impact in the pod that was affected the worst:
I omitted the concrete time frame but it's ~2.5 days. The graphed metric is beam_memory_allocated_bytes.
You can see the memory going up, metric reporting malfunctioning due to the incident, and then the memory staying at a consistently high level until it suddenly came crashing down. The moment it came crashing down is when I ran Supervisor.terminate_child(Sentry.Supervisor, Sentry.Transport.SenderPool) on the system, so I'm 100% sure this memory increase is directly related to these Sentry processes.
Admittedly those were exceptional circumstances, but I wouldn't expect Sentry to potentially contribute to an incident/outage by hogging this much memory while unsuccessfully trying to resubmit exceptions 😅
Environment
elixir -v
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:14:14] [ds:14:14:10] [async-threads:1] [jit]
Elixir 1.18.3 (compiled with Erlang/OTP 27)
mix deps | grep sentry
* sentry 11.0.4 (Hex package) (mix)
locked at 11.0.4 (sentry) feaafc28
Our Sentry config is nothing special, I believe:
config :sentry,
  dsn: sentry_dsn,
  environment_name: environment,
  enable_source_code_context: true,
  root_source_code_paths: [File.cwd!()],
  integrations: [
    oban: [
      capture_errors: true,
      cron: [enabled: true]
    ]
  ]
(I did need to write custom code to make error reporting from gRPC work, but I don't think that's too relevant here - in the end it calls Sentry.capture_exception and is modeled after the Plug integration.)
It's a Phoenix application with Ecto etc.; it does have gRPC, but otherwise nothing too fancy, I'd say.
Solution Brainstorm
I'm not too sure; my expectation was that eventually Sentry would give up and just throw away the events once it had failed to submit them too many times or too many events were outstanding on the client side.
The best comparison I have is the protection the Erlang logger uses to keep its handlers from being overloaded: it tracks message queue length (among other things) and enters drop or flush modes to get rid of extra messages and avoid getting overloaded.
Especially when the Sentry response is a 429 because the quota was exceeded, I think dropping some events is the right call. I'm not sure whether the client can know that the monthly quota was exceeded, but if it can, then dropping events seems especially justified.
I know, not losing events is a noble goal, but as it is it's a bit counterproductive.
As an aside, in case someone else runs into this situation: a Sentry client function to help you discard queued events would be nice. I first tried to kill the Sentry.Supervisor, but that took the entire application down - whoopsie, my fault 😅 It took a while to find the right child to terminate. A helper may be nice.
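For anyone else who ends up here mid-incident, this is roughly the workaround I used from a remote IEx shell. It is not an official API and you lose whatever events are still queued, so treat it as a last resort:

# Drop everything the sender pool is currently holding by terminating it...
Supervisor.terminate_child(Sentry.Supervisor, Sentry.Transport.SenderPool)
# ...and bring it back up so future events can still be reported.
Supervisor.restart_child(Sentry.Supervisor, Sentry.Transport.SenderPool)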
I know I can be rambly, sorry. In conclusion:
- a client-side limit on events being processed, similar to the Logger overload protections, may be good, to also limit the memory impact on applications (see the sketch after this list)
- if an exceeded quota is detected, dropping even more events instead of continuously trying to submit them may be good
- a helper function to drop in-flight events may be neat for people live-debugging a system
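To make the first point a bit more concrete, here is a minimal, untested sketch of the Logger-style overload protection I have in mind, applied to a made-up sender process. The module name, the threshold, and the deliver/1 stub are all placeholders, not SDK code:

# Logger-style overload protection: check our own mailbox length on every
# event and drop instead of sending once it grows past a threshold.
defmodule MyApp.OverloadAwareSender do
  use GenServer

  @drop_above 500

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def send_event(pid, event), do: GenServer.cast(pid, {:event, event})

  @impl true
  def init(_opts), do: {:ok, %{dropped: 0}}

  @impl true
  def handle_cast({:event, event}, state) do
    {:message_queue_len, len} = Process.info(self(), :message_queue_len)

    if len > @drop_above do
      # Overloaded: count the event as dropped instead of holding onto it.
      {:noreply, %{state | dropped: state.dropped + 1}}
    else
      deliver(event)
      {:noreply, state}
    end
  end

  # Stand-in for the actual HTTP POST to Sentry.
  defp deliver(_event), do: :ok
end

The Erlang logger additionally has a flush mode that empties the whole mailbox at once when things get even worse; that could be a second tier above the drop threshold.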
Support bunny
@PragTob hiya! Thanks for the bunny. I think we should go ahead and implement load shedding and thresholds like we do for Sentry.LoggerHandler:
https://github.com/getsentry/sentry-elixir/blob/master/lib/sentry/logger_handler.ex#L65-L107
However, this falls to the Sentry team to decide. I don't know if they're ok with effectively dropping errors here.
@whatyouhide thank you! Yeah exactly that's what I'd be looking for! Sounds like a great solution! 🚀
We're seeing this same behavior in our system, but mostly with transactions rather than errors. We have very spiky traffic where we're processing relatively few requests per minute for most of the day, but then we'll spike for a couple minutes and then have increased traffic for a while afterwards:
We've been seeing OOMs after a few of these spikes (in the example spike above, one pod OOMed right before 15:00), and after looking into it more, it appears that they're being triggered mainly by the Sentry transport pool getting backed up. Unfortunately I don't have any data from the pods that have OOMed, but here are the stats from a fresher one:
So the Sentry.Transport.Sender processes are using 1G out of the 1.6G of memory in use on that pod. And we're seeing the rate limited requests in the Sentry dashboard.
This got me thinking about a few different things:
- Does the ingest endpoint allow multiple transactions to be provided as part of a single envelope? It looks like right now we're putting a single event in each envelope, and I'm wondering if we could receive multiple messages from the process mailbox and send them together as a batch to clear the memory faster. I'm guessing I could look at some of the other SDK implementations to get an answer to that.
- For us the ideal would be to shed spans, but not exceptions. When we let the pod OOM, we lose all the events anyway, so we figure we need to shed some load, and it would be ideal if we could prioritize which event types to shed. It looks like the Logger implementation @whatyouhide posted would be a good model for that.
We're thinking about a few workarounds on our side:
- Update the persistent_term for the config to change the sample rate when we see the queue lengths climbing
- If queue length gets really bad, just kill a couple of the Sender processes to free up some memory
Would there be any downsides to either of these?
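For what it's worth, here is roughly what we're picturing for the first workaround: a small process that watches the backlog and dials the sample rate down while things are backed up. It assumes Sentry.put_config/2 and a :traces_sample_rate key are available in the SDK version in use (please correct me if that's wrong; otherwise we'd write to the persistent_term directly as mentioned above), and queue_depth/0 is just a stand-in for however you measure the backlog:

# Periodically look at how backed up the senders are and adjust the sample
# rate via runtime config. Thresholds and rates are made-up placeholders.
defmodule MyApp.SentrySampler do
  use GenServer

  @check_every :timer.seconds(10)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:check, state) do
    # Sample less aggressively while the backlog is large.
    rate = if queue_depth() > 1_000, do: 0.01, else: 0.25
    Sentry.put_config(:traces_sample_rate, rate)
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :check, @check_every)

  # Stand-in: sum of message_queue_len across the sender processes,
  # a custom metric, etc.
  defp queue_depth, do: 0
end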
I spent some more time digging into this today, and found a couple different things in the SDK docs on Sentry's website. And I have a proposal at the end.
Research
1. Dropping errors
First was this bit about rate limit enforcement: https://develop.sentry.dev/sdk/expected-features/rate-limiting/#rate-limit-enforcement
SDKs are expected to honor 429 responses and rate limits communicated from Sentry by stopping data transmission until the rate limit has expired. Events, transactions, sessions, and/or any other supported payload type, depending on the communicated limits, are to be discarded during the period that the rate limit is in place.
So it looks like in response to @whatyouhide's question about whether the Sentry team is ok with dropping errors here, it seems like that is the supported behavior.
2. Backpressure for traces
This is maybe less related to the original issue since it's meant specifically for trace spans, but it looks like the SDKs are meant to manage backpressure as well: https://develop.sentry.dev/sdk/telemetry/traces/backpressure
So specifically there should be a health check monitor that checks every 10 seconds:
- whether the background worker queue is full
- whether any rate limits are currently active
If the system is unhealthy, a downsample_factor should double every 10 seconds until the system is back in a healthy state.
3. Separate rate limits
Per the rate limiting page, the Sentry API will specify different rate limits per category, so the SDK should scope its rate limiting to those categories as well. If we're getting rate limited on traces, errors should still be able to go through.
This won't be a big deal for clients that are only sending errors to Sentry, but as the SDK supports more of Sentry's features, this will become more important.
Answers to my thoughts from last week
- It appears that only a single trace span should be included in each envelope.
- It looks like the recommended approach is to shed load only for the categories that are being rate limited, which matches my thinking here.
- My first workaround idea for changing our sample rate when load gets too high seems to be the blessed approach from the Sentry team.
Proposal
I propose splitting this work into two changes to the SDK to match the recommended behavior:
1. Drop events instead of waiting
Right now we use Process.sleep to wait out the time until the rate limit has reset. I propose that we create a new ETS table owned by Sentry.Transport where we can store, for each category, the timestamp after which we can start sending again. Then in Sentry.Transport.client_post_and_validate_return_value/4 we should add a call to a new handle_rate_limits/1 function that takes the response headers: if X-Sentry-Rate-Limits is present, it should update the rate limits in ETS. In Sentry.Transport.request/4, if we got a 429, we should also update the ETS table. The big change is that on a 429 we should return {:error, :rate_limited} and discard the event instead of holding onto it until we can start sending again.
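To illustrate, here is an untested sketch of the bookkeeping side; the module and function names are mine and don't exist in the SDK today, and the header parsing glosses over some edge cases (e.g. a limit with an empty category list, which applies to all categories):

# One ETS entry per category with the timestamp until which that category
# is blocked. The header format follows the develop docs linked earlier.
defmodule Sentry.Transport.RateLimits do
  @table __MODULE__

  def create_table do
    :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
  end

  # Example header: "60:transaction:organization, 2700:default;error:organization"
  def store_from_header(header) when is_binary(header) do
    now = System.system_time(:second)

    header
    |> String.split(",")
    |> Enum.each(fn limit ->
      with [seconds, categories | _] <- limit |> String.trim() |> String.split(":"),
           {retry_after, _} <- Integer.parse(seconds) do
        for category <- String.split(categories, ";", trim: true) do
          :ets.insert(@table, {category, now + retry_after})
        end
      else
        _ -> :ok
      end
    end)
  end

  def rate_limited?(category) when is_binary(category) do
    case :ets.lookup(@table, category) do
      [{^category, blocked_until}] -> System.system_time(:second) < blocked_until
      [] -> false
    end
  end
end

The sender would then ask rate_limited?/1 for the event's category before posting, and return {:error, :rate_limited} (and drop the event) when it says yes.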
2. Set up backpressure for traces
For this we can basically just follow the documentation: https://develop.sentry.dev/sdk/telemetry/traces/backpressure
We would probably need to wait until we are storing the rate limit information so that a health check process can inspect it, and then we would need to have the sample rate take the downsample_factor into account.
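As a starting point, something like this untested sketch is what I imagine for the monitor. unhealthy?/0 is a stand-in for "sender queues are full or a rate limit is active", and resetting the factor to 1 once things recover is my reading of the docs:

# Check system health every 10 seconds; double the downsample factor while
# unhealthy and reset it once things are healthy again.
defmodule Sentry.Backpressure.Monitor do
  use GenServer

  @interval :timer.seconds(10)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Read by the sampling code: effective rate = configured rate / downsample_factor.
  def downsample_factor, do: GenServer.call(__MODULE__, :factor)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{factor: 1}}
  end

  @impl true
  def handle_call(:factor, _from, state), do: {:reply, state.factor, state}

  @impl true
  def handle_info(:check, state) do
    factor = if unhealthy?(), do: state.factor * 2, else: 1
    schedule()
    {:noreply, %{state | factor: factor}}
  end

  defp schedule, do: Process.send_after(self(), :check, @interval)

  # Stand-in health check: sender queue depth, active rate limits, etc.
  defp unhealthy?, do: false
end

The sampling code would then divide the configured traces sample rate by downsample_factor/0 when making its decision.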
Feedback?
I'm happy to take on this work because this is important for us at Surge, but I would love to know if there's any feedback on the proposal before I start.
@PragTob I'm sorry you had such an experience, and thanks for this detailed report! In addition to implementing the fixes and improvements already mentioned here, I'm also going to look into why spike protection didn't kick in for you.
@dnsbty thanks for the research and proposal! A PR with the event-dropping change would be fantastic as the first step. Please let me know if you're still interested in working on this 💜
@solnic 👋 my hunch is that spike protection did kick in, but since the BEAM is so awesome, Sentry kept retrying the events over a longer time and hence the spike was broken up. If you need the project ID or some other info, please let me know!
@solnic I'll work on that PR and send it over sometime in the next week or two.