rollbar-gem [Rollbar] Reporting internal error encountered while sending data to Rollbar

We're using rollbar-gem 3.1.2 with a Rails 6 app.

We had a burst of errors that were sent to Rollbar, enough of them that we probably hit the rate limit. This wasn't a problem before: Sidekiq retries them and has built-in back-off. This time, however, we saw a long tail of internal errors. The entry below is just one sample of many for the same request_id. It looks like each failing request ended up generating up to thousands of those errors, so this snowballed.

This is what an entry in our error log looked like:

{"host":"***.com","application":"Semantic Logger","environment":"production","timestamp":"2021-03-13T09:09:43.006120Z","level":"error","level_index":4,"pid":6528,"thread":"puma threadpool 001","file":"/app/309/vendor/bundle/ruby/2.6.0/gems/rollbar-3.1.2/lib/rollbar/logger_proxy.rb","line":28,"named_tags":{"request_id":"0419751b0d862b2c0de41817cc5f2e9e","ip":"**.**.**.**","datadog":{"trace_id":2926332308409059493,"span_id":691112431983741319,"env":"production","service":"***","version":null}},"name":"Rails","message":"[Rollbar] Reporting internal error encountered while sending data to Rollbar."}

Eventually the errors died down, I'm guessing as we got fewer and fewer 429s, but a brief DB outage of a few seconds ended up generating errors for a couple of hours.

Nothing too special about our Rollbar config. We use config.use_sidekiq with a sidekiq_threshold = 3.
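
For reference, a minimal sketch of that kind of setup. The initializer path and the environment variable holding the access token are assumptions for illustration, not copied from our app:

```ruby
# config/initializers/rollbar.rb (assumed location)
Rollbar.configure do |config|
  # Assumed: the token is read from an environment variable.
  config.access_token = ENV['ROLLBAR_ACCESS_TOKEN']

  # Deliver reports asynchronously through Sidekiq.
  config.use_sidekiq

  # Only report failed Sidekiq jobs once they have been retried this many times.
  config.sidekiq_threshold = 3
end
```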

I don't think this happened before, but we don't have many DB issues that cause us to hit the rate limit frequently, so it's not something I can easily tell.

Mar 13 '21 15:03 gingerlime

Trying to simulate this on our staging environment, I'm not entirely sure what caused those internal errors. I was able to simulate a rate limit on staging, but didn't get internal errors. Any tips on what might be causing these errors, and perhaps on how to simulate them in our development environment?

Mar 15 '21 07:03 gingerlime

These could be caused by failed access to the Rollbar API, or by anything that caused rollbar-gem to fail internally. If any of the internal errors are visible in your Rollbar dashboard, those can be reviewed to better understand what happened.

There is logic in Rollbar's Sidekiq plugin to limit those kinds of error storms, and that logic was improved a while back. Are you running the latest rollbar-gem in your job runner processes?

Mar 15 '21 13:03 waltjones

Thanks @waltjones, yeah we run the latest rollbar version on puma + sidekiq. I'm still not sure what causes these internal errors. I could kinda reproduce, but only when our database was doing a fail-over, so it's not something I can just easily trigger and control the timeframe of.

I'm not sure I spotted anything on the Rollbar dashboard besides the "regular" errors about the database connection though... Where can I find them?

Mar 15 '21 13:03 gingerlime

If they're there, you'd see them in the items list with your other errors. Maybe what happened is it only logged and didn't attempt to send, which means maybe the intended mitigation worked.

Mar 15 '21 14:03 waltjones

Nothing in our items list besides the "real" errors.

> Maybe what happened is it only logged and didn't attempt to send, which means maybe the intended mitigation worked.

Not sure I completely follow. I still wonder why those errors were logged, and we cannot seem to find anything else in the log that could indicate why an internal error happened / was logged... Is it really useful to just log an "internal error encountered" without attaching other details about the specific nature of the internal error? (And ideally in the same log message, so it's not split across multiple log entries, which makes it harder to pinpoint as well.)
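
To illustrate what I mean, here's a rough sketch of a single structured log entry that carries the exception details along with the message. The helper name, its parameters and the field names are made up for illustration. None of this is part of rollbar-gem, and I'm not claiming this is how the gem logs internally:

```ruby
require 'json'
require 'logger'
require 'stringio'

# Hypothetical helper (not a rollbar-gem API): builds ONE log line that
# includes the exception class, message, and the top of the backtrace.
def internal_error_log_line(exception, context: 'sending data to Rollbar', backtrace_lines: 5)
  JSON.generate(
    message: "[Rollbar] Internal error encountered while #{context}",
    exception_class: exception.class.name,
    exception_message: exception.message,
    # backtrace is nil for exceptions that were never raised
    backtrace: (exception.backtrace || []).first(backtrace_lines)
  )
end

# Usage: everything goes out in a single logger call, so the details
# cannot end up split across separate log entries.
output = StringIO.new
logger = Logger.new(output)

begin
  raise IOError, 'connection reset while posting payload' # simulated failure
rescue StandardError => e
  logger.error(internal_error_log_line(e))
end

puts output.string
```

With something like this, filtering the log by request_id would show the underlying cause right next to the "internal error" message.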

Mar 15 '21 15:03 gingerlime