Massive increase in transactions upgrading from 5.13.0 to 5.17.3
We did some routine upgrades on 5/24, bumping our Sentry Ruby SDK from 5.13.0 to 5.17.3.
This was late on a Friday. When our weekday traffic picked up the following week, we noticed an alarming number of transactions being logged.
This resulted in a much larger Sentry bill than we were expecting. We've been trying to trace where this came from, but we can't find anything. All of the increase in transactions is coming from our API service (Rails).
Our frontend is NextJS, and we also upgraded its Sentry dependencies, from 7.110.1 to 8.4.0. I mention that because of our tracing setup: if the client were sending many more requests with tracing enabled, that could result in a large increase in transactions. Here's our Rails Sentry initializer:
# frozen_string_literal: true

Sentry.init do |config|
  config.dsn = Rails.application.credentials&.sentry&.fetch(:dsn)
  config.breadcrumbs_logger = [:active_support_logger, :http_logger, :sentry_logger]
  config.enabled_environments = ["production"]
  config.release = ENV["RELEASE_SHA"]

  if Sidekiq.server?
    config.excluded_exceptions.delete("ActiveRecord::RecordNotFound")
  end

  config.traces_sampler = lambda do |sampling_context|
    # Respect the upstream (frontend) sampling decision when one is present.
    unless sampling_context[:parent_sampled].nil?
      next sampling_context[:parent_sampled]
    end

    # Otherwise, sample Sidekiq transactions at a negligible rate and drop everything else.
    transaction = sampling_context[:transaction_context]
    op = transaction[:op]

    if op.match?("sidekiq")
      0.000000000001
    else
      0.0
    end
  end
end
However, I do see our client sample rate still at 2.5% when looking at traces.
We're having a hard time understanding where the increase in transactions is coming from. I've looked at the releases between 5.13 and 5.17 and don't see anything obvious.
I'd love some help understanding what is happening: an explanation of what new transactions are being sent, where to look in Sentry to see more details, or anything else.
Thank you!
@rnystrom the only transaction-related change on the Ruby side between those releases is additional DelayedJob tracing, but you don't seem to be using that.
~~Your frontend upgrade to v8 also includes our massive OpenTelemetry refactor, so that could indeed cause more frontend (and thus more linked backend) transactions.~~ I confirmed with our JS team, and ideally this shouldn't be the case.
I need more info to help you further:
- your sentry org slug (if you don't want it visible publicly on github, please route your request through [email protected] and link this issue)
- your frontend Sentry initializer config
- what kind of transactions have increased in volume compared to before (you can sort by TPM in your Performance section to find high volume transactions - see picture below)
Thanks @sl0thentr0py! I can provide everything here in the hopes that it helps anyone else.
> your sentry org slug
campsite-software
> your frontend Sentry initializer config
sentry.client.config.js
import * as Sentry from '@sentry/nextjs'
const SENTRY_DSN = process.env.SENTRY_DSN || process.env.NEXT_PUBLIC_SENTRY_DSN
Sentry.init({
dsn: SENTRY_DSN,
beforeSend(event, hint) {
// Check if the error is a TypeError: Failed to fetch
if (hint.originalException instanceof TypeError && hint.originalException.message === 'Failed to fetch') {
// Check if the URL contains "_vercel/speed-insights/vitals"
const url = event.request.url
if (url && url.includes('_vercel/speed-insights/vitals')) {
// Ignore the error
return null
}
}
// If conditions are not met, return the event
return event
},
ignoreErrors: [
// a bunch of error regexes and strings from extensions and more
],
denyUrls: [
// facebook and other extension urls...
],
tracePropagationTargets: ['api.campsite.test', 'api.campsite.design', 'api.campsite.co'],
tracesSampleRate: 0.025,
profilesSampleRate: 1.0,
integrations: [Sentry.httpClientIntegration(), Sentry.browserTracingIntegration(), Sentry.browserProfilingIntegration()]
})
next.config.js
const path = require('path')
/**
* @type {import('next').NextConfig}
*/
const { withSentryConfig } = require('@sentry/nextjs')
// bunch of setup and config...
/** @type {import('next').NextConfig} */
const moduleExports = {
// next config...
async headers() {
return [
// other stuff...
{
// Sentry Profiling
// @see https://docs.sentry.io/platforms/javascript/profiling/#step-2-add-document-policy-js-profiling-header
source: '/(.*)',
headers: [
{
key: 'Document-Policy',
value: 'js-profiling'
}
]
}
]
},
sentry: {
widenClientFileUpload: true,
hideSourceMaps: true
},
// webpack config...
}
const sentryWebpackPluginOptions = {
silent: true, // Suppresses all logs
// For all available options, see:
// https://github.com/getsentry/sentry-webpack-plugin#options.
authToken: process.env.SENTRY_AUTH_TOKEN
}
// Make sure adding Sentry options is the last code to run before exporting, to
// ensure that your source maps include changes from all other Webpack plugins
module.exports = withBundleAnalyzer(withSentryConfig(moduleExports, sentryWebpackPluginOptions))
> what kind of transactions have increased in volume compared to before (you can sort by TPM in your Performance section to find high volume transactions - see picture below)
Here's a before & after of 2-weeks of API transactions. There doesn't appear to be any one product area that increased, instead it seems to be fairly unilateral between controller actions and Sidekiq jobs. Here's a sampling:
- Our Pusher job transactions increased 23x while our actual Pusher usage through May is flat
- OrganizationMembersController#show 2x'd (not as big a jump but still rather large)
- NotificationsController#index transactions 36x'd while our actual endpoint hits are flat
- DeliverWebPushMessageJob 273x'd (our actual web pushes are flat between periods)
Side note: I'd love to view "previous period" in a column as well; it's hard to compare results between time periods grouped by columns. The dotted previous-period line is nice, but it's just a total for the interval.
(Side-by-side screenshots comparing transaction volume for May 8-22 vs. May 22-June 5.)
thx @rnystrom, I don't know the exact problem yet, but here are some observations.
Trace differences before / after
If I look at the /[org]/inbox/[inboxView] transaction before May 25, I see a much smaller trace with 21 events.
Compare that to a more recent trace, which has 177 events.
So for some reason, the traces started on your frontend are now getting linked to far more Rails transactions (via the sentry-trace and baggage headers). The frontend seems to be keeping the same trace_id alive for a very long time.
Because you're using parent_sampled to decide your Rails sampling in your traces_sampler (as you should), you always respect the frontend's sampling decision on the Rails side, and thus far more transactions get sampled via that frontend-sampled trace.
Further steps
- I'm not 100% certain but I don't see any change on the ruby/rails side that would've caused the increase
- The JS SDK changed a lot in how it does tracing with v8 because it now uses OpenTelemetry under the hood, so I will ask our frontend folks to take a look at this too
- If you want to verify this, you could downgrade the NextJS Sentry SDK back to 7.112.2
For cross-reference, here's a running tracker of problems with NextJS in v8:
https://github.com/getsentry/sentry-javascript/issues/11046
ok @Lms24 helped me figure out what's up.
v8 changed how long traces live: https://github.com/getsentry/sentry-javascript/issues/11599
So if your application has a page that lives for a very long time without a navigation, your traces will stay alive much longer with v8. This is valuable feedback on the change, so we might make it more visible or add some tweaks/config to keep traces from getting too big.
For now, if you're not happy with this change, you can either
- just sample less if you want to use less quota (the traces are accurate, they're just much longer now)
- reset the active trace, as shown below, in whatever lifecycle hook (not navigation) you think should be your trace boundary (this is somewhat subjective to the type of application):
// uuid4 is exported from the SDK's utils package (assumed import path for the v8 SDK)
import { uuid4 } from '@sentry/utils'

Sentry.getCurrentScope().setPropagationContext({
  traceId: uuid4(),
  spanId: uuid4().substring(16)
})
Ahh ok, sounds good. I'll start by lowering the client sample rate significantly. The change makes sense; it was just quite shocking how big an impact it had on us (particularly on our bill...). I'll report back once we deploy the change.
So if I lower the client sample rate from 0.025 to, say, 0.0025 (10x lower), we will have far fewer API transactions, but we will also have far fewer client traces. This is a bummer, as we'll now have significantly less client trace data for perf monitoring and debugging just to keep our API spend lower.
We may need to implement a custom trace boundary like you suggested, though this whole change is quite a headache, as it changes the mental model for trace sampling.
Hey @rnystrom, I'm really sorry for the trouble! We didn't take this decision lightly, but rather out of necessity, to resolve the confusion around the previous behaviour in version 7 of the SDKs. We knew there would be cases where the new, page-based trace lifetime causes unexpected behaviour, but it's not great that you're affected in this way.
As an alternative to manually resetting the trace yourself (which, I agree, is probably the best you can do for now), you could also downgrade to v7 of the NextJS SDK. It's still supported, although new features will most likely only go into v8.
We're still re-evaluating the page-based trace lifetime internally, and ideally we can come up with a better approach. Unfortunately, this entails quite a lot of changes across the Sentry product, so I can't make any promises or give ETAs.
No worries, I appreciate the understanding. We definitely want to stay up to date with the SDKs so we can tag along with improvements and new features.
If you all have any advice on our sampling setup so that we can receive more client-side sampling without blowing up our quotas with API sampling, that'd be very helpful.
> If you all have any advice on our sampling setup so that we can receive more client-side sampling without blowing up our quotas with API sampling, that'd be very helpful.
You could in principle ignore parent_sampled (the client's sampling decision) sometimes and reduce the API sampling rates independently for API routes you don't care about. That just means those API transactions will be missing from the complete frontend-backend trace.
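To make that concrete, here is a rough sketch of what that could look like inside the Sentry.init block above. It assumes the transaction name is exposed as sampling_context[:transaction_context][:name], and the low_value_transactions list is hypothetical (the names are borrowed from the examples earlier in this thread); check your Performance page for the exact transaction names before relying on it.

# Sketch only: de-prioritize specific high-volume transactions regardless of the
# frontend's sampling decision. These transactions may then be missing from
# otherwise-complete frontend->backend traces.
low_value_transactions = [
  "NotificationsController#index",
  "DeliverWebPushMessageJob"
].freeze

config.traces_sampler = lambda do |sampling_context|
  transaction = sampling_context[:transaction_context]

  # Low-value transactions get their own reduced rate, ignoring parent_sampled.
  next 0.001 if low_value_transactions.include?(transaction[:name])

  # Everything else keeps respecting the frontend's sampling decision, as before.
  unless sampling_context[:parent_sampled].nil?
    next sampling_context[:parent_sampled]
  end

  transaction[:op].match?("sidekiq") ? 0.000000000001 : 0.0
end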