Massive increase in transactions upgrading from 5.13.0 to 5.17.3
We did some routine upgrades on 5/24, bumping our Sentry Ruby SDK from 5.13.0 to 5.17.3.
This was late on a Friday. When our weekday traffic picked up the following week, we noticed an alarming number of transactions being logged.
This resulted in a much larger Sentry bill than we were expecting. We've been trying to trace where this came from, but we can't find anything. All of the increase in transactions is coming from our API service (Rails).
Our frontend is NextJS, and we also upgraded its Sentry dependencies, from 7.110.1 to 8.4.0. I mention that because of our tracing setup: if the client were sending many more requests with tracing enabled, that could result in a large increase in transactions. Here's our Rails Sentry initializer:
# frozen_string_literal: true

Sentry.init do |config|
  config.dsn = Rails.application.credentials&.sentry&.fetch(:dsn)
  config.breadcrumbs_logger = [:active_support_logger, :http_logger, :sentry_logger]
  config.enabled_environments = ["production"]
  config.release = ENV["RELEASE_SHA"]

  if Sidekiq.server?
    config.excluded_exceptions.delete("ActiveRecord::RecordNotFound")
  end

  config.traces_sampler = lambda do |sampling_context|
    # Respect the upstream (frontend) sampling decision when one is present.
    unless sampling_context[:parent_sampled].nil?
      next sampling_context[:parent_sampled]
    end

    # Otherwise, sample Sidekiq transactions at a negligible rate and drop everything else.
    transaction = sampling_context[:transaction_context]
    op = transaction[:op]

    if op.match?("sidekiq")
      0.000000000001
    else
      0.0
    end
  end
end
However, I do see our client sample rate still at 2.5% when looking at traces.
We're having a hard time understanding where the increase in transactions is coming from. I've looked at the releases between 5.13 and 5.17 and don't see anything obvious.
I'd love some help understanding what is happening: an explanation of what new transactions are being sent, where to look in Sentry to see more details, or anything else.
Thank you!
@rnystrom the only transaction-related change on the Ruby side between those releases is additional DelayedJob tracing, but you don't seem to be using that.
~~Your frontend upgrade to v8 also includes our massive OpenTelemetry refactor, so that could indeed cause more frontend (and thus more linked backend) transactions.~~ I confirmed with our JS team, and ideally this shouldn't be the case.
I need more info to help you further:
- your sentry org slug (if you don't want it visible publicly on github, please route your request through [email protected] and link this issue)
- your frontend Sentry initializer config
- what kind of transactions have increased in volume compared to before (you can sort by TPM in your Performance section to find high volume transactions - see picture below)
Thanks @sl0thentr0py! I can provide everything here in the hopes that it helps anyone else.
> your sentry org slug
campsite-software
> your frontend Sentry initializer config
sentry.client.config.js
import * as Sentry from '@sentry/nextjs'
const SENTRY_DSN = process.env.SENTRY_DSN || process.env.NEXT_PUBLIC_SENTRY_DSN
Sentry.init({
dsn: SENTRY_DSN,
beforeSend(event, hint) {
// Check if the error is a TypeError: Failed to fetch
if (hint.originalException instanceof TypeError && hint.originalException.message === 'Failed to fetch') {
// Check if the URL contains "_vercel/speed-insights/vitals"
const url = event.request.url
if (url && url.includes('_vercel/speed-insights/vitals')) {
// Ignore the error
return null
}
}
// If conditions are not met, return the event
return event
},
ignoreErrors: [
// a bunch of error regexes and strings from extensions and more
],
denyUrls: [
// facebook and other extension urls...
],
tracePropagationTargets: ['api.campsite.test', 'api.campsite.design', 'api.campsite.co'],
tracesSampleRate: 0.025,
profilesSampleRate: 1.0,
integrations: [Sentry.httpClientIntegration(), Sentry.browserTracingIntegration(), Sentry.browserProfilingIntegration()]
})
next.config.js
const path = require('path')
/**
* @type {import('next').NextConfig}
*/
const { withSentryConfig } = require('@sentry/nextjs')
// bunch of setup and config...
/** @type {import('next').NextConfig} */
const moduleExports = {
// next config...
async headers() {
return [
// other stuff...
{
// Sentry Profiling
// @see https://docs.sentry.io/platforms/javascript/profiling/#step-2-add-document-policy-js-profiling-header
source: '/(.*)',
headers: [
{
key: 'Document-Policy',
value: 'js-profiling'
}
]
}
]
},
sentry: {
widenClientFileUpload: true,
hideSourceMaps: true
},
// webpack config...
}
const sentryWebpackPluginOptions = {
silent: true, // Suppresses all logs
// For all available options, see:
// https://github.com/getsentry/sentry-webpack-plugin#options.
authToken: process.env.SENTRY_AUTH_TOKEN
}
// Make sure adding Sentry options is the last code to run before exporting, to
// ensure that your source maps include changes from all other Webpack plugins
module.exports = withBundleAnalyzer(withSentryConfig(moduleExports, sentryWebpackPluginOptions))
> what kind of transactions have increased in volume compared to before (you can sort by TPM in your Performance section to find high volume transactions - see picture below)
Here's a before & after of 2-weeks of API transactions. There doesn't appear to be any one product area that increased, instead it seems to be fairly unilateral between controller actions and Sidekiq jobs. Here's a sampling:
- Our Pusher job transactions increased 23x while our actual Pusher usage through May is flat
- OrganizationMembersController#show 2x'd (not as big a jump but still rather large)
- NotificationsController#index transactions 36x'd while our actual endpoint hits are flat
- DeliverWebPushMessageJob 273x'd (our actual web pushes are flat between periods)
Side note: I'd love to view "previous period" in a column as well; it's hard to compare results between time periods grouped by columns. The dotted previous-period line is nice, but it's just a total for the interval.
(Side-by-side screenshots comparing transaction volume for May 8-22 vs. May 22-June 5.)
thx @rnystrom, I don't know the exact problem yet, but here are some observations.
Trace differences before / after
If I look at the /[org]/inbox/[inboxView] transaction before May 25, I see a much smaller trace with 21 events.
Compare that to a more recent trace, which has 177 events.
So for some reason, the traces started on your frontend are now getting linked to far more Rails transactions (via the sentry-trace and baggage headers). The frontend seems to be keeping the same trace_id alive for a very long time.
Because you're using parent_sampled to decide your Rails sampling in your traces_sampler (as you should), you always respect the frontend's sampling decision on the Rails side, and thus far more transactions get sampled via that frontend-sampled trace.
Further steps
- I'm not 100% certain but I don't see any change on the ruby/rails side that would've caused the increase
- The JS SDK changed a lot in how it does tracing with v8 because it now uses OpenTelemetry under the hood, so I will ask our frontend folks to take a look at this too
- If you want to verify this, you could downgrade the NextJS Sentry SDK back to 7.112.2
For cross-reference, here's a running tracker of problems with NextJS in v8:
https://github.com/getsentry/sentry-javascript/issues/11046
ok @Lms24 helped me figure out what's up.
v8 changed how long traces live: https://github.com/getsentry/sentry-javascript/issues/11599
So if your application has a page that lives for a very long time without a navigation, your traces will stay alive much longer with v8. This is valuable feedback on the change, so we might make it more visible or add some tweaks/config to keep traces from getting too big.
For now, if you're not happy with this change, you can either
- just sample less if you want to use less quota (the traces are accurate, they're just much longer now)
- reset the active trace, as shown below, in whatever lifecycle hook (not navigation) you think should be your trace boundary (this is somewhat subjective to the type of application):
// uuid4 is exported from the SDK's utils package (assumed import path for the v8 SDK)
import { uuid4 } from '@sentry/utils'

Sentry.getCurrentScope().setPropagationContext({
  traceId: uuid4(),
  spanId: uuid4().substring(16)
})
Ahh ok, sounds good. I'll start by lowering the client sample rate significantly. The change makes sense; it was just quite shocking how big an impact it had on us (particularly on our bill...). I'll report back once we deploy the change.
So if I lower the client sample rate from 0.025 to, say, 0.0025 (10x lower), we will have far fewer API transactions, but we will also have far fewer client traces. This is a bummer, as we'll now have significantly less client trace data for perf monitoring and debugging just to keep our API spend lower.
We may need to implement a custom trace boundary like you suggested, though this whole change is quite a headache, as it changes the mental model for trace sampling.
Hey @rnystrom, I'm really sorry for the trouble! We didn't take this decision lightly, but rather out of necessity, to resolve the confusion around the previous behaviour in version 7 of the SDKs. We knew there would be cases where the new, page-based trace lifetime causes unexpected behaviour, but it's not great that you're affected in this way.
As an alternative to manually resetting the trace yourself (which, I agree, is probably the best you can do for now), you could also downgrade to v7 of the NextJS SDK. It's still supported, although new features will most likely only go into v8.
We're still re-evaluating the page-based trace lifetime internally, and ideally we can come up with a better approach. Unfortunately, this entails quite a lot of changes across the Sentry product, so I can't make any promises or give ETAs.
No worries, I appreciate the understanding. We definitely want to stay up to date with the SDKs so we can tag along with improvements and new features.
If you all have any advice on our sampling setup so that we can receive more client-side sampling without blowing up our quotas with API sampling, that'd be very helpful.
> If you all have any advice on our sampling setup so that we can receive more client-side sampling without blowing up our quotas with API sampling, that'd be very helpful.
You could in principle ignore parent_sampled (the client's sampling decision) sometimes and reduce the API sampling rates independently for API routes you don't care about. That just means those API transactions will be missing from the complete frontend-backend trace.
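To make that concrete, here is a rough sketch of what that could look like inside the Sentry.init block above. It assumes the transaction name is exposed as sampling_context[:transaction_context][:name], and the low_value_transactions list is hypothetical (the names are borrowed from the examples earlier in this thread); check your Performance page for the exact transaction names before relying on it.

# Sketch only: de-prioritize specific high-volume transactions regardless of the
# frontend's sampling decision. These transactions may then be missing from
# otherwise-complete frontend->backend traces.
low_value_transactions = [
  "NotificationsController#index",
  "DeliverWebPushMessageJob"
].freeze

config.traces_sampler = lambda do |sampling_context|
  transaction = sampling_context[:transaction_context]

  # Low-value transactions get their own reduced rate, ignoring parent_sampled.
  next 0.001 if low_value_transactions.include?(transaction[:name])

  # Everything else keeps respecting the frontend's sampling decision, as before.
  unless sampling_context[:parent_sampled].nil?
    next sampling_context[:parent_sampled]
  end

  transaction[:op].match?("sidekiq") ? 0.000000000001 : 0.0
end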