sentry-java icon indicating copy to clipboard operation
sentry-java copied to clipboard

Sentry metrics appear to be sampled and therefore low-volume metrics frequently don't appear at Sentry servers

Open rafalmemyard opened this issue 1 year ago • 9 comments

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

Run a Java service with Sentry SDK v7.9.0 Have a piece of code that logs higher-volume metrics and lower-volume metrics Notice that while higher-volume metrics generally appear in Sentry dashboards, lower-volume metrics rarely make it.

Expected Result

Both high-volume and low-volume metrics are sent and are not sampled on the client side.

Actual Result

High volume metrics appear in the dashboard, although at much lower rates than the incidence of those metrics. Low volume metrics frequently don't appear on the dashboard nor in the list of emitted metrics under "Settings > {Org} > {Project} > Metrics". Sometimes those low-volume metrics appear days after the initial metric has been emitted.

Additionally, both types of metrics appear to be sampled (on the client?) since the values seen in the Sentry dashboards are much lower than from estimating event incidence by our logs.

Here's a sample of the events for Google Drive API "exportSizeLimitExceeded" error from our logs:

(base) backend-server-dev:~/backend-server-dev$ zgrep -c exportSizeLimitExceeded logs/backend.2024-0* | egrep -v ':0$'
logs/backend.2024-05-29.0.log.zip:6
logs/backend.2024-05-31.0.log.zip:4
logs/backend.2024-06-03.0.log.zip:4
logs/backend.2024-06-04.0.log.zip:2

Whereas here's the screenshot of the Sentry dashboard showing that metrics for the last 14d -- note that it only shows up once on May 31st

Screenshot 2024-06-13 at 10 30 15 AM

Likewise here are some counts from our preview environment of the number of files transferred via the Google Drive API from logs vs. the Sentry metric; note that the max counts/day are 200+ for the high-volume days where as our max value is 71 for that date range.

(base) backend-server-dev:~/backend-server-dev$ zgrep -c 'Transferring file' logs/backend.2024-0* |egrep -v ':0$'
logs/backend.2024-05-31.0.log.zip:257
logs/backend.2024-06-03.0.log.zip:217
logs/backend.2024-06-04.0.log.zip:30
logs/backend.2024-06-05.0.log.zip:21
logs/backend.2024-06-07.0.log.zip:44
logs/backend.2024-06-11.0.log.zip:105
logs/backend.2024-06-12.0.log.zip:128

Screenshot 2024-06-13 at 10 37 03 AM

Note that I've run the Java SDK with debug enabled, and the only messages re: metrics I see are the following types of messages; per your support team I was told I should see the names of captured metrics here but that doesn't appear to be the case for any of the environments I've enabled Sentry debug in:

DEBUG: Metrics: nothing to flush
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Envelope sent successfully.
DEBUG: Envelope flushed

Product Area

DDM

Link

https://memyard.sentry.io/metrics/?interval=1d&project=4507091072385024&statsPeriod=30d&utc=true&widgets=%5B%7B%22mri%22%3A%22c%3Acustom%2Fgoogle_drive.transfer_file.start%40none%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A0%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22c%3Acustom%2Fgoogle_drive.transfer_file.success%40none%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A1%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22c%3Acustom%2Fgoogle_drive.transfer_file.failure%40none%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A2%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22d%3Acustom%2Fstorage_service.train_drive.available_files%40count%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A3%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22d%3Acustom%2Fstorage_service.train_drive.indexable_files%40count%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A4%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22d%3Acustom%2Fstorage_service.train_drive.transferred_files%40count%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A5%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22c%3Acustom%2Fgoogle_drive.transfer_file.failure.file_not_found%40none%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A6%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%2C%7B%22mri%22%3A%22c%3Acustom%2Fgoogle_drive.transfer_file.failure.export_size_limit_exceeded%40none%22%2C%22op%22%3A%22sum%22%2C%22query%22%3A%22%22%2C%22groupBy%22%3A%5B%5D%2C%22powerUserMode%22%3Afalse%2C%22type%22%3A1%2C%22id%22%3A7%2C%22displayType%22%3A%22line%22%2C%22focusedSeries%22%3A%5B%5D%2C%22sort%22%3A%7B%22order%22%3A%22asc%22%7D%2C%22isHidden%22%3Afalse%2C%22overlays%22%3A%5B%22samples%22%5D%7D%5D

DSN

https://8f29233cca683e7b0a93e6a2b859bb17@o4506853392973824.ingest.us.sentry.io/4507091072385024

Version

No response

┆Issue is synchronized with this Jira Improvement by Unito

rafalmemyard avatar Jun 13 '24 14:06 rafalmemyard

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] avatar Jun 13 '24 14:06 getsantry[bot]

Routing to @getsentry/product-owners-ddm for triage ⏲️

getsantry[bot] avatar Jun 13 '24 18:06 getsantry[bot]

Thank you for raising this, we're looking into it 🙏

matejminar avatar Jun 14 '24 12:06 matejminar

@rafalmemyard Thanks for reaching out! The Java SDK does not sample metrics on the client side. Only if (sentry) server side rate limiting is in effect, metrics could get dropped. But as far as I can see that's not the case for you.

Regarding the log messages: We do not log individual metrics, as this could be overwhelming, but rather log a summary on every flush. If you emit n different metrics you should see a log similar to this:

DEBUG: Metrics: flushing <n> buckets

Maybe that's something you can verify?

Apart from that and depending on your setup you might need to call Sentry.close() which will flush any unsent metrics.

markushi avatar Jun 17 '24 06:06 markushi

@rafalmemyard Thanks for reaching out! The Java SDK does not sample metrics on the client side. Only if (sentry) server side rate limiting is in effect, metrics could get dropped. But as far as I can see that's not the case for you.

OK, thanks for checking on that.

Regarding the log messages: We do not log individual metrics, as this could be overwhelming, but rather log a summary on every flush. If you emit n different metrics you should see a log similar to this:

DEBUG: Metrics: flushing <n> buckets

Maybe that's something you can verify?

I do see those messages, but the number of buckets is always 1, even though I expect I should see multiple metrics flushed in some cases where I log '....started', '....completed', '....error.' metrics when calling the Google Drive APIs in this case (see example code in the linked support ticket).

I also never see more than 1 of the flushing <n> buckets messages in a row, generally they look like this:

DEBUG: Metrics: nothing to flush
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets
DEBUG: Metrics: capturing metrics
DEBUG: Metrics: nothing to flush
DEBUG: Metrics: flushing 1 buckets

Is there some significance to the DEBUG: capturing metrics message? Ie, it should have been capturing metrics the whole time.

Apart from that and depending on your setup you might need to call Sentry.close() which will flush any unsent metrics.

I assume this shuts down the SDK totally, and since this is a long-running service that just won't do. It appears that I could maybe call Sentry.flush() at the end of some of the functions I call in a thread pool -- but at least some of the JavaDocs say the method is unimplemented in Java SDK 7.9.0 (though it looks like the internals call it and the definitions I found don't look like they throw any unsupported exceptions). Thoughts on this?

rafalmemyard avatar Jun 17 '24 13:06 rafalmemyard

@rafalmemyard Thanks for the additional details, this sounds awfully similar to this issue which appeared in 7.9.0

Could you try the previous "working" Java SDK version 7.8.0?

markushi avatar Jun 17 '24 21:06 markushi

@rafalmemyard Thanks for the additional details, this sounds awfully similar to this issue which appeared in 7.9.0

Could you try the previous "working" Java SDK version 7.8.0?

Will definitely try downgrading to 7.8.0 and report -- I think I initially saw some issues with 7.8.0 and then tried to upgrade in hopes things would get better, but that ticket does hit very close to what I'm seeing.

rafalmemyard avatar Jun 17 '24 22:06 rafalmemyard

@rafalmemyard Thanks for the additional details, this sounds awfully similar to this issue which appeared in 7.9.0 Could you try the previous "working" Java SDK version 7.8.0?

Will definitely try downgrading to 7.8.0 and report -- I think I initially saw some issues with 7.8.0 and then tried to upgrade in hopes things would get better, but that ticket does hit very close to what I'm seeing.

I still think I'm seeing way fewer metrics with 7.8.0 than I should -- @markushi can you let me know when 7.11.0 (or whichever version has your fix from https://github.com/getsentry/sentry-java/pull/3493) ships?

rafalmemyard avatar Jun 28 '24 23:06 rafalmemyard

@rafalmemyard we're aiming to get the fix out early this week. I'll keep you posted!

markushi avatar Jul 01 '24 06:07 markushi

@rafalmemyard 7.11.0 is out, could you give it a try and let us know if this got fixed?

romtsn avatar Jul 02 '24 07:07 romtsn

I've just done one test locally with 7.11.0 and it's already looking better than 7.8.0 or later versions. Will deploy to our staging apps to verify that things look better there as well and re-ping you all, but the initial results look good!

rafalmemyard avatar Jul 02 '24 21:07 rafalmemyard

I also noticed this not just for metrics but for regular events. I am using python sdk in this case. See reported issue https://github.com/getsentry/relay/issues/3873

Thinking this is a relay issue, but haven't yet looked into the sdk itself. Will be looking into python sdk next .

chipzzz avatar Jul 26 '24 22:07 chipzzz

@chipzzz thanks for chiming in and the detailed report. I believe you're experiencing a different issue which is tracked in https://github.com/getsentry/relay/issues/3873 and I'm gonna close this one since the fix resolved the problems reported here. In case something related to this comes up again, feel free to comment here.

kahest avatar Jul 29 '24 08:07 kahest