"Output Events Rate" in stack monitoring is always zero

Open axw opened this issue 3 years ago • 12 comments

APM Server version (apm-server version): 8.3.0-BC4

Description of the problem including expected versus actual behavior:

"Output Events Rate" in stack monitoring is always zero.

Steps to reproduce:

  1. Start 8.3.0-BC4 with stack monitoring enabled.
  2. Send some events, check that they show up in the APM UI.
  3. Navigate to stack monitoring, observe the "Output Events Rate" chart is always reporting zero.

axw • Jun 15 '22 03:06

Hmm, I just reconfigured the integration with expvar enabled, and now it's working. Maybe there's a race condition?

axw • Jun 15 '22 04:06

Happened again after upgrading from 8.2.3 to 8.3.0-BC4. Initially the output was zero; after reconfiguring the integration (this time changing the event rate limit), the output went non-zero.

axw • Jun 15 '22 05:06

This is apparently still an issue, at least in system tests, as seen here:

https://apm-ci.elastic.co/blue/organizations/jenkins/apm-server%2Fapm-server-mbp%2FPR-9014/detail/PR-9014/1/pipeline/

axw • Aug 31 '22 06:08

I haven't been able to reproduce this exact error. However, because of the way our instrumentation works, it is possible that after a reload event the old modelindexer is still receiving data while the instrumentation has already moved to the new modelindexer. We wait for the old modelindexer to shut down gracefully, but we switch the monitoring to the new modelindexer before the old one exits.

As a result, the instrumentation will report 0 until the old indexer shuts down.
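
To make the ordering concrete, here is a minimal sketch of the scenario described above. It does not use the real apm-server or modelindexer code: `indexer`, `monitor`, and `reloadScenario` are hypothetical names invented for illustration, and the sketch is deliberately single-goroutine so the ordering is deterministic.

```go
package main

import "fmt"

// indexer is a hypothetical stand-in for a modelindexer: it only counts
// the events it has been asked to index.
type indexer struct {
	added int64
}

// monitor is a hypothetical stand-in for the monitoring hook. It reports
// the counter of whichever indexer it currently points at.
type monitor struct {
	current *indexer
}

func (m *monitor) eventsReported() int64 { return m.current.added }

// reloadScenario replays the ordering described above and returns what
// monitoring reports versus what was actually indexed.
func reloadScenario() (reported, actual int64) {
	oldIdx, newIdx := &indexer{}, &indexer{}
	m := &monitor{current: oldIdx}

	// Reload: monitoring is switched to the new indexer first...
	m.current = newIdx
	// ...while the old indexer, still shutting down, keeps receiving events.
	oldIdx.added += 100

	return m.eventsReported(), oldIdx.added + newIdx.added
}

func main() {
	reported, actual := reloadScenario()
	fmt.Printf("reported=%d actual=%d\n", reported, actual)
}
```

In this sketch the events handled by the old indexer after the switch are never visible to monitoring, which would show up as a zero rate until traffic reaches the new indexer.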

lahsivjar • Sep 21 '22 12:09

Moving this to the backlog since we haven't spent more time recently tracking this down.

simitt • Nov 22 '22 21:11

It appears that this bug led to an incident (https://github.com/elastic/cloud/issues/110723) and should be prioritized.

tegenterter • Dec 16 '22 18:12

Moved it into the 8.7 milestone again so it can be picked up and we can verify whether this is still a bug in current versions.

simitt • Dec 23 '22 15:12

I don't recall if this has already been ruled out, but I realise now that I never wrote down on this issue a possible contributing factor: every time we reconfigure the server, we create a new libbeat monitoring registry: https://github.com/elastic/apm-server/blob/32a167b81356e19e9e173bb58a0503eea5e80e3d/internal/beater/beater.go#L628
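
One way a freshly created registry could contribute is sketched below. This is an assumption-laden illustration, not libbeat's API: `registry`, `root`, `recreate`, and `staleReaderScenario` are hypothetical, and the counter name is only an example. It shows that anything still holding the registry from before a reconfigure keeps reading the old, no longer updated, instance.

```go
package main

import "fmt"

// registry is a hypothetical, simplified metrics registry (not libbeat's API).
type registry struct {
	counters map[string]int64
}

// root maps names to registries, loosely like a parent namespace.
type root struct {
	children map[string]*registry
}

// recreate replaces whatever is registered under name with a fresh
// registry, mimicking "create a new registry on every reconfigure".
func (r *root) recreate(name string) *registry {
	reg := &registry{counters: map[string]int64{}}
	r.children[name] = reg
	return reg
}

// staleReaderScenario returns the value seen through a reference captured
// before the reconfigure, and the value actually written afterwards.
func staleReaderScenario() (seenByReader, written int64) {
	const key = "output.events.acked" // example counter name
	r := &root{children: map[string]*registry{}}

	// Initial configuration: a reader captures the registry.
	reader := r.recreate("libbeat")

	// Reconfigure: a new registry replaces the old one, and new events
	// are counted only in the new instance.
	current := r.recreate("libbeat")
	current.counters[key] += 42

	return reader.counters[key], r.children["libbeat"].counters[key]
}

func main() {
	seen, written := staleReaderScenario()
	fmt.Printf("seen=%d written=%d\n", seen, written)
}
```

Whether any component in apm-server or libbeat actually holds on to the old registry like this is not established in this thread; the sketch only shows why the recreation is worth ruling out.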

axw • Apr 04 '23 10:04

Hmm, nice catch. I don't remember any conversation around this so I think this hasn't been ruled out.

lahsivjar • Apr 05 '23 05:04

I was looking at this today and I have 2 questions:

  1. how can I send some test data?
  2. my first hint at this would be to try reusing the libbeatMonitoringRegistry instead of creating it anew like it is done for the output registry https://github.com/elastic/apm-server/blob/32a167b81356e19e9e173bb58a0503eea5e80e3d/internal/beater/beater.go#L634-L639 What do you think?

endorama • May 02 '23 14:05

> how can I send some test data?

You could use https://github.com/elastic/apm-server/tree/main/systemtest/cmd/sendotlp to send test data to APM Server.

> my first hint at this would be to try reusing the libbeatMonitoringRegistry instead of creating it anew like it is done for the output registry

You could try, but I don't think that will work. There are assumptions about there being a 1:1 relationship between metrics and outputs, e.g. here: https://github.com/elastic/apm-server/blob/98806224092aa9646d2cf8466517b0955e8476b6/internal/beater/beater.go#L688-L696

axw • May 08 '23 04:05