apm-server
apm-server copied to clipboard
Investigate nightly benchmarks 0 events/s issue
Nightly benchmarks occasionally report 0 events/s. Investigate the root cause of it.
Status update
The first thing I looked at was what was getting reported by the benchmark failures. Here are 2 links to the benchmark run:
- Run with events/sec metric populated - Link to APM-Server logs - Link to deployment
- Run without events/sec metric populated - Link to APM-Server logs - Link to deployment
Both of these show 500 internal error, however, the logs for 0 events/sec additionally show data validation errors due to unexpected EOF. These errors seemed to be logged from here. This could be an issue with our sender, however, the most intriguing thing is why only a subset of delta metrics are reported as 0. For example: in the above link, the txn/sec and metrics/sec are reported correctly whereas other delta metrics are reported as zero.
I have tried reproducing the errors locally but haven't succeeded (note that the expvar metrics collection is designed for benchtimes in minutes so if testing locally make sure that you have a good enough benchtime to give expvar metrics to work correctly). I did see some special handling in the expvar metric collection but nothing explains this bug.
I have also created a PR to log errors in expvar endpoint which was not done before. I am not sure how helpful it will be though.
Is this still happening?
@simitt I had this happen to me in a run on GH Actions last week, see Slack Thread: https://elastic.slack.com/archives/C95SB62AG/p1729263104854879
moving this out of iteration, to backlog. If it happens again more frequently we can reprioritize. Out of curiosity, how many times this happened between end of October and now? cc @rubvs
@raultorrecilla it happens very infrequently. For reference, I've probably ran 300+ benchmarks since Oct, and have only observed this 3 or 4 times.
~~It happened today to an on-demand benchmark run. Also updated the issue description.~~
edit: there seems to be a real issue with main
resolved in https://github.com/elastic/apm-server/pull/15338
resolved in https://github.com/elastic/apm-server/pull/15338
@1pkg #15338 resolved a real issue that causes reproducible 0 events/s.
This GH issue was created for some occasional 0 events/s that auto resolves itself. However, I'm happy to keep it closed and reopen it when needed.
Reopening as it is happening quite frequently lately.
moving it as a candidate for next iteration (it-107)
IMO we should treat this with urgency, ideally one of the on-support duty engineers could look into it. I believe @1pkg started investigation on it as part of support duty last week and could hand over.
Let's wait for https://github.com/elastic/apm-server/pull/15360#issuecomment-2616830145 to be resolved and see if this still happens afterwards.
See https://github.com/elastic/apm-server/pull/15360#issuecomment-2618999903
There might be more to this. The expvar endpoint is working fine and when running the benchmarks locally the numbers are reported correctly (events is not 0).
started a benchmark targeting main without the last 3 commits and it seems to work: https://github.com/elastic/apm-server/actions/runs/13012315722
tip at https://github.com/elastic/apm-server/commit/1e021dba8e6df3078626fd10d608788795ed7103
started another benchmark targeting main without the last commit and it works too: https://github.com/elastic/apm-server/actions/runs/13012675235/job/36293952533
tip at https://github.com/elastic/apm-server/commit/8aac9e384b24bf899b9a98d5d7ac6e2fcabf572c
ok we're running the main branch with and without moxy.
with moxy: everything works and events are not 0 without moxy: 0 events/s
the commit/tip is exactly the same
with moxy: everything works and events are not 0 without moxy: 0 events/s
I can reproduce this. Specifically, in the case of 0 events/s (without moxy), the ESS ES has all apm documents indexed. There is no error. It is the libbeat.output.* that are missing from /debug/vars:
(truncated output)
"apm-server.processor.stream.accepted": 5257954,
"apm-server.processor.stream.errors.invalid": 0,
"apm-server.processor.stream.errors.toolarge": 0,
"apm-server.processor.span.transformations": 3173610,
"apm-server.processor.error.transformations": 105356,
"apm-server.processor.transaction.transformations": 1214548,
"libbeat.config.scans": 0,
"libbeat.config.reloads": 0,
"libbeat.config.module.starts": 0,
"libbeat.config.module.stops": 0,
"libbeat.config.module.running": 0,
"badger_v2_blocked_puts_total": 0,
"badger_v2_disk_reads_total": 0,
"badger_v2_disk_writes_total": 0,
@kruskall and I were trying to narrow down the root cause. Here's the evidence of why I believe the regression was caused by #15360 :
- main with moxy: non-0 events/s: https://github.com/elastic/apm-server/actions/runs/13013064320/job/36295274110
- main without moxy: 0 events/s: https://github.com/elastic/apm-server/actions/runs/13013066590/job/36295281127
- main with #15360 reverted, with moxy: non-0 events/s: https://github.com/elastic/apm-server/actions/runs/13013938309/job/36298252171
- main with #15360 reverted, without moxy: non-0 events/s: https://github.com/elastic/apm-server/actions/runs/13013940799/job/36298261750
should be fixed by https://github.com/elastic/apm-server/pull/15439
benchmarks using the pr branch are working fine
PR merged. Closing this