apm-server icon indicating copy to clipboard operation
apm-server copied to clipboard

Investigate nightly benchmarks 0 events/s issue

Open carsonip opened this issue 1 year ago • 2 comments
trafficstars

Nightly benchmarks occasionally report 0 events/s. Investigate the root cause of it.

carsonip avatar Jul 22 '24 11:07 carsonip

Status update

The first thing I looked at was what was getting reported by the benchmark failures. Here are 2 links to the benchmark run:

  1. Run with events/sec metric populated - Link to APM-Server logs - Link to deployment
  2. Run without events/sec metric populated - Link to APM-Server logs - Link to deployment

Both of these show 500 internal error, however, the logs for 0 events/sec additionally show data validation errors due to unexpected EOF. These errors seemed to be logged from here. This could be an issue with our sender, however, the most intriguing thing is why only a subset of delta metrics are reported as 0. For example: in the above link, the txn/sec and metrics/sec are reported correctly whereas other delta metrics are reported as zero.

I have tried reproducing the errors locally but haven't succeeded (note that the expvar metrics collection is designed for benchtimes in minutes so if testing locally make sure that you have a good enough benchtime to give expvar metrics to work correctly). I did see some special handling in the expvar metric collection but nothing explains this bug.

I have also created a PR to log errors in expvar endpoint which was not done before. I am not sure how helpful it will be though.

lahsivjar avatar Jul 29 '24 13:07 lahsivjar

Is this still happening?

simitt avatar Oct 16 '24 18:10 simitt

@simitt I had this happen to me in a run on GH Actions last week, see Slack Thread: https://elastic.slack.com/archives/C95SB62AG/p1729263104854879

rubvs avatar Oct 23 '24 15:10 rubvs

moving this out of iteration, to backlog. If it happens again more frequently we can reprioritize. Out of curiosity, how many times this happened between end of October and now? cc @rubvs

raultorrecilla avatar Jan 10 '25 23:01 raultorrecilla

@raultorrecilla it happens very infrequently. For reference, I've probably ran 300+ benchmarks since Oct, and have only observed this 3 or 4 times.

rubvs avatar Jan 11 '25 00:01 rubvs

~~It happened today to an on-demand benchmark run. Also updated the issue description.~~

edit: there seems to be a real issue with main

carsonip avatar Jan 22 '25 16:01 carsonip

resolved in https://github.com/elastic/apm-server/pull/15338

1pkg avatar Jan 23 '25 20:01 1pkg

resolved in https://github.com/elastic/apm-server/pull/15338

@1pkg #15338 resolved a real issue that causes reproducible 0 events/s.

This GH issue was created for some occasional 0 events/s that auto resolves itself. However, I'm happy to keep it closed and reopen it when needed.

carsonip avatar Jan 23 '25 21:01 carsonip

Reopening as it is happening quite frequently lately.

carsonip avatar Jan 27 '25 10:01 carsonip

moving it as a candidate for next iteration (it-107)

raultorrecilla avatar Jan 27 '25 15:01 raultorrecilla

IMO we should treat this with urgency, ideally one of the on-support duty engineers could look into it. I believe @1pkg started investigation on it as part of support duty last week and could hand over.

simitt avatar Jan 28 '25 10:01 simitt

Let's wait for https://github.com/elastic/apm-server/pull/15360#issuecomment-2616830145 to be resolved and see if this still happens afterwards.

simitt avatar Jan 28 '25 10:01 simitt

See https://github.com/elastic/apm-server/pull/15360#issuecomment-2618999903

There might be more to this. The expvar endpoint is working fine and when running the benchmarks locally the numbers are reported correctly (events is not 0).

kruskall avatar Jan 28 '25 13:01 kruskall

started a benchmark targeting main without the last 3 commits and it seems to work: https://github.com/elastic/apm-server/actions/runs/13012315722

tip at https://github.com/elastic/apm-server/commit/1e021dba8e6df3078626fd10d608788795ed7103

kruskall avatar Jan 28 '25 14:01 kruskall

started another benchmark targeting main without the last commit and it works too: https://github.com/elastic/apm-server/actions/runs/13012675235/job/36293952533

tip at https://github.com/elastic/apm-server/commit/8aac9e384b24bf899b9a98d5d7ac6e2fcabf572c

kruskall avatar Jan 28 '25 14:01 kruskall

ok we're running the main branch with and without moxy.

with moxy: everything works and events are not 0 without moxy: 0 events/s

the commit/tip is exactly the same

kruskall avatar Jan 28 '25 15:01 kruskall

with moxy: everything works and events are not 0 without moxy: 0 events/s

I can reproduce this. Specifically, in the case of 0 events/s (without moxy), the ESS ES has all apm documents indexed. There is no error. It is the libbeat.output.* that are missing from /debug/vars:

(truncated output)

"apm-server.processor.stream.accepted": 5257954,
"apm-server.processor.stream.errors.invalid": 0,
"apm-server.processor.stream.errors.toolarge": 0,
"apm-server.processor.span.transformations": 3173610,
"apm-server.processor.error.transformations": 105356,
"apm-server.processor.transaction.transformations": 1214548,
"libbeat.config.scans": 0,
"libbeat.config.reloads": 0,
"libbeat.config.module.starts": 0,
"libbeat.config.module.stops": 0,
"libbeat.config.module.running": 0,
"badger_v2_blocked_puts_total": 0,
"badger_v2_disk_reads_total": 0,
"badger_v2_disk_writes_total": 0,

carsonip avatar Jan 28 '25 15:01 carsonip

@kruskall and I were trying to narrow down the root cause. Here's the evidence of why I believe the regression was caused by #15360 :

  • main with moxy: non-0 events/s: https://github.com/elastic/apm-server/actions/runs/13013064320/job/36295274110
  • main without moxy: 0 events/s: https://github.com/elastic/apm-server/actions/runs/13013066590/job/36295281127
  • main with #15360 reverted, with moxy: non-0 events/s: https://github.com/elastic/apm-server/actions/runs/13013938309/job/36298252171
  • main with #15360 reverted, without moxy: non-0 events/s: https://github.com/elastic/apm-server/actions/runs/13013940799/job/36298261750

carsonip avatar Jan 28 '25 15:01 carsonip

should be fixed by https://github.com/elastic/apm-server/pull/15439

benchmarks using the pr branch are working fine

kruskall avatar Jan 28 '25 19:01 kruskall

PR merged. Closing this

kruskall avatar Jan 29 '25 11:01 kruskall