ci: system tests failing due to disk usage limit reached

kruskall opened this issue 4 months ago

APM Server version (apm-server version):

Description of the problem including expected versus actual behavior:

Recently, apm-server CI has been failing more often. The apparent cause is a flaky test (https://github.com/elastic/apm-server/issues/19198), but the failure rate has spiked to the point that it is almost impossible to merge a PR.

On further investigation, it seems we are reaching the disk usage limit. This causes apm-server to bypass TBS (tail-based sampling) and index events by default, so the TBS test fails.

The ubuntu-latest runner image is already significantly bloated and sits at ~75% disk usage (note: we set the TBS disk usage limit to ~80% of available disk).
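To check how close a runner already is to that limit before the system tests start, a minimal sketch using `syscall.Statfs` (Linux/Unix only) is below. The path "/" is an assumption: the apm-server data directory used by TBS may live on a different mount. The ratio is used blocks over total blocks, which may differ slightly from the percentage `df` prints, and it is not necessarily how apm-server itself measures usage.

```go
package main

import (
	"fmt"
	"syscall"
)

// diskUsage returns the fraction of blocks in use on the filesystem
// containing path, using statfs(2).
func diskUsage(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	if st.Blocks == 0 {
		return 0, fmt.Errorf("statfs %s: zero total blocks", path)
	}
	used := st.Blocks - st.Bfree
	return float64(used) / float64(st.Blocks), nil
}

func main() {
	const threshold = 0.80 // TBS disk usage threshold mentioned in the issue
	frac, err := diskUsage("/")
	if err != nil {
		panic(err)
	}
	fmt.Printf("disk usage on /: %.1f%% (threshold %.0f%%, headroom %.1f%%)\n",
		100*frac, 100*threshold, 100*(threshold-frac))
}
```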

Steps to reproduce:


1. Open a PR.
2. Watch the system test fail.

Provide logs (if relevant):

{"log.level":"warn","@timestamp":"2025-11-20T14:43:33.872Z","log.logger":"beater.sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).ProcessBatch","file.name":"sampling/processor.go","file.line":127},"message":"processing trace failed, indexing by default","service.name":"apm-server","error":{"message":"disk usage threshold 0.80: configured limit reached (current: 63403094016, limit: 61509723750)"},"ecs.version":"1.6.0"}
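The two byte counts in this log line show how far past the threshold the runner was. The sketch below extracts them with a regular expression written for this specific message wording (an assumption) and, again assuming limit = 0.80 × total disk, derives the implied disk size and usage ratio. For this log that works out to a disk of roughly 76.9 GB with about 82.5% in use, i.e. a few percent above the threshold.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// usagePattern matches the byte counts in the sampling processor's error
// message; it assumes the exact "(current: N, limit: M)" wording seen in the log.
var usagePattern = regexp.MustCompile(`current: (\d+), limit: (\d+)`)

// parseUsage returns the current and limit byte counts from the message.
func parseUsage(msg string) (current, limit uint64, err error) {
	m := usagePattern.FindStringSubmatch(msg)
	if m == nil {
		return 0, 0, fmt.Errorf("no usage figures in %q", msg)
	}
	if current, err = strconv.ParseUint(m[1], 10, 64); err != nil {
		return 0, 0, err
	}
	if limit, err = strconv.ParseUint(m[2], 10, 64); err != nil {
		return 0, 0, err
	}
	return current, limit, nil
}

func main() {
	msg := "disk usage threshold 0.80: configured limit reached (current: 63403094016, limit: 61509723750)"
	current, limit, err := parseUsage(msg)
	if err != nil {
		panic(err)
	}
	const threshold = 0.80
	// Assumption: limit = threshold * total, so total = limit / threshold.
	total := float64(limit) / threshold
	fmt.Printf("implied disk size: %.1f GB\n", total/1e9)
	fmt.Printf("current usage: %.1f%% of disk\n", 100*float64(current)/total)
}
```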

Nov 20 '25 19:11 kruskall