Reconfiguring tail-based sampling leads to event loss

Open • axw opened this issue 2 years ago • 2 comments

APM Server version (apm-server version): 8.3.0 (main)

Description of the problem including expected versus actual behavior:

When APM Server has tail-based sampling (TBS) enabled and its policies are modified or TBS is disabled, the hot-reload of APM Server's configuration leads to events being dropped.

When the server is reconfigured, it will internally start a new processor pipeline and a listener which sends requests through to that pipeline. It should then close the old network listener, wait for existing HTTP requests to be processed, and then stop the existing processor pipeline.
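To make the intended ordering concrete, here is a minimal sketch in plain Go. It is not APM Server's actual code: `pipeline`, `serve`, `retire`, and `demo` are hypothetical names, and the sketch only covers retiring the old listener and pipeline, not starting the new ones. It relies on `http.Server.Shutdown` from the standard library, which closes the listener and then waits for active requests to complete.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"sync"
	"time"
)

var errStopped = errors.New("processor is stopped")

// pipeline is a hypothetical stand-in for the processor pipeline.
type pipeline struct {
	mu      sync.Mutex
	stopped bool
}

// Process rejects events once the pipeline has been stopped.
func (p *pipeline) Process() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.stopped {
		return errStopped
	}
	return nil
}

func (p *pipeline) Stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.stopped = true
}

// serve starts an HTTP server on a random local port that forwards every
// request to p. It signals on inFlight once a request reaches the handler.
func serve(p *pipeline, inFlight chan<- struct{}) (*http.Server, string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, "", err
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inFlight <- struct{}{}
		time.Sleep(50 * time.Millisecond) // simulate a slow request
		if err := p.Process(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})}
	go srv.Serve(ln)
	return srv, ln.Addr().String(), nil
}

// retire closes the old listener, drains in-flight requests, and only then
// stops the old pipeline.
func retire(ctx context.Context, srv *http.Server, p *pipeline) error {
	if err := srv.Shutdown(ctx); err != nil {
		return err
	}
	p.Stop()
	return nil
}

// demo sends one slow request, retires the server while the request is
// in flight, and returns the HTTP status the client received.
func demo() (int, error) {
	p := &pipeline{}
	inFlight := make(chan struct{}, 1)
	srv, addr, err := serve(p, inFlight)
	if err != nil {
		return 0, err
	}
	status, errc := make(chan int, 1), make(chan error, 1)
	go func() {
		resp, err := http.Get("http://" + addr + "/")
		if err != nil {
			errc <- err
			return
		}
		resp.Body.Close()
		status <- resp.StatusCode
	}()
	select {
	case <-inFlight: // the request has reached the handler
	case err := <-errc:
		return 0, err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := retire(ctx, srv, p); err != nil {
		return 0, err
	}
	select {
	case s := <-status:
		return s, nil
	case err := <-errc:
		return 0, err
	}
}

func main() {
	fmt.Println(demo())
}
```

With this ordering the in-flight request should be accepted rather than rejected, because `Stop` cannot run until `Shutdown` has returned.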

During the reconfiguration, we're seeing the error "processor is stopped", which is logged when events are processed by a stopped tail-based sampling processor. This suggests that the processor is being stopped before all in-flight events have been processed.
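The suspected ordering problem can be illustrated with a small, deterministic simulation. This is only a sketch under assumptions: `processor` and `countDropped` are hypothetical names, not APM Server's types, and the in-flight events are modelled as goroutines blocked on a channel.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errStopped = errors.New("processor is stopped")

// processor is a hypothetical stand-in for the tail-based sampling processor.
type processor struct {
	mu      sync.Mutex
	stopped bool
}

// Process rejects events once the processor has been stopped.
func (p *processor) Process() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.stopped {
		return errStopped
	}
	return nil
}

func (p *processor) Stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.stopped = true
}

// countDropped starts n in-flight events and returns how many of them were
// rejected with errStopped, depending on the shutdown ordering.
func countDropped(n int, drainBeforeStop bool) int {
	p := &processor{}
	release := make(chan struct{})
	var wg sync.WaitGroup
	var mu sync.Mutex
	dropped := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // the event stays in flight until released
			if err := p.Process(); errors.Is(err, errStopped) {
				mu.Lock()
				dropped++
				mu.Unlock()
			}
		}()
	}
	if drainBeforeStop {
		close(release) // let in-flight events finish
		wg.Wait()
		p.Stop()
	} else {
		p.Stop() // stop first: every in-flight event is rejected
		close(release)
		wg.Wait()
	}
	return dropped
}

func main() {
	fmt.Println("stop before drain:", countDropped(100, false))
	fmt.Println("drain before stop:", countDropped(100, true))
}
```

In this model, stopping before draining rejects every in-flight event, while draining first rejects none, which matches the symptom described above.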

Steps to reproduce:

  1. Run apm-server with tail-based sampling enabled, with a single catch-all policy: [{"sample_rate":0.5}]
  2. Send a continuous stream of transactions
  3. Disable tail-based sampling
  4. Observe "processor is stopped" in the logs, and a drop in the rate of transaction docs

axw • Jun 08 '22 06:06

Re-opening, as we haven't written a regression test for the behavior.

stuartnelson3 • Jun 20 '22 08:06

The working branch can be found here: https://github.com/elastic/apm-server/compare/main...marclop:apm-server:add-test-for-eventloss-tbs-reconfigure?expand=1. I've commented the code thoroughly.

marclop • Jul 19 '22 09:07