
TBS: Add reporting to improve troubleshooting for common issues

Open isaacaflores2 opened this issue 4 months ago • 5 comments

Background

There are some common issues that prevent tail-based sampling from performing as expected. Additional reporting/observability can empower users to troubleshoot these issues on their own. Common issues:

  • storage limit reached
  • traces missing a root transaction

Potential Solutions

Metrics

  • Expose a metric to track TBS-related errors. This can be a simple counter with an error label, which would cover the storage limit reached error.

    • An alternate solution has been proposed in https://github.com/elastic/apm-server/issues/17878
  • Expose a metric which observes sampling decisions. The idea is to surface a metric that shows the user when traces are missing a root transaction. This can be a counter which tracks each time a transaction group is sampled, or a metric which tracks valid unsampled traces that have a root transaction.

    • The exact metric may depend on the current TBS implementation. We should explore possible solutions as part of this issue.
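
As a rough illustration of the "simple counter with an error label" idea, here is a minimal stdlib-only Go sketch. It does not use apm-server's actual metrics code; `ErrorCounter`, its methods, and the `storage_limit_reached` label are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// ErrorCounter is a hypothetical labeled counter: one count per error label.
// It is not an apm-server type.
type ErrorCounter struct {
	mu     sync.Mutex
	counts map[string]int64
}

// NewErrorCounter returns an empty counter ready for use.
func NewErrorCounter() *ErrorCounter {
	return &ErrorCounter{counts: make(map[string]int64)}
}

// Inc adds one to the count for the given error label.
func (c *ErrorCounter) Inc(label string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[label]++
}

// Get returns the current count for the label (0 if never incremented).
func (c *ErrorCounter) Get(label string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}

func main() {
	errs := NewErrorCounter()
	// Hypothetical label; a real implementation would pick its own names.
	errs.Inc("storage_limit_reached")
	errs.Inc("storage_limit_reached")
	fmt.Println("storage_limit_reached:", errs.Get("storage_limit_reached"))
}
```

Keeping the error type as a label rather than a separate metric per error means new TBS error kinds could be added without changing the metric name.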

isaacaflores2 avatar Aug 11 '25 17:08 isaacaflores2

The storage limit reached error should already be covered by an existing metric, but the existing metric does not have a label. Not sure if this is enough for this purpose?

ericywl avatar Nov 26 '25 06:11 ericywl

Currently reading the TBS code, and I have some doubts about this. Traces that do not have a root transaction sampled are deleted by TTL, so surfacing this as a metric would mean reading back all received traces every time we perform deletion by TTL (we otherwise use DeleteRange to delete the entire partition, with no read). I assume that to properly troubleshoot TBS, we would need to surface the actual trace itself instead of just a counter. Otherwise, the user may not be able to tell, for example, what caused the trace not to be sampled.

This seems like a lot of overhead for unclear benefits. I think we need to clearly evaluate what the benefit of doing this is.
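
The overhead concern can be illustrated with a toy model. The Go sketch below is not the TBS storage code: `Store`, `Trace`, and both methods are hypothetical, and an in-memory map stands in for the real partitioned storage. It only contrasts dropping a partition blindly with scanning it first in order to count.

```go
package main

import "fmt"

// Trace is a toy record; Sampled marks whether a sampling decision kept it.
type Trace struct {
	ID      string
	Sampled bool
}

// Store is a toy partitioned store standing in for the real TBS storage.
type Store struct {
	partitions map[int][]Trace
}

// DropPartition discards a whole partition without reading its entries,
// similar in spirit to a range delete.
func (s *Store) DropPartition(id int) {
	delete(s.partitions, id)
}

// DropPartitionCounting reads every entry before deleting, so that it can
// count the traces that expired without being sampled. The scan is the
// extra cost compared to DropPartition.
func (s *Store) DropPartitionCounting(id int) (unsampled int) {
	for _, t := range s.partitions[id] {
		if !t.Sampled {
			unsampled++
		}
	}
	delete(s.partitions, id)
	return unsampled
}

func main() {
	s := &Store{partitions: map[int][]Trace{
		1: {{ID: "a", Sampled: true}, {ID: "b"}, {ID: "c"}},
	}}
	fmt.Println("expired unsampled:", s.DropPartitionCounting(1))
}
```

Even with the scan, the resulting number alone does not say why a given trace was not sampled, which is the second half of the concern above.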

ericywl avatar Nov 27 '25 07:11 ericywl

Sorry for chiming in - I wanted to share my 2c on this one as it overlaps with the scope of https://github.com/elastic/apm-server/issues/17878.

For now, users are completely blind as to whether TBS is actually working. The only way for them to tell is to:

  • Enable the APM Integration logs ingestion (in ECH for example)
  • Look through the APM Integration logs and hope to spot the error

Or:

  • Enable the APM Integration monitoring (easy in ECH, slightly more complex on premise)
  • Create a custom dashboard to plot the metrics

A "low-hanging fruit" would be to at least propagate APM Server errors or warnings to the Fleet status, such as:

  • the fact that TBS has reached the storage limit (and when it recovers)
  • whether the disk itself is exhausted (e.g. the customer set a 100GB limit but only 30GB of disk is actually available)
  • the fact that there were configuration errors in TBS which prevent TBS from being enabled
  • other errors or fallbacks APM Integration might raise

In short, propagate errors to the Fleet state instead of only logging them. This is something other inputs have implemented over time (AWS S3, Azure Input, Filestream, etc.). It can make life easier for users (and support/engineering), since we would no longer have to search through logs and be "lucky" that log rotation didn't kick in and throw away the log line stating the storage limit was reached.
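
To make the "reached the storage limit (and when it recovers)" case concrete, here is a small Go sketch that reports only on state transitions. Everything in it is an assumption for illustration: `Status`, `StorageLimitWatcher`, and the `Report` callback are hypothetical and are not the Fleet or Elastic Agent status API.

```go
package main

import "fmt"

// Status is a hypothetical health state, not an actual Fleet type.
type Status string

const (
	Healthy  Status = "HEALTHY"
	Degraded Status = "DEGRADED"
)

// StorageLimitWatcher forwards a status update only when the storage-limit
// condition changes, so repeated failures do not flood the status channel.
// It is not safe for concurrent use.
type StorageLimitWatcher struct {
	// Report is a hypothetical sink for status updates.
	Report  func(s Status, msg string)
	limited bool
}

// Observe records the latest storage-limit condition and reports on change.
func (w *StorageLimitWatcher) Observe(limitReached bool) {
	if limitReached == w.limited {
		return // no transition, nothing to report
	}
	w.limited = limitReached
	if limitReached {
		w.Report(Degraded, "TBS storage limit reached")
	} else {
		w.Report(Healthy, "TBS storage limit recovered")
	}
}

func main() {
	w := &StorageLimitWatcher{Report: func(s Status, msg string) {
		fmt.Printf("%s: %s\n", s, msg)
	}}
	w.Observe(true)  // transition: reports degraded
	w.Observe(true)  // unchanged: no report
	w.Observe(false) // transition: reports recovery
}
```

Reporting transitions rather than every failed write keeps the status meaningful and also captures the recovery, which a log line rotated away would not.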

For the traces missing a root transaction, I think it's not easy to do, and we can shift focus to it later.

As an aside, I think we have all the TBS metrics we need, but we do not have any OOTB dashboard to consume them. For this I've opened https://github.com/elastic/apm-server/issues/19396

lucabelluccini avatar Nov 27 '25 11:11 lucabelluccini

@ericywl can you investigate how to propagate that status/metric to Fleet?

raultorrecilla avatar Dec 09 '25 16:12 raultorrecilla

@raultorrecilla I believe that is part of another issue: https://github.com/elastic/apm-server/issues/17878.

ericywl avatar Dec 10 '25 08:12 ericywl