TBS: Add reporting to improve troubleshooting for common issues
Background
There are some common issues that prevent tail-based sampling from performing as expected. Additional reporting/observability can empower users to troubleshoot these issues on their own. Common issues:
- storage limit reached
- traces missing a root transaction
Potential Solutions
Metrics
- Expose a metric to track TBS-related errors. This can be a simple counter with an `error` label which covers the storage limit reached error.
  - An alternate solution has been proposed in https://github.com/elastic/apm-server/issues/17878
- Expose a metric which observes sampling decisions. The idea is to surface a metric that can show a user scenarios where traces are missing a root transaction. This could be a counter incremented each time a transaction group is sampled, or a metric tracking valid unsampled traces that do have a root transaction (see the sketch after this list).
  - The exact metric may depend on the current TBS implementation. We should explore possible solutions as part of this issue.
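As a rough sketch of the first bullet, the error counter could look something like the following, using the OpenTelemetry Go metrics API. The metric name, meter name, and label values are illustrative assumptions, not existing apm-server instrumentation:

```go
package sampling

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// newTBSErrorCounter creates a counter for TBS failures, labelled by error type.
// The metric and label names here are placeholders for illustration only.
func newTBSErrorCounter() (metric.Int64Counter, error) {
	meter := otel.Meter("apm-server/sampling")
	return meter.Int64Counter(
		"apm-server.sampling.tail.errors",
		metric.WithDescription("Number of tail-based sampling errors, by error type"),
	)
}

// recordStorageLimitReached would be called wherever the storage limit error
// is currently handled, so the condition shows up as a labelled counter.
func recordStorageLimitReached(ctx context.Context, counter metric.Int64Counter) {
	counter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("error", "storage_limit_reached"),
	))
}
```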
The storage limit reached error should already be covered by an existing metric, but that metric does not have a label. I'm not sure whether that is enough for this purpose.
I'm currently reading the TBS code, and I have some doubts about this. Traces that never had their root transaction sampled are deleted by TTL, and surfacing them as a metric would mean reading back all received traces every time we perform deletion by TTL (we otherwise use DeleteRange to delete the entire partition, with no reads; see the sketch below). I also assume that to properly troubleshoot TBS we would need to surface the actual trace itself, not just a counter; otherwise the user may not be able to tell what caused the trace not to be sampled.
This seems like a lot of overhead for unclear benefit. I think we need to clearly evaluate what the benefit of doing this would be.
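To illustrate the overhead concern, here is a rough sketch against a hypothetical key-value storage interface (not the real TBS storage API): deleting an expired partition today is a single range delete, whereas counting unsampled traces that have a root transaction would require scanning every entry in the partition first.

```go
package sampling

// storage is a hypothetical key-value interface used only to illustrate the
// cost difference; it does not reflect the actual TBS storage layer.
type storage interface {
	// DeleteRange removes all keys in [start, end) without reading them.
	DeleteRange(start, end []byte) error
	// Scan visits every key/value pair in [start, end).
	Scan(start, end []byte, fn func(key, value []byte) error) error
}

// dropExpiredPartition is roughly what TTL-based deletion does today:
// one range delete, no reads.
func dropExpiredPartition(s storage, start, end []byte) error {
	return s.DeleteRange(start, end)
}

// countUnsampledWithRoot shows what surfacing the proposed metric would imply:
// every stored entry in the partition must be read back and inspected.
func countUnsampledWithRoot(s storage, start, end []byte, hasRootTx func(value []byte) bool) (int, error) {
	var n int
	err := s.Scan(start, end, func(_, value []byte) error {
		if hasRootTx(value) {
			n++
		}
		return nil
	})
	return n, err
}
```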
Sorry for chiming in - I wanted to share my 2c on this one as it crosses the scope of https://github.com/elastic/apm-server/issues/17878.
For now, users are completely blind as to whether TBS is actually working. The only way for them to tell is to:
- Enable the APM Integration logs ingestion (in ECH, for example)
- Look at the APM Integration logs and hope to spot the error

Or:
- Enable the APM Integration monitoring (easy in ECH, slightly more complex on premise)
- Create a custom dashboard to plot the metrics
A "low hanging fruit" would be to at least propagate to Fleet status errors or warnings of APM Server such as:
- the fact that TBS has reached the storage limit (and when it recovers)
- if we really exhausted the storage (e.g. the customer set 100GB of limit but the actual disk available is 30GB)
- the fact there were configuration errors in TBS which prevent TBS to be enabled
- other errors or fallbacks APM Integration might raise
In short, propagating errors to the Fleet state and not just log them. This is something other inputs have implemented over time (AWS S3, Azure Input, Filestream, etc....). It can make the life of users (and support/engineering) easier as we are not obliged to search through logs and be "lucky" the log rotation didn't kick in and threw away the log stating the storage limit was reached
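A minimal sketch of the idea, using a hypothetical statusReporter hook (the real wiring would go through the Elastic Agent control protocol / elastic-agent-client unit states, whose exact API is not shown here):

```go
package sampling

import "fmt"

// statusReporter is a hypothetical hook standing in for whatever mechanism
// APM Server uses to report unit state back to Fleet (healthy/degraded plus
// a human-readable message).
type statusReporter interface {
	UpdateStatus(degraded bool, message string)
}

// reportStorageLimit shows the idea: instead of only logging, surface the
// storage-limit condition (and its recovery) as a Fleet-visible status change.
func reportStorageLimit(r statusReporter, usedBytes, limitBytes uint64) {
	if usedBytes >= limitBytes {
		r.UpdateStatus(true, fmt.Sprintf(
			"tail-based sampling storage limit reached (%d/%d bytes); new traces are not being sampled",
			usedBytes, limitBytes))
		return
	}
	r.UpdateStatus(false, "tail-based sampling storage within limit")
}
```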
For the traces missing a root transaction, I think it's not easy to do, and we can shift focus to it later.
As an aside, I think we already have all the TBS metrics we need, but we do not have any OOTB dashboard to consume them; for this I've opened https://github.com/elastic/apm-server/issues/19396
@ericywl can you investigate how to propagate that status/metric to fleet?
@raultorrecilla I believe that is part of another issue: https://github.com/elastic/apm-server/issues/17878.