[meta] Tail-based sampling (TBS) improvements
This is a meta-issue on tail-based sampling.
Tail-based sampling comes up frequently in bug reports, as there is minimal documentation and guidance on TBS configuration. It is not clear to users how TBS works, which leads to misconfigured TBS storage sizes and, consequently, apm-server and ES issues.
When TBS local storage (badger) fills up, writes of sampled traces fail (apm-server logs `received error writing sampled trace: configured storage limit reached (current: 127210377485, limit: 126000000000)`) and TBS is bypassed as the effective sampling rate jumps to 100%. This causes a performance cliff and downstream effects: a surprisingly large increase in writes to ES, which either slows ES down and creates backpressure on apm-server, or results in unexpectedly high storage usage in ES.
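For context, the knob involved is the tail-sampling storage limit in apm-server.yml. A minimal illustrative fragment follows; the values are examples only, picked to mirror the ~126 GB limit in the log line above:

```yaml
# apm-server.yml (illustrative): tail-based sampling with a local storage cap.
# When the local badger DB grows past storage_limit, writes of sampled trace
# events fail with "configured storage limit reached" and TBS is effectively
# bypassed (everything gets indexed).
apm-server:
  sampling:
    tail:
      enabled: true
      storage_limit: 126GB   # example value matching the log line above
      policies:
        - sample_rate: 0.1   # keep ~10% of traces by default
```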
The task list contains tasks to document TBS properly, to investigate and fix bugs, and to provide escape hatches for compromises.
Impact: TBS is a popular feature among heavy apm-server users, who rely on it to reduce ES storage requirements while retaining the value of the sampled traces. We need to ensure, and show, that TBS handles high load well, like the rest of apm-server.
Tasks
- [x] https://github.com/elastic/apm-server/issues/11346
- [x] https://github.com/elastic/apm-server/issues/11127
- [ ] https://github.com/elastic/apm-server/issues/13525
- [x] https://github.com/elastic/apm-server/issues/14923
- [x] https://github.com/elastic/apm-server/issues/11546
- [x] https://github.com/elastic/apm-server/issues/14933
- [x] https://github.com/elastic/apm-server/issues/14996
- [x] https://github.com/elastic/apm-server/issues/15121
- [x] https://github.com/elastic/apm-server/issues/15246
- [x] https://github.com/elastic/apm-server/issues/15500
- [x] https://github.com/elastic/apm-server/issues/14247
- [ ] https://github.com/elastic/apm-server/issues/15330
- [x] https://github.com/elastic/apm-server/issues/14760
- [ESS/ECE only] Ability to see the disk size on Integration Servers on the fly (even better, the live available disk usage) in the Admin Console https://github.com/elastic/cloud/issues/128879
  - Mitigation until then: guide users on how to determine the disk size via documentation pointers
- [ESS priority] Ability to automatically set the TBS max disk usage in the Integration policy as a percentage of the whole disk OR set it automatically to a sane maximum value and freeze it (so the customer cannot exceed the maximum)
- [ESS/ECE and on-premise] Ability to monitor TBS disk-related metrics on self-hosted APM Servers, Integration Servers and Integration Servers in ESS via at least a Dashboard (likely not possible to add new graphs to Stack Monitoring). The dashboards could be shipped with the `apm` input package or with the Elastic Agent
- [ALL] Make sure the necessary metrics are shipped https://github.com/elastic/apm-server/issues/14247 and available for search & aggregations
- [ESS/ECE only] The prerequisite is to enable Metrics shipping via L&M on the deployment. This has to be documented.
- [On-premise] The prerequisite is to put in place a dedicated Metricbeat to monitor the Integration Server, which is odd (a hedged configuration sketch follows this list)
  - It would be great to have this integrated with the monitoring of all the other components via the EA Monitoring collection instead of relying on an external Metricbeat. I do not get why we are able to collect metrics from Filebeat, Metricbeat and other components, but not APM Server.
  - An alternative might be to develop a `beats` integration able to collect monitoring data reusing the Metricbeat module https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-beat.html
  - [DOCS] The instructions we give here are no longer necessary. Since 8.15 it is possible to customize the Elastic Agent policy to set `agent.monitoring.http.enabled` to `true` (but not for the `Elastic Cloud agent policy`) via the dedicated Monitoring settings. See screenshot. Opened issue at https://github.com/elastic/docs-content/issues/1859
- [ALL] Once the metrics are shipped, it would be nice to provide an out-of-the-box alert when the disk is getting full due to TBS or when the soft limit is hit, so users know when APM Server will start letting through all the transactions.
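Until something better exists, a minimal sketch of the dedicated-Metricbeat workaround mentioned above, reusing the `beat` Metricbeat module. The host/port and output are assumptions and need to match the HTTP monitoring endpoint actually exposed by the APM Server / Elastic Agent (e.g. after enabling `agent.monitoring.http.enabled: true`):

```yaml
# metricbeat.yml (sketch only): scrape APM Server monitoring metrics with the
# Metricbeat "beat" module. The host/port below are placeholders; adjust them
# to the monitoring endpoint your APM Server / Elastic Agent exposes.
metricbeat.modules:
  - module: beat
    metricsets: ["stats", "state"]
    period: 10s
    hosts: ["http://localhost:5066"]  # assumed monitoring endpoint
    xpack.enabled: true               # ship in Stack Monitoring format

output.elasticsearch:
  hosts: ["https://monitoring-cluster.example:9200"]  # assumed monitoring cluster
```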
Moving this to it-106 as the TBS changes are still not closed.
Here's my take on which tasks should be handled in the current iteration and which ones could be moved to another iteration if not feasible to tackle them all.
it-106:
- https://github.com/elastic/apm-server/issues/11546
- https://github.com/elastic/apm-server/issues/14933 (part of 9.0)
it-107:
- https://github.com/elastic/apm-server/issues/11346
- https://github.com/elastic/apm-server/issues/13525 (should become part of 9.0, but is not a breaking change)
- https://github.com/elastic/apm-server/issues/14996
- https://github.com/elastic/apm-server/issues/15121
@raultorrecilla some of the subtasks aren't groomed yet; they need to be added to an iteration.
This post has been edited following the comments at https://github.com/elastic/apm-server/issues/14931#issuecomment-2698432593
- TBS storage enhancements
  - Mitigation of TBS storage limit exceeded indefinitely (for APM Server & APM Integration) https://github.com/elastic/apm-server/pull/15106 ✅ (8.16.3 / 8.17.1 / 8.18.0 / 9.*)
  - Publish Known Issue in Support Portal https://support.elastic.co/knowledge/747275ab ✅
  - Allow to be editable via APM Integration Settings https://github.com/elastic/apm-server/issues/13525 ❓ (FUTURE)
  - Document `sampling.tail.ttl` for APM Server standalone & APM Integration ❓ (FUTURE)
- How to monitor TBS
  - Make sure fields are mapped https://github.com/elastic/apm-server/issues/14247 ✅ (8.16.4 / 8.17.2 / 8.18.0 / 9.*)
  - Publish Knowledge Article https://support.elastic.co/knowledge/ed1fd420 ✅
  - Document how to monitor/see the metrics and what they mean https://github.com/elastic/apm-server/issues/14996 (FUTURE)
  - Expose the soft limit as a metric in order to plot it easily https://github.com/elastic/apm-server/issues/15533
- Discard on write failures option `sampling.tail.discard_on_write_failure` when soft limit is reached (see the combined config sketch after this list)
  - Introduce `sampling.tail.discard_on_write_failure` https://github.com/elastic/apm-server/issues/11127 ✅ (8.16.3 / 8.17.1 / 8.18.0 / 9.*)
  - Allow to be editable via APM Integration Settings https://github.com/elastic/apm-server/issues/15330 (FUTURE)
  - Document the setting for both APM Server standalone & Integration https://github.com/elastic/apm-server/issues/15330 (FUTURE)
  - Document what happens when TBS reaches soft limit https://github.com/elastic/apm-server/pull/11663/files ✅
  - Publish known issue https://support.elastic.co/knowledge/7f7c822c ✅
- How to make APM monitoring easier for on-premise APM Integrations
  - Update documentation to make use of the new Fleet settings https://github.com/elastic/observability-docs/pull/4749 (FUTURE)
- Switch from Badger to Pebble
  - Switch & bench migration to Pebble https://github.com/elastic/apm-server/pull/15235 ✅ (9.0.0)
  - Document breaking change in 9.0 https://github.com/elastic/apm-server/issues/15546 (ongoing)
- Make TBS soft limit a % of the disk
  - TBS soft limit threshold set to 0 will default to 80% of the disk where the data directory is located https://github.com/elastic/apm-server/issues/14933 ✅ (9.0.0)
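To make the above concrete, here is a hedged sketch of how these settings fit together in apm-server.yml. The names are the ones discussed in this recap; exact defaults and availability depend on the version, so treat this as an illustration rather than a reference:

```yaml
# apm-server.yml (sketch): the tail-sampling settings discussed in this recap.
apm-server:
  sampling:
    tail:
      enabled: true
      ttl: 30m                         # example value; retention window for events
                                       # awaiting a sampling decision, enforced more
                                       # strictly by the 9.0 (pebble) implementation
      storage_limit: 0                 # 9.0+: 0 falls back to ~80% of the disk
                                       # hosting the data directory
      discard_on_write_failure: false  # per the discussion above: false = index
                                       # everything (bypass TBS) when writes fail;
                                       # true = discard events instead
      policies:
        - sample_rate: 0.1
```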
@lucabelluccini thanks, that pretty much sums it up.
> Introduction of TTL of TBS storage: Introduce it at APM Server standalone https://github.com/elastic/apm-server/pull/15106 ✅ (8.16.3 / 8.17.1 / 8.18.0 / 9.*)
One comment on this first point: `sampling.tail.ttl` has been an undocumented config in apm-server since forever, as has the "TTL" itself. Therefore the "Introduction of TTL" is not entirely accurate.
However, it is possible (actually relatively likely) that TTL isn't strictly enforced in the badger TBS implementation (i.e. any version before 9.0). In the worst case, the failure to enforce TTL caused the storage limit to be exceeded indefinitely. #15106 added a mitigation to address this edge case (storage limit exceeded indefinitely) of an edge case (TTL isn't strictly enforced) by dropping the db, as it won't be valuable after TTL anyway.
Also, this mitigation is not limited to standalone, so "Introduce it at APM Server standalone" is also not accurate.
> Allow to be editable via APM Integration Settings https://github.com/elastic/apm-server/issues/13525 (ongoing)
> Document `sampling.tail.ttl` for APM Server standalone & APM Integration (ongoing)
The other 2 points, on making them configurable in the integration and documenting them, are fair, but my guess is that if we go forward with them, 9.1 will be the earliest version to have them.
Thank you @carsonip for keeping me honest.
The comment I've made is more of a recap, organized by topic, of all the changes we've made, when they ship, and the associated docs.
I'm trying to make sure we capture some of the items above in Knowledge Articles or Known issues.
I've updated the comment. If you find additional errors, feel free to edit them.
The "Introduce it at APM Server standalone" was indeed misleading as I didn't associate sampling.tail.ttl with the mitigation PR. I thought it was another safeguard of TBS.
For sampling.tail.ttl, will it make sense to document it as Pebble (9.0+) will enforce it ( ❓ ) or do we want to keep this setting hidden?
I've also replaced ongoing by FUTURE.
> For `sampling.tail.ttl`, would it make sense to document it, as Pebble (9.0+) will enforce it (❓), or do we want to keep this setting hidden?
In 9.0 (the pebble implementation), users should assume that any trace events ingested at least TTL ago may be deleted. This is the same assumption as in the badger implementation.
However, the difference is that in the 9.0 implementation the lifetime of these entries is bounded by 2*TTL, instead of being unbounded in badger. I'm not so sure we need to document this implementation detail.
What users should care about regarding TTL, and what we should document, is:
- in TBS, if a trace's duration is longer than TTL, the trace may be broken (same in <9.0 and 9.0)
- the storage size requirement scales proportionally to TTL (generally correct in <9.0 and guaranteed in 9.0). We should actually add more detail here and say that storage size scales proportionally to `apm-server trace ingest throughput * TTL` (a rough worked example follows).
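As an illustration of that scaling, a back-of-the-envelope calculation with made-up numbers (throughput, event size and TTL are assumptions, not measurements):

$$
\underbrace{2000\ \tfrac{\text{events}}{\text{s}}}_{\text{ingest throughput}} \times \underbrace{1\ \text{KB}}_{\text{avg. event size}} \times \underbrace{1800\ \text{s}}_{\text{TTL}=30\,\text{min}} \approx 3.6\ \text{GB}
$$

If, as noted above, entries in 9.0 can live up to 2*TTL, the worst-case footprint in this example would be roughly double that, about 7.2 GB.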
Closing this meta issue as all attached issues are closed.