apm-agent-java
apm-agent-java copied to clipboard
span composite count must be > 1
Describe the bug
From the APM Agents spec (https://github.com/elastic/apm/blob/main/specs/agents/handling-huge-traces/tracing-spans-compress.md?plain=1#L137-L138):
composite
-count
: The number of compressed spans this composite span represents. The minimum count is 2 as a composite span represents at least two spans.
It looks like the java agent is reporting composite spans with a lower count. The following error was observed in ecs logs:
validation error: span: composite: 'count': validation rule 'min(2)' violated
Steps to reproduce
I don't have those unfortunately :(
This is part of the ops KPI review
Expected behavior
Composite count must be > 1
Debug logs
validation error: span: composite: 'count': validation rule 'min(2)' violated
By looking through the code, this seems to only be possible by a very specific race-condition. Given the following trace scenario:
- Transaction parent
- Span DB1 (compression egligable)
- Span DB2 (compression egligable)
- Span Async1 (not compression egligable)
Now the following things need to happen:
-
DB1
is ended. It is not sent to the APM-server but buffered onparent
for future arriving compressed spans -
DB2
andAsync1
are ended in parallel at the same time a.DB2
initializes the compression-datastruction onDB1
this causescount=1
b.Async1
sees that it cannot compress withDB1
and causesDB1
to be reported immediately c.DB2
increments thecount
onDB1
due to compression. This doesn't have any effect becauseDB1
is already sent.
I think what we could do is guard the compression process with a WriterReaderPhaser
: During compression DB2
would register itself as Writer
on DB1
before performing any mutations. The reporting-thread would swap the phase before reporting a span, waiting for pending compressions to end.
This however opens up additional things to consider: what do we do when a span attempts compression, but the phase of the WriterReaderPhaser
indicates that the span he wants to compress with has already been sent? E.g. retry a fixed amount of times?
I'd love to have some input from @elastic/apm-agent-java (and maybe @felixbarny and @eyalkoren as original authors) here. Is it known (and accepted) that the span-compression is prone to lost-update race conditions?
As a bandaid fix to prevent spans from being rejected by the APM-server I'll add a guard to the serialization logic so that we don't report spans with count=1
as compressed.
Span compression is purely for improved UI and I don't think it's worth a lot of effort to handle edge cases in high contention cases other than to just drop the odd compression rather than add load. So I think the proposed band aid is sufficient