APM UI doesn't show connected traces sent from AWS Distro for OTel (Lambda)
APM Server version (apm-server version): 7.13.1
Description of the problem including expected versus actual behavior:
I managed to get distributed tracing working with AWS X-Ray, with the OTel collector picking it up. The flow is curl => API-GW => producer-lambda => API-GW => consumer-lambda. The code (serverless): https://github.com/michaelhyatt/serverless-aws-python-api-worker-otel
The interesting thing is that the 2 transactions and 1 span are recorded correctly with the same trace.id; however, the APM UI doesn't show a proper trace.
AWS X-Ray showing the trace with 2 components:
Discover shows 3 records (2 transactions and 1 span) with the same trace.id:
The APM UI is showing 2 services, but no traces:
Steps to reproduce: as per the README; requires the serverless framework.
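For context, a rough sketch of what the two handlers do (this is a hypothetical simplification of the linked repo; in the real setup the AWS Distro for OTel Lambda layer auto-instruments the handlers and the requests library, so there is no manual instrumentation in the code):

```python
import json
import os

import requests

# URL of the consumer's API Gateway stage; the variable name and default are placeholders.
CONSUMER_URL = os.environ.get(
    "CONSUMER_URL",
    "https://example.execute-api.ap-southeast-2.amazonaws.com/dev/consumer",
)


def producer(event, context):
    # The requests instrumentation injects the trace context into this outgoing call,
    # so the consumer's transaction should get the producer's "HTTP GET" span as its parent.
    response = requests.get(CONSUMER_URL, timeout=10)
    return {"statusCode": 200, "body": json.dumps({"downstream": response.status_code})}


def consumer(event, context):
    # Invoked via API Gateway by the producer's HTTP call above.
    return {"statusCode": 200, "body": json.dumps({"message": "ok"})}
```

So the expected parentage is producer transaction -> HTTP GET span -> consumer transaction, all sharing one trace.id.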
JSON documents
Consumer transaction
{
"_index": "apm-7.13.1-transaction-000001",
"_type": "_doc",
"_id": "TdX7OHoBmWidblu2OYlN",
"_version": 1,
"_score": null,
"fields": {
"transaction.name.text": ["handler.consumer"],
"service.framework.version": ["0.16.dev0"],
"observer.name": ["instance-0000000008"],
"labels.faas_name": ["aws-python-api-worker-project-dev-consumer"],
"service.language.name": ["python"],
"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"],
"transaction.sampled": [true],
"transaction.id": ["1b5693f6e07de9c7"],
"trace.id": ["60d331f41b8d383766d158e55359aaca"],
"processor.event": ["transaction"],
"agent.name": ["opentelemetry/python"],
"labels.faas_execution": ["c7ab9723-9e43-4488-89d0-e1759190b984"],
"event.outcome": ["unknown"],
"cloud.region": ["ap-southeast-2"],
"service.name": ["aws-python-api-worker-project-dev-consumer"],
"service.framework.name": ["opentelemetry.instrumentation.aws_lambda"],
"processor.name": ["transaction"],
"transaction.duration.us": [698],
"labels.faas_version": ["$LATEST"],
"observer.version_major": [7],
"observer.hostname": ["60e5fd4e9651"],
"transaction.type": ["custom"],
"cloud.provider": ["aws"],
"event.ingested": ["2021-06-23T13:07:06.188Z"],
"timestamp.us": [1624453624947661],
"@timestamp": ["2021-06-23T13:07:04.947Z"],
"ecs.version": ["1.8.0"],
"observer.type": ["apm-server"],
"observer.version": ["7.13.1"],
"agent.version": ["1.2.0"],
"parent.id": ["8ae9f83b97976607"],
"transaction.name": ["handler.consumer"]
},
"highlight": {
"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
},
"sort": [1624453624947]
}
Producer HTTP span
{
"_index": "apm-7.13.1-span-000001",
"_type": "_doc",
"_id": "T9X7OHoBmWidblu2OYlN",
"_version": 1,
"_score": null,
"fields": {
"span.destination.service.type": ["external"],
"span.name": ["HTTP GET"],
"destination.port": [443],
"observer.name": ["instance-0000000008"],
"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
"trace.id": ["60d331f41b8d383766d158e55359aaca"],
"span.duration.us": [2256175],
"processor.event": ["span"],
"agent.name": ["opentelemetry/python"],
"destination.address": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"],
"event.outcome": ["success"],
"cloud.region": ["ap-southeast-2"],
"service.name": ["aws-python-api-worker-project-dev-producer"],
"processor.name": ["transaction"],
"labels.faas_version": ["$LATEST"],
"span.id": ["898690921573f428"],
"span.subtype": ["http"],
"observer.version_major": [7],
"observer.hostname": ["60e5fd4e9651"],
"span.type": ["external"],
"cloud.provider": ["aws"],
"timestamp.us": [1624453622779541],
"@timestamp": ["2021-06-23T13:07:02.779Z"],
"ecs.version": ["1.8.0"],
"observer.type": ["apm-server"],
"observer.version": ["7.13.1"],
"agent.version": ["1.2.0"],
"parent.id": ["8b8e0c642d96e5f2"],
"span.destination.service.name": ["https://3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"],
"span.destination.service.resource": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com:443"]
},
"highlight": {
"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
},
"sort": [1624453622779]
}
Producer transaction
{
"_index": "apm-7.13.1-transaction-000001",
"_type": "_doc",
"_id": "TtX7OHoBmWidblu2OYlN",
"_version": 1,
"_score": null,
"fields": {
"transaction.name.text": ["handler.producer"],
"service.framework.version": ["0.21b0"],
"observer.name": ["instance-0000000008"],
"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
"service.language.name": ["python"],
"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"],
"transaction.sampled": [true],
"transaction.id": ["8b8e0c642d96e5f2"],
"trace.id": ["60d331f41b8d383766d158e55359aaca"],
"processor.event": ["transaction"],
"agent.name": ["opentelemetry/python"],
"labels.faas_execution": ["fdcc0a5f-ce57-4201-a631-6097957f4a78"],
"event.outcome": ["unknown"],
"cloud.region": ["ap-southeast-2"],
"service.name": ["aws-python-api-worker-project-dev-producer"],
"service.framework.name": ["opentelemetry.instrumentation.requests"],
"processor.name": ["transaction"],
"transaction.duration.us": [2259030],
"labels.faas_version": ["$LATEST"],
"observer.version_major": [7],
"observer.hostname": ["60e5fd4e9651"],
"transaction.type": ["custom"],
"cloud.provider": ["aws"],
"event.ingested": ["2021-06-23T13:07:06.189Z"],
"timestamp.us": [1624453622778742],
"@timestamp": ["2021-06-23T13:07:02.778Z"],
"ecs.version": ["1.8.0"],
"observer.type": ["apm-server"],
"observer.version": ["7.13.1"],
"agent.version": ["1.2.0"],
"parent.id": ["918243f3ba6e3f71"],
"transaction.name": ["handler.producer"]
},
"highlight": {
"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
},
"sort": [1624453622778]
}
@michaelhyatt would you be able to run the opentelemetry-collector with the logging exporter and capture the logged spans? Then we can compare them to the documents we're recording.
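Something along these lines should do it (a minimal config sketch only; the exporter names, endpoint and token are placeholders, and I'm assuming the embedded ADOT collector can be pointed at a custom config file):

```yaml
# Illustrative collector config -- endpoint and token are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  # Dumps every received span into the collector's log output.
  logging:
    loglevel: debug
  # Keeps forwarding the same spans to APM Server; values below are placeholders.
  otlp/elastic:
    endpoint: "my-apm-server.example.com:8200"
    headers:
      Authorization: "Bearer <secret-token>"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlp/elastic]
```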
One thing I can see is that the consumer transaction's parent.id does not refer to either of the other events. Also, the producer transaction has a parent.id, so neither of the transactions is considered the root, which is why they don't show up under Traces. I'm guessing that the API Gateway is updating traceparent, but there don't appear to be any spans for API Gateway. I'm wildly guessing that API Gateway is sending X-Ray events but not OTel spans.
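One way to confirm this is to search the APM indices for the two orphaned parent IDs directly (a minimal sketch using the elasticsearch Python client; the endpoint is a placeholder, and the IDs are the parent.id values from the documents above):

```python
# Check whether any transaction or span document carries the parent IDs
# referenced by the two transactions above. 0 hits => the parent was never reported.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # placeholder endpoint

orphan_parent_ids = ["8ae9f83b97976607", "918243f3ba6e3f71"]

for parent_id in orphan_parent_ids:
    result = es.search(
        index="apm-*",
        body={
            "query": {
                "bool": {
                    "should": [
                        {"term": {"transaction.id": parent_id}},
                        {"term": {"span.id": parent_id}},
                    ]
                }
            }
        },
    )
    print(parent_id, result["hits"]["total"]["value"], "matching event(s)")
```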
@axw I have the debug logs of the embedded OTel collector that recorded the messages it sends; 2 logs are attached. Here is an example of the debug output for the span that the consumer reports:
2021-06-23T23:00:34.750Z DEBUG loggingexporter/logging_exporter.go:48 ResourceSpans #0
Resource labels:
-> telemetry.sdk.language: STRING(python)
-> telemetry.sdk.name: STRING(opentelemetry)
-> telemetry.sdk.version: STRING(1.2.0)
-> cloud.region: STRING(ap-southeast-2)
-> cloud.provider: STRING(aws)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
-> service.name: STRING(aws-python-api-worker-project-dev-consumer)
InstrumentationLibrarySpans #0
InstrumentationLibrary opentelemetry.instrumentation.aws_lambda 0.16.dev0
Span #0
Trace ID : 60d3bd0d1c71895e0786483748de7acf
Parent ID : 84ecb3d0f1da98b3
ID : ebd0715a456a282b
Name : handler.consumer
Kind : SPAN_KIND_SERVER
Start time : 2021-06-23 23:00:34.733421785 +0000 UTC
End time : 2021-06-23 23:00:34.734093944 +0000 UTC
Status code : STATUS_CODE_UNSET
Status message :
Attributes:
-> faas.execution: STRING(6c00c38a-b1fe-4aac-955e-345a91831150)
-> faas.id: STRING(arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
END RequestId: 6c00c38a-b1fe-4aac-955e-345a91831150
@michaelhyatt Can you try to disable X-Ray tracing in the API Gateway (https://docs.aws.amazon.com/xray/latest/devguide/xray-services-apigateway.html) and see whether it solves the observed problem?
@AlexanderWert I disabled the API Gateway tracing and now API Gateway no longer appears on the X-Ray service map or waterfall diagram:
Kibana APM UI still doesn't show the trace.
These are the documents recorded in Elasticsearch.
Consumer transaction
{
"_index": "apm-7.13.2-transaction-000001",
"_type": "_doc",
"_id": "D-ToRXoB6AbCXZTTT6Dt",
"_version": 1,
"_score": null,
"fields": {
"transaction.name.text": ["handler.consumer"],
"service.framework.version": ["0.16.dev0"],
"observer.name": ["instance-0000000009"],
"labels.faas_name": ["aws-python-api-worker-project-dev-consumer"],
"service.language.name": ["python"],
"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"],
"transaction.sampled": [true],
"transaction.id": ["8d38b2d9bee1c20f"],
"trace.id": ["60d681143464a3ff1c61880768907b0a"],
"processor.event": ["transaction"],
"agent.name": ["opentelemetry/python"],
"labels.faas_execution": ["4bc4f6f3-71e3-4380-91d7-5277a79be05e"],
"event.outcome": ["unknown"],
"cloud.region": ["ap-southeast-2"],
"service.name": ["aws-python-api-worker-project-dev-consumer"],
"service.framework.name": ["opentelemetry.instrumentation.aws_lambda"],
"processor.name": ["transaction"],
"transaction.duration.us": [605],
"labels.faas_version": ["$LATEST"],
"observer.version_major": [7],
"observer.hostname": ["1e0706f92f11"],
"transaction.type": ["custom"],
"cloud.provider": ["aws"],
"event.ingested": ["2021-06-26T01:21:30.604Z"],
"timestamp.us": [1624670489577889],
"@timestamp": ["2021-06-26T01:21:29.577Z"],
"ecs.version": ["1.8.0"],
"observer.type": ["apm-server"],
"observer.version": ["7.13.2"],
"agent.version": ["1.2.0"],
"parent.id": ["e53f6b191e1d0281"],
"transaction.name": ["handler.consumer"]
},
"highlight": {
"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
},
"sort": [1624670489577]
}
Producer HTTP span
{
"_index": "apm-7.13.2-span-000001",
"_type": "_doc",
"_id": "EeToRXoB6AbCXZTTT6Dt",
"_version": 1,
"_score": null,
"fields": {
"span.destination.service.type": ["external"],
"span.name": ["HTTP GET"],
"destination.port": [443],
"observer.name": ["instance-0000000009"],
"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
"trace.id": ["60d681143464a3ff1c61880768907b0a"],
"span.duration.us": [2386427],
"processor.event": ["span"],
"agent.name": ["opentelemetry/python"],
"destination.address": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"],
"event.outcome": ["success"],
"cloud.region": ["ap-southeast-2"],
"service.name": ["aws-python-api-worker-project-dev-producer"],
"processor.name": ["transaction"],
"labels.faas_version": ["$LATEST"],
"span.id": ["e5edc4fb3045615a"],
"span.subtype": ["http"],
"observer.version_major": [7],
"observer.hostname": ["1e0706f92f11"],
"span.type": ["external"],
"cloud.provider": ["aws"],
"timestamp.us": [1624670487255663],
"@timestamp": ["2021-06-26T01:21:27.255Z"],
"ecs.version": ["1.8.0"],
"observer.type": ["apm-server"],
"observer.version": ["7.13.2"],
"agent.version": ["1.2.0"],
"parent.id": ["a64ec14bc1536181"],
"span.destination.service.name": ["https://bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"],
"span.destination.service.resource": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com:443"]
},
"highlight": {
"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
},
"sort": [1624670487255]
}
Producer transaction
{
"_index": "apm-7.13.2-transaction-000001",
"_type": "_doc",
"_id": "EOToRXoB6AbCXZTTT6Dt",
"_version": 1,
"_score": null,
"fields": {
"transaction.name.text": ["handler.producer"],
"service.framework.version": ["0.21b0"],
"observer.name": ["instance-0000000009"],
"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
"service.language.name": ["python"],
"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"],
"transaction.sampled": [true],
"transaction.id": ["a64ec14bc1536181"],
"trace.id": ["60d681143464a3ff1c61880768907b0a"],
"processor.event": ["transaction"],
"agent.name": ["opentelemetry/python"],
"labels.faas_execution": ["4580f63e-2761-408e-99f3-046cb774d7e3"],
"event.outcome": ["unknown"],
"cloud.region": ["ap-southeast-2"],
"service.name": ["aws-python-api-worker-project-dev-producer"],
"service.framework.name": ["opentelemetry.instrumentation.requests"],
"processor.name": ["transaction"],
"transaction.duration.us": [2389360],
"labels.faas_version": ["$LATEST"],
"observer.version_major": [7],
"observer.hostname": ["1e0706f92f11"],
"transaction.type": ["custom"],
"cloud.provider": ["aws"],
"event.ingested": ["2021-06-26T01:21:30.605Z"],
"timestamp.us": [1624670487254602],
"@timestamp": ["2021-06-26T01:21:27.254Z"],
"ecs.version": ["1.8.0"],
"observer.type": ["apm-server"],
"observer.version": ["7.13.2"],
"agent.version": ["1.2.0"],
"parent.id": ["7ade6ef5e6220237"],
"transaction.name": ["handler.producer"]
},
"highlight": {
"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
},
"sort": [1624670487254]
}
@michaelhyatt I've transferred this to elastic/apm as (AFAICS) there's no bug in the translation of OpenTelemetry spans -- the server is translating them faithfully. It appears that API GW is injecting/mutating traceparent, but is not sending any OTel spans. When we receive an OTel span for a Lambda, it has a parent span ID which points nowhere.
That missing API GW span would theoretically be the trace root, and we base certain features, such as the list of traces, on trace roots. If API GW spans were sent to APM Server, then I expect it would all just work. This looks like missing/broken functionality in API GW to me.
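Conceptually, the Traces list is driven by root transactions, i.e. transactions with no parent.id, and neither transaction in this trace qualifies. A quick way to see that (a sketch only, not the actual APM UI query; the endpoint is a placeholder and the trace.id is taken from the latest documents):

```python
# Look for a root transaction (no parent.id) in the trace above; 0 hits means
# nothing will appear in the Traces list for this trace.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # placeholder endpoint

result = es.search(
    index="apm-*",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"processor.event": "transaction"}},
                    {"term": {"trace.id": "60d681143464a3ff1c61880768907b0a"}},
                ],
                "must_not": [{"exists": {"field": "parent.id"}}],
            }
        }
    },
)
print(result["hits"]["total"]["value"])  # expected: 0
```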
@axw The latest test was done with API-GW tracing turned off. Are you saying API-GW is still changing the trace header even though it is no longer part of the X-Ray traces?
Yes, it looks like API GW is still adding a traceparent header before both Lambdas. The producer transaction has parent.id: 7ade6ef5e6220237, and the consumer transaction has parent.id: e53f6b191e1d0281. There are no transactions or spans with these IDs.
Yes, it looks like API GW is still adding a traceparent header before both Lambdas.
Could it be the Lambda extension that is adding the traceparent (not sending OTel spans itself, but rather collecting X-Ray-specific spans)? For instance, for the Lambda initialization phase, as shown in the screenshot above.
Possibly, I don't know. I should have said X-Ray header anyway, since there are no W3C traceparent headers involved here. The bottom line is that we're being told the producer span has a parent which does not exist, and so none of the spans is considered to be the trace root.
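For illustration, the X-Ray trace header the Lambda runtime exposes to the function (via the _X_AMZN_TRACE_ID environment variable) looks roughly like this; the value below is reconstructed from the trace.id and the producer's orphan parent.id in the documents above, purely to show where such a phantom parent ID could come from:

```python
# Sketch of how an X-Ray trace header decomposes; the value is reconstructed for illustration.
header = "Root=1-60d68114-3464a3ff1c61880768907b0a;Parent=7ade6ef5e6220237;Sampled=1"

fields = dict(part.split("=", 1) for part in header.split(";"))

# Root carries the trace ID, Parent is the upstream segment ID. If that upstream
# segment is only reported to X-Ray and never exported as an OTel span, the
# downstream transaction ends up with a parent.id matching no document in Elasticsearch.
print(fields["Root"])     # 1-60d68114-3464a3ff1c61880768907b0a
print(fields["Parent"])   # 7ade6ef5e6220237
print(fields["Sampled"])  # 1
```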
I come to the same conclusion:

| Type (Transaction / Span) | id | parent.id | timestamp.us | name | service.name |
|---|---|---|---|---|---|
| T | a64ec14bc1536181 | 7ade6ef5e6220237 | 1624670487254602 | handler.producer | aws-python-api-worker-project-dev-producer |
| S | e5edc4fb3045615a | a64ec14bc1536181 | 1624670487255663 | HTTP GET | aws-python-api-worker-project-dev-producer |
| T | 8d38b2d9bee1c20f | e53f6b191e1d0281 | 1624670489577889 | handler.consumer | aws-python-api-worker-project-dev-consumer |
The parent.id of the consumer/downstream transaction is not the span.id of the parent HTTP GET span, as it should be. It is e53f6b191e1d0281 instead of e5edc4fb3045615a.
I tested this setup with Jaeger and it handles this case quite well, although it displayed warnings about the missing parents.