
APM UI doesn't show connected traces sent from AWS Distro for OTel (Lambda)

Open michaelhyatt opened this issue 3 years ago • 11 comments

APM Server version (apm-server version): 7.13.1

Description of the problem including expected versus actual behavior: I managed to get distributed tracing going with AWS X-Ray, with the OTel collector picking it up. The flow is curl => API-GW => producer-lambda => API-GW => consumer-lambda. The code (serverless): https://github.com/michaelhyatt/serverless-aws-python-api-worker-otel

The interesting thing is that the 2 transactions and 1 span are recorded correctly with the same trace.id; however, the APM UI doesn't show them as a proper trace.

AWS X-Ray showing the trace with 2 components: (screenshot)

Discover shows 3 records, 2 transactions and 1 span with the same trace.id: (screenshot)

The APM UI is showing 2 services, but no traces: (screenshot)

Steps to reproduce: As per README, requires the serverless framework.
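
For orientation, here is a stripped-down sketch of what the two handlers amount to (the real code is in the repo linked above; the environment variable and return values are illustrative assumptions, and the instrumentation is applied by the ADOT Lambda layer rather than by the handler code itself):

# Illustrative sketch only; the actual handlers live in the linked repo.
# opentelemetry.instrumentation.aws_lambda wraps each handler (producing the
# "handler.producer" / "handler.consumer" transactions) and
# opentelemetry.instrumentation.requests instruments the outbound call
# (the "HTTP GET" span).
import os
import requests

# Hypothetical environment variable pointing at the consumer's API-GW endpoint.
CONSUMER_URL = os.environ.get("CONSUMER_URL", "")

def producer(event, context):
    # The instrumented requests call becomes the child "HTTP GET" span and
    # propagates the trace context on the outgoing request to API-GW.
    resp = requests.get(CONSUMER_URL, timeout=10)
    return {"statusCode": 200, "body": resp.text}

def consumer(event, context):
    # The aws_lambda instrumentation extracts the incoming trace context from
    # the invocation and uses it as this transaction's parent.
    return {"statusCode": 200, "body": "consumed"}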

JSON documents

Consumer transaction

{
	"_index": "apm-7.13.1-transaction-000001",
	"_type": "_doc",
	"_id": "TdX7OHoBmWidblu2OYlN",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.consumer"],
		"service.framework.version": ["0.16.dev0"],
		"observer.name": ["instance-0000000008"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-consumer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"],
		"transaction.sampled": [true],
		"transaction.id": ["1b5693f6e07de9c7"],
		"trace.id": ["60d331f41b8d383766d158e55359aaca"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["c7ab9723-9e43-4488-89d0-e1759190b984"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-consumer"],
		"service.framework.name": ["opentelemetry.instrumentation.aws_lambda"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [698],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["60e5fd4e9651"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-23T13:07:06.188Z"],
		"timestamp.us": [1624453624947661],
		"@timestamp": ["2021-06-23T13:07:04.947Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.1"],
		"agent.version": ["1.2.0"],
		"parent.id": ["8ae9f83b97976607"],
		"transaction.name": ["handler.consumer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
	},
	"sort": [1624453624947]
}

Producer HTTP span

{
	"_index": "apm-7.13.1-span-000001",
	"_type": "_doc",
	"_id": "T9X7OHoBmWidblu2OYlN",
	"_version": 1,
	"_score": null,
	"fields": {
		"span.destination.service.type": ["external"],
		"span.name": ["HTTP GET"],
		"destination.port": [443],
		"observer.name": ["instance-0000000008"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"trace.id": ["60d331f41b8d383766d158e55359aaca"],
		"span.duration.us": [2256175],
		"processor.event": ["span"],
		"agent.name": ["opentelemetry/python"],
		"destination.address": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"],
		"event.outcome": ["success"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"processor.name": ["transaction"],
		"labels.faas_version": ["$LATEST"],
		"span.id": ["898690921573f428"],
		"span.subtype": ["http"],
		"observer.version_major": [7],
		"observer.hostname": ["60e5fd4e9651"],
		"span.type": ["external"],
		"cloud.provider": ["aws"],
		"timestamp.us": [1624453622779541],
		"@timestamp": ["2021-06-23T13:07:02.779Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.1"],
		"agent.version": ["1.2.0"],
		"parent.id": ["8b8e0c642d96e5f2"],
		"span.destination.service.name": ["https://3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"],
		"span.destination.service.resource": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com:443"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
	},
	"sort": [1624453622779]
}

Producer transaction

{
	"_index": "apm-7.13.1-transaction-000001",
	"_type": "_doc",
	"_id": "TtX7OHoBmWidblu2OYlN",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.producer"],
		"service.framework.version": ["0.21b0"],
		"observer.name": ["instance-0000000008"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"],
		"transaction.sampled": [true],
		"transaction.id": ["8b8e0c642d96e5f2"],
		"trace.id": ["60d331f41b8d383766d158e55359aaca"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["fdcc0a5f-ce57-4201-a631-6097957f4a78"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"service.framework.name": ["opentelemetry.instrumentation.requests"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [2259030],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["60e5fd4e9651"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-23T13:07:06.189Z"],
		"timestamp.us": [1624453622778742],
		"@timestamp": ["2021-06-23T13:07:02.778Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.1"],
		"agent.version": ["1.2.0"],
		"parent.id": ["918243f3ba6e3f71"],
		"transaction.name": ["handler.producer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
	},
	"sort": [1624453622778]
}

michaelhyatt avatar Jun 24 '21 03:06 michaelhyatt

@michaelhyatt would you be able to run the opentelemetry-collector with the logging exporter and capture the logged spans? Then we can compare them to the docs we're recording.

One thing I can see is that the consumer transaction's parent.id does not refer to either of the other events. Also, the producer transaction has a parent.id, so neither of the transactions is considered the root, which is why they don't show up under Traces. I'm guessing that the API Gateway is updating traceparent, but there don't appear to be any spans for API Gateway. I'm wildly guessing that API Gateway is sending X-Ray events but not OTel spans.
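
A rough way to double-check this against the indexed documents is a small script along these lines (sketch only; the cluster address, index pattern and trace id are placeholders to adjust):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # placeholder cluster address
trace_id = "60d331f41b8d383766d158e55359aaca"

resp = es.search(index="apm-*", body={"size": 1000, "query": {"term": {"trace.id": trace_id}}})
events = [hit["_source"] for hit in resp["hits"]["hits"]]

# Collect every transaction/span id we actually have for this trace...
known_ids = set()
for e in events:
    if e["processor"]["event"] == "transaction":
        known_ids.add(e["transaction"]["id"])
    elif e["processor"]["event"] == "span":
        known_ids.add(e["span"]["id"])

# ...then flag parent.ids that point at nothing, and events with no parent (trace roots).
for e in events:
    parent = e.get("parent", {}).get("id")
    if parent is None:
        print("trace root:", e["processor"]["event"])
    elif parent not in known_ids:
        print("orphan parent.id", parent, "on", e["processor"]["event"])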

axw avatar Jun 24 '21 08:06 axw

@axw I have the debug logs of the embedded OTel collector that recorded the messages it sends. 2 logs are attached. Here is an example of the debug output for the span that the consumer reports:



2021-06-23T23:00:34.750Z	DEBUG	loggingexporter/logging_exporter.go:48	ResourceSpans #0
Resource labels:
-> telemetry.sdk.language: STRING(python)
-> telemetry.sdk.name: STRING(opentelemetry)
-> telemetry.sdk.version: STRING(1.2.0)
-> cloud.region: STRING(ap-southeast-2)
-> cloud.provider: STRING(aws)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
-> service.name: STRING(aws-python-api-worker-project-dev-consumer)
InstrumentationLibrarySpans #0
InstrumentationLibrary opentelemetry.instrumentation.aws_lambda 0.16.dev0
Span #0
Trace ID       : 60d3bd0d1c71895e0786483748de7acf
Parent ID      : 84ecb3d0f1da98b3
ID             : ebd0715a456a282b
Name           : handler.consumer
Kind           : SPAN_KIND_SERVER
Start time     : 2021-06-23 23:00:34.733421785 +0000 UTC
End time       : 2021-06-23 23:00:34.734093944 +0000 UTC
Status code    : STATUS_CODE_UNSET
Status message :
Attributes:
-> faas.execution: STRING(6c00c38a-b1fe-4aac-955e-345a91831150)
-> faas.id: STRING(arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
 
END RequestId: 6c00c38a-b1fe-4aac-955e-345a91831150

consumer.txt producer.txt

michaelhyatt avatar Jun 24 '21 11:06 michaelhyatt

@michaelhyatt Can you try to disable X-Ray tracing in the API Gateway (https://docs.aws.amazon.com/xray/latest/devguide/xray-services-apigateway.html) and see whether it solves the observed problem?

AlexanderWert avatar Jun 24 '21 12:06 AlexanderWert

@AlexanderWert I disabled API Gateway tracing, and now it doesn't appear in the X-Ray service map or waterfall diagram: (screenshots)

Kibana APM UI still doesn't show the trace.

These are the documents recorded in Elasticsearch.

Consumer transaction

{
	"_index": "apm-7.13.2-transaction-000001",
	"_type": "_doc",
	"_id": "D-ToRXoB6AbCXZTTT6Dt",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.consumer"],
		"service.framework.version": ["0.16.dev0"],
		"observer.name": ["instance-0000000009"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-consumer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"],
		"transaction.sampled": [true],
		"transaction.id": ["8d38b2d9bee1c20f"],
		"trace.id": ["60d681143464a3ff1c61880768907b0a"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["4bc4f6f3-71e3-4380-91d7-5277a79be05e"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-consumer"],
		"service.framework.name": ["opentelemetry.instrumentation.aws_lambda"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [605],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["1e0706f92f11"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-26T01:21:30.604Z"],
		"timestamp.us": [1624670489577889],
		"@timestamp": ["2021-06-26T01:21:29.577Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.2"],
		"agent.version": ["1.2.0"],
		"parent.id": ["e53f6b191e1d0281"],
		"transaction.name": ["handler.consumer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
	},
	"sort": [1624670489577]
}

Producer span

{
	"_index": "apm-7.13.2-span-000001",
	"_type": "_doc",
	"_id": "EeToRXoB6AbCXZTTT6Dt",
	"_version": 1,
	"_score": null,
	"fields": {
		"span.destination.service.type": ["external"],
		"span.name": ["HTTP GET"],
		"destination.port": [443],
		"observer.name": ["instance-0000000009"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"trace.id": ["60d681143464a3ff1c61880768907b0a"],
		"span.duration.us": [2386427],
		"processor.event": ["span"],
		"agent.name": ["opentelemetry/python"],
		"destination.address": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"],
		"event.outcome": ["success"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"processor.name": ["transaction"],
		"labels.faas_version": ["$LATEST"],
		"span.id": ["e5edc4fb3045615a"],
		"span.subtype": ["http"],
		"observer.version_major": [7],
		"observer.hostname": ["1e0706f92f11"],
		"span.type": ["external"],
		"cloud.provider": ["aws"],
		"timestamp.us": [1624670487255663],
		"@timestamp": ["2021-06-26T01:21:27.255Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.2"],
		"agent.version": ["1.2.0"],
		"parent.id": ["a64ec14bc1536181"],
		"span.destination.service.name": ["https://bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"],
		"span.destination.service.resource": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com:443"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
	},
	"sort": [1624670487255]
}

Producer transaction

{
	"_index": "apm-7.13.2-transaction-000001",
	"_type": "_doc",
	"_id": "EOToRXoB6AbCXZTTT6Dt",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.producer"],
		"service.framework.version": ["0.21b0"],
		"observer.name": ["instance-0000000009"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"],
		"transaction.sampled": [true],
		"transaction.id": ["a64ec14bc1536181"],
		"trace.id": ["60d681143464a3ff1c61880768907b0a"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["4580f63e-2761-408e-99f3-046cb774d7e3"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"service.framework.name": ["opentelemetry.instrumentation.requests"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [2389360],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["1e0706f92f11"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-26T01:21:30.605Z"],
		"timestamp.us": [1624670487254602],
		"@timestamp": ["2021-06-26T01:21:27.254Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.2"],
		"agent.version": ["1.2.0"],
		"parent.id": ["7ade6ef5e6220237"],
		"transaction.name": ["handler.producer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
	},
	"sort": [1624670487254]
}

michaelhyatt avatar Jun 26 '21 01:06 michaelhyatt

@michaelhyatt I've transferred this to elastic/apm as (AFAICS) there's no bug with the translation of OpenTelemetry spans -- the server is translating them faithfully. It appears that API GW is injecting/mutating traceparent, but is not sending any OTel spans. When we receive an OTel span for Lambda, it has a parent span ID which points to nowhere.

That missing API GW span would theoretically be the trace root, and we base certain features on the trace roots, such as the list of traces. If API GW spans were sent to APM Server, then I expect it would all just work. Seems like missing/broken functionality in API GW to me.
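
For completeness, a count along these lines (sketch; the cluster address is a placeholder) shows whether any event for this trace lacks a parent.id; based on the documents above it would come back 0, which is why nothing is listed under Traces:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # placeholder cluster address
resp = es.count(index="apm-*", body={"query": {"bool": {
    "filter": [{"term": {"trace.id": "60d681143464a3ff1c61880768907b0a"}}],
    "must_not": [{"exists": {"field": "parent.id"}}],
}}})
print("root events for this trace:", resp["count"])  # expected: 0 for this trace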

axw avatar Jun 28 '21 01:06 axw

@axw The latest test was done with API-GW tracing turned off. Are you saying API-GW is still changing the trace header even though it is no longer part of the X-Ray traces?

michaelhyatt avatar Jun 28 '21 03:06 michaelhyatt

Yes, it looks like API GW is still adding a traceparent header before both Lambdas. The producer transaction has a field parent.id:7ade6ef5e6220237, and the consumer transaction has parent.id:e53f6b191e1d0281. There are no transactions or spans with these IDs.

axw avatar Jun 28 '21 03:06 axw

Yes, it looks like API GW is still adding a traceparent header before both Lambdas.

Could it be the Lambda extension that is adding the traceparent (collecting X-Ray-specific spans rather than sending OTel spans)? For instance, for the Lambda initialization phase, as shown in the screenshot above.

AlexanderWert avatar Jun 28 '21 05:06 AlexanderWert

Possibly, I don't know. I should have said X-Ray header anyway, since there are no W3C traceparent headers involved here. The bottom line is that we're being told the producer span has a parent which does not exist, and so none of the spans is considered to be the trace root.
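
To illustrate the mechanics, this is roughly how an X-Amzn-Trace-Id header maps onto the trace.id / parent.id values that end up in the APM documents (the header value in the example is reconstructed from the ids above, not captured from the actual request):

def parse_xray_header(value: str) -> dict:
    # X-Amzn-Trace-Id looks like: Root=1-<8 hex epoch>-<24 hex>;Parent=<16 hex>;Sampled=1
    parts = dict(kv.split("=", 1) for kv in value.split(";") if "=" in kv)
    _, epoch_hex, rand_hex = parts["Root"].split("-")
    return {
        "trace.id": epoch_hex + rand_hex,   # 32 hex chars: the OTel trace id
        "parent.id": parts.get("Parent"),   # 16 hex chars: becomes the event's parent
        "sampled": parts.get("Sampled") == "1",
    }

# Reconstructed example using ids from this trace: if whatever injected Parent never
# exports a matching span, the transaction ends up with a parent.id that points nowhere.
print(parse_xray_header("Root=1-60d68114-3464a3ff1c61880768907b0a;Parent=7ade6ef5e6220237;Sampled=1"))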

axw avatar Jun 28 '21 06:06 axw

I came to the same conclusion:

Type  id                parent.id         timestamp.us      name              service.name
T     a64ec14bc1536181  7ade6ef5e6220237  1624670487254602  handler.producer  aws-python-api-worker-project-dev-producer
S     e5edc4fb3045615a  a64ec14bc1536181  1624670487255663  HTTP GET          aws-python-api-worker-project-dev-producer
T     8d38b2d9bee1c20f  e53f6b191e1d0281  1624670489577889  handler.consumer  aws-python-api-worker-project-dev-consumer
(T = transaction, S = span)

The parent.id of the consumer/downstream transaction is not the id of the parent HTTP GET span as expected: it is e53f6b191e1d0281 instead of e5edc4fb3045615a.

cyrille-leclerc avatar Jun 28 '21 07:06 cyrille-leclerc

I tested this setup with Jaeger and it handles this case quite well, although it displays warnings about the missing parents.

(screenshots)

michaelhyatt avatar Jun 30 '21 23:06 michaelhyatt