apm-server
APM service map assumes service.environment is null for some services, possibly causing missing links in service map.
Cross-posting from this discussion board thread, which I now believe could be a bug.
Elastic version: 8.1.2
Elastic environment: Elastic Cloud on GCP us-east1
APM instrumentation: OpenTelemetry
Client browser: Chrome
Client OS: MacOS 12.1
Describe the bug: The APM Service Map incorrectly sets service.environment to null for two of my services, which might be the reason why they appear orphaned in the service map. I've verified that service.environment is set to development in every span that references those services. This behavior happens consistently for only those two services, even after starting with a fresh dataset and cluster. Each Python service is instrumented using the same code and the same environment variables (OTEL_RESOURCE_ATTRIBUTES set to deployment.environment=development), and therefore they should all behave very similarly for tracing.
Steps to reproduce: This is an instance of the microbs ecommerce application. It might be easier to troubleshoot if I provided direct access to the Elastic Cloud deployment where the APM data resides, because the deployment does not have sensitive data.
Response from GET https://ELASTICSEARCH_ENDPOINT/.ds-traces-apm*/_search?q=(service.name:payment+OR+service.name:product)+AND+NOT+service.environment:development
- Observe that there are no spans for the payment or product service in which service.environment is not development.
{
"took" : 85,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
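For anyone who wants to repeat this check against their own deployment, here is a minimal sketch in Python of how the query string above can be assembled and how the hit count can be read from the response. The helper names `build_search_url` and `hit_count` are made up for illustration, and ELASTICSEARCH_ENDPOINT remains a placeholder; the sketch does not send any request.

```python
import json
from urllib.parse import urlencode


def build_search_url(endpoint, services, environment):
    """Build a _search URL matching docs for `services` whose
    service.environment is NOT `environment` (hypothetical helper)."""
    names = " OR ".join(f"service.name:{name}" for name in services)
    query = f"({names}) AND NOT service.environment:{environment}"
    # urlencode percent-encodes the query and turns spaces into '+'.
    return f"{endpoint}/.ds-traces-apm*/_search?" + urlencode({"q": query})


def hit_count(response_body):
    """Return hits.total.value from a _search response body (JSON text)."""
    return json.loads(response_body)["hits"]["total"]["value"]


url = build_search_url(
    "https://ELASTICSEARCH_ENDPOINT", ["payment", "product"], "development"
)

# Trimmed copy of the response shown above.
body = '{"hits": {"total": {"value": 0, "relation": "eq"}, "hits": []}}'
count = hit_count(body)
```

A count of 0 means no document for those services carries a different (or missing) environment, which is the observation reported above.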
Response from GET https://KIBANA_ENDPOINT/internal/apm/service-map
- Observe that service.environment is null for the payment and product services, which is inconsistent with the results of the prior query.
{
"elements": [{
"data": {
"id": "web-gateway",
"service.environment": "development",
"service.name": "web-gateway",
"agent.name": "opentelemetry/python"
}
}, {
"data": {
"id": "api-gateway",
"service.name": "api-gateway",
"agent.name": "opentelemetry/cpp"
}
}, {
"data": {
"id": "content",
"service.environment": "development",
"service.name": "content",
"agent.name": "opentelemetry/python"
}
}, {
"data": {
"span.subtype": "http",
"span.destination.service.resource": "storage.googleapis.com:443",
"span.type": "external",
"id": ">storage.googleapis.com:443",
"label": "storage.googleapis.com:443"
}
}, {
"data": {
"id": "checkout",
"service.environment": "development",
"service.name": "checkout",
"agent.name": "opentelemetry/python"
}
}, {
"data": {
"id": "cart",
"service.environment": "development",
"service.name": "cart",
"agent.name": "opentelemetry/python"
}
}, {
"data": {
"span.subtype": "redis",
"span.destination.service.resource": "redis",
"span.type": "db",
"id": ">redis",
"label": "redis"
}
}, {
"data": {
"service.name": "product",
"agent.name": "opentelemetry/python",
"service.environment": null,
"id": "product"
}
}, {
"data": {
"service.name": "payment",
"agent.name": "opentelemetry/python",
"service.environment": null,
"id": "payment"
}
}, {
"data": {
"source": "api-gateway",
"target": "cart",
"id": "api-gateway~cart",
"sourceData": {
"id": "api-gateway",
"service.name": "api-gateway",
"agent.name": "opentelemetry/cpp"
},
"targetData": {
"id": "cart",
"service.environment": "development",
"service.name": "cart",
"agent.name": "opentelemetry/python"
}
}
}, {
"data": {
"source": "api-gateway",
"target": "checkout",
"id": "api-gateway~checkout",
"sourceData": {
"id": "api-gateway",
"service.name": "api-gateway",
"agent.name": "opentelemetry/cpp"
},
"targetData": {
"id": "checkout",
"service.environment": "development",
"service.name": "checkout",
"agent.name": "opentelemetry/python"
},
"bidirectional": true
}
}, {
"data": {
"source": "api-gateway",
"target": "content",
"id": "api-gateway~content",
"sourceData": {
"id": "api-gateway",
"service.name": "api-gateway",
"agent.name": "opentelemetry/cpp"
},
"targetData": {
"id": "content",
"service.environment": "development",
"service.name": "content",
"agent.name": "opentelemetry/python"
}
}
}, {
"data": {
"source": "cart",
"target": ">redis",
"id": "cart~>redis",
"sourceData": {
"id": "cart",
"service.environment": "development",
"service.name": "cart",
"agent.name": "opentelemetry/python"
},
"targetData": {
"span.subtype": "redis",
"span.destination.service.resource": "redis",
"span.type": "db",
"id": ">redis",
"label": "redis"
}
}
}, {
"data": {
"source": "checkout",
"target": "api-gateway",
"id": "checkout~api-gateway",
"sourceData": {
"id": "checkout",
"service.environment": "development",
"service.name": "checkout",
"agent.name": "opentelemetry/python"
},
"targetData": {
"id": "api-gateway",
"service.name": "api-gateway",
"agent.name": "opentelemetry/cpp"
},
"isInverseEdge": true
}
}, {
"data": {
"source": "content",
"target": ">storage.googleapis.com:443",
"id": "content~>storage.googleapis.com:443",
"sourceData": {
"id": "content",
"service.environment": "development",
"service.name": "content",
"agent.name": "opentelemetry/python"
},
"targetData": {
"span.subtype": "http",
"span.destination.service.resource": "storage.googleapis.com:443",
"span.type": "external",
"id": ">storage.googleapis.com:443",
"label": "storage.googleapis.com:443"
}
}
}, {
"data": {
"source": "web-gateway",
"target": "api-gateway",
"id": "web-gateway~api-gateway",
"sourceData": {
"id": "web-gateway",
"service.environment": "development",
"service.name": "web-gateway",
"agent.name": "opentelemetry/python"
},
"targetData": {
"id": "api-gateway",
"service.name": "api-gateway",
"agent.name": "opentelemetry/cpp"
}
}
}]
}
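To make the pattern in that response easier to see, here is a rough Python sketch that lists which service nodes have no edges and which have no service.environment value. It assumes only the response shape shown above (an "elements" list whose "data" objects are either nodes or edges with "source" and "target"); the function name `summarize_service_map` and the trimmed `sample` are illustrative, not part of Kibana.

```python
def summarize_service_map(service_map):
    """Return (orphans, missing_env) for the service nodes in a
    service-map response shaped like the one above."""
    nodes = {}         # service node id -> node data
    connected = set()  # ids that appear as an edge source or target
    for element in service_map["elements"]:
        data = element["data"]
        if "source" in data and "target" in data:
            connected.update((data["source"], data["target"]))
        elif "service.name" in data:
            # External nodes such as ">redis" have no service.name and are skipped.
            nodes[data["id"]] = data
    orphans = sorted(node_id for node_id in nodes if node_id not in connected)
    # Treat an explicit null and an absent key the same way.
    missing_env = sorted(
        node_id
        for node_id, data in nodes.items()
        if data.get("service.environment") is None
    )
    return orphans, missing_env


# Trimmed subset of the response above.
sample = {
    "elements": [
        {"data": {"id": "api-gateway", "service.name": "api-gateway"}},
        {"data": {"id": "cart", "service.name": "cart",
                  "service.environment": "development"}},
        {"data": {"id": ">redis", "span.type": "db", "label": "redis"}},
        {"data": {"id": "product", "service.name": "product",
                  "service.environment": None}},
        {"data": {"id": "payment", "service.name": "payment",
                  "service.environment": None}},
        {"data": {"id": "api-gateway~cart", "source": "api-gateway",
                  "target": "cart"}},
        {"data": {"id": "cart~>redis", "source": "cart", "target": ">redis"}},
    ]
}

orphans, missing_env = summarize_service_map(sample)
```

On this subset the orphans are payment and product, while the nodes without an environment are api-gateway, payment and product: in the response above api-gateway omits the service.environment key entirely, whereas product and payment carry an explicit null.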
Screenshot 1 of 2 - The service map is missing links among the product and payment services, whose service.environment is set to null in the XHR response with the service map data. Note that service.environment is actually set to development in all of the spans for those two services, which I confirmed by searching in Discover.
[Screenshot: service-map]
Screenshot 2 of 2 - This trace sample does display links between the services that were unlinked in the service map. This screenshot shows that the payment service is linked to the api-gateway service, but that link doesn't appear in the service map.
[Screenshot: trace]
I was able to fix the symptoms by changing api-gateway from an Nginx service instrumented with opentelemetry/cpp to a Python service instrumented with opentelemetry/python.
I don't think this is ready to be marked as resolved until we can determine why the opentelemetry/cpp instrumentation resulted in an incorrect presentation of data in Elastic APM. Plausibly, the opentelemetry/cpp instrumentation could have omitted data that Elastic APM required, or it could be that Elastic APM is treating the opentelemetry/cpp data differently. I'm inclined to think it's the former, but I'm not certain. I'll need to look for differences in span data produced by opentelemetry/cpp and opentelemetry/python.
Is there guidance on which fields the service map queries to present its graphical view?
Pinging @elastic/apm-ui (Team:apm)
FWICT from the trace waterfall, api-gateway connects to payment without an exit span; that is, the parent of the transaction on payment is a transaction on the api-gateway service. This possibly points to an instrumentation gap. We use exit spans (not transactions) to decide which traces should be sampled for discovering connections, which might be why the connection is not showing up.
As to why service.environment is missing for the product and payment services: we fetch data for all (related) services and show them as orphans in the service map if they don't show up in the traces we've sampled. We don't return anything for service.environment there, so that's expected.
I think this investigation should indeed focus on opentelemetry/cpp (and differences with opentelemetry/python and our own agents). It's likely that opentelemetry/cpp doesn't create the exit spans we need. I'm not intimately familiar with how OTel spans are translated to exit spans on APM Server, to be honest. In any case, I don't think this is a Kibana issue. Should we perhaps move it to the APM Server repo (where exit spans are created for OTel)? @dannycroft
@dgieselaar Yeah, this doesn't sound like a Kibana issue.
@simitt do you want to move this over to the APM Server repo for further investigation?
cc// @felixbarny
I think this investigation should indeed focus on opentelemetry/cpp (and differences with opentelemetry/python and our own agents). It's likely that opentelemetry/cpp doesn't create the exit spans we need.
@davemoore- would you be able to provide two sample events that are sent to the APM Server, one from opentelemetry/cpp and one from opentelemetry/python? We can then take a look at the differences and try to identify whether we need to make adaptations in the APM Server code or whether something is indeed missing in opentelemetry/cpp.
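Once two such events are available as JSON documents, a small script along these lines could help spot the differences. This is only a sketch: `flatten` and `diff_fields` are hypothetical helpers, and the two events below are invented placeholders, not real output from either agent.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. {"service.name": ...}."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def diff_fields(event_a, event_b):
    """Return (only_in_a, only_in_b, changed) as sorted lists of field paths."""
    flat_a, flat_b = flatten(event_a), flatten(event_b)
    only_in_a = sorted(flat_a.keys() - flat_b.keys())
    only_in_b = sorted(flat_b.keys() - flat_a.keys())
    changed = sorted(
        key for key in flat_a.keys() & flat_b.keys() if flat_a[key] != flat_b[key]
    )
    return only_in_a, only_in_b, changed


# Invented placeholder events; substitute the real documents here.
cpp_event = {
    "agent": {"name": "opentelemetry/cpp"},
    "service": {"name": "api-gateway"},
}
python_event = {
    "agent": {"name": "opentelemetry/python"},
    "service": {"name": "cart", "environment": "development"},
}

only_cpp, only_python, changed = diff_fields(cpp_event, python_event)
```

Fields that appear only on the opentelemetry/python side would be the first candidates to check against what the service map needs.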