[Metrics generator] Prioritise OTel DB semantic attributes over `db.name` and `db.system`
Is your feature request related to a problem? Please describe.
If a database node does not produce spans of its own, Tempo looks for `db.name` and `db.system` in the calling span's attributes to create an appropriate downstream edge and node to complete a service graph.
However, these two attributes actually pertain to the database name inside the downstream service and the database type, rather than the hostname of the database itself.
For example, if several upstream services call the same downstream database service on a specific host, but actually make requests to different database names within that service, you'll see several different service nodes generated, instead of a single node called by multiple upstream services.
For `db.system`, the effect is reversed: there may be multiple hosts serving databases of the same type, but only a single service graph node is produced, which appears to be called by all upstream services.
Describe the solution you'd like
The OpenTelemetry semantic conventions for connection-level attributes are arguably what we want to prioritise instead, if they are present. I'd suggest maybe:

- `server.address`
- `server.socket.address`
- `db.connection_string` (bit handwavey about this, especially if it includes authentication details, although it's no worse than already seeing them in a span)
Should none of those connection attributes be present, Tempo should fall back to `db.name` and then `db.system`, on the basis that generating a node is better than not doing so.
Describe alternatives you've considered
You can, of course, use the attributes processor in a relevant collector to copy any relevant attribute that's already present to `peer.service`, `db.name`, or `db.system` instead, but if those target attributes are already set, the potential for overwriting data that is actually of use is high.
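As a rough illustration of the workaround above, a collector pipeline could copy `server.address` into `peer.service` using the attributes processor's `insert` action, which only sets the key when it is absent and so avoids the overwriting concern. This is a minimal sketch (the processor name `attributes/peer` and pipeline layout are placeholders, not a recommended production config):

```yaml
processors:
  attributes/peer:
    actions:
      # "insert" only sets peer.service if it is not already present,
      # so an existing, meaningful value is never overwritten.
      - key: peer.service
        from_attribute: server.address
        action: insert

service:
  pipelines:
    traces:
      processors: [attributes/peer]
```

The `update` or `upsert` actions would carry exactly the overwriting risk described above, which is why `insert` is the safer choice here.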
Additional context N/A
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.
> For `db.system`, this would actually have the reverse effect, where there may be multiple hosts serving databases of the same type, but which only produce a single service graph node that will appear to be called by all upstream services.
Is this essentially the problem that the user is describing in this issue: https://github.com/grafana/tempo/issues/3082?
I am awful, never saw the update to this issue. :(
Yes, @09jvilla that's exactly what the issue was in this case, the db.system name is the same so it's considered the same DB.
I've actually come back to this via another issue, as what a user is seeing is similar (but not the same), and I was curious whether we'd done anything about this yet.
Unfortunately no, and we haven't actively prioritized any work on this for the next 3 months. We can bring it up at an upcoming Tempo call for discussion to get a sense of how much effort it is. If it's on the larger side, it'll probably have to wait given the focus on TraceQL metrics queries (both adding new functions and improving performance).
Do you have any screenshots that you could drop on here of what the service graph looks like today? Just helps illustrate the problem a bit better (there was a screenshot in the issue I linked that was very helpful).
Here are a couple of screenshots that exhibit the problem in a nutshell.
The following shows two separate traces, each using different instances of services, including two separate Postgres DB instances, one available at mythical-database and the other at mythical-database2 (look at the `db.connection_string` attribute). Both DBs have an internal database called postgres; these have nothing to do with each other, as they're obviously in separate instances.
Here's the service graph view for all of the traces being ingested by Tempo.
Because `db.name` is being used to determine a remote DB instance, instead of the connection string, it appears that both the mythical-server and mythical-server2 instances are using the same, single Postgres DB service. But that's not the case; there is an individual Postgres instance for each of these separate services. This should show two separate Postgres instances, each being exclusively used by one instance of the server service.
Makes perfect sense - thanks for providing the example @hedss . I see that you would be able to disambiguate between the two instances if we instead used the connection string attribute to build the service graph.
Couldn't tell from the screenshots if your example spans had the server address and socket address attributes that you suggested using. (So in this example, it seems like we'd want to use the connection string.)
https://raintank-corp.slack.com/archives/C02C74N5C05/p1716404702262149
Summary idea for the lookup order would be the following. If present, the attribute would be used to identify the node on the graph.
1. `peer.service`
2. `server.address`
3. `network.peer.address:network.peer.port`
4. Fall back to existing behavior
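The lookup order above could be sketched as follows. This is purely illustrative, not Tempo's actual implementation; the function name and the flat attribute map are assumptions for the example, while the attribute names come from the OTel semantic conventions linked below:

```go
package main

import "fmt"

// downstreamNodeName sketches the proposed lookup order for identifying
// the downstream node on the service graph: peer.service first, then
// server.address, then network.peer.address(:port), and finally the
// existing db.name / db.system fallback.
func downstreamNodeName(attrs map[string]string) string {
	if v, ok := attrs["peer.service"]; ok {
		return v
	}
	if v, ok := attrs["server.address"]; ok {
		return v
	}
	if addr, ok := attrs["network.peer.address"]; ok {
		if port, ok := attrs["network.peer.port"]; ok {
			return addr + ":" + port
		}
		return addr
	}
	// Fall back to existing behavior: db.name, then db.system.
	if v, ok := attrs["db.name"]; ok {
		return v
	}
	return attrs["db.system"]
}

func main() {
	// Two spans that today collapse into a single "postgres" node
	// would now resolve to distinct hosts.
	fmt.Println(downstreamNodeName(map[string]string{
		"db.name": "postgres", "server.address": "mythical-database",
	}))
	fmt.Println(downstreamNodeName(map[string]string{
		"db.name": "postgres", "server.address": "mythical-database2",
	}))
}
```

With this ordering, the two Postgres instances from the screenshots above would produce two distinct service graph nodes rather than one shared "postgres" node.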
https://opentelemetry.io/docs/specs/semconv/general/attributes/ https://opentelemetry.io/docs/specs/semconv/database/database-spans/
@09jvilla Sorry, delayed response again! The attribute being used here is `db.name`, as the server address isn't actually available. Zach's posited a good way forward, I think, which is to use the newly updated attributes and then fall back to our current defaults if none of those exist.