aspire icon indicating copy to clipboard operation
aspire copied to clipboard

Oracle OpenTelemetry support

Open cmdkeen opened this issue 1 year ago • 10 comments
trafficstars

Closes #4072 by updating to the new ODP.Net version and adding OTel support via Oracle.ManagedDataAccess.OpenTelemetry. As mentioned in the issue this will need a new package and bumped version adding to the NuGet feeds.

Used the tracing settings taken from the Oracle sample to ensure that EF Core SQL text shows is provided to the traces.

Added an EndToEnd test using the Npgsql one as a base.

Microsoft Reviewers: Open in CodeFlow

cmdkeen avatar May 03 '24 20:05 cmdkeen

@mitchdenny or @danmoseley are you able to take a look at this?

cmdkeen avatar May 07 '24 08:05 cmdkeen

Adding @eerhardt for the Otel side of things.

mitchdenny avatar May 07 '24 08:05 mitchdenny

Importing the Otel package to internal feed so we can build this.

mitchdenny avatar May 07 '24 09:05 mitchdenny

Importing the Otel package to internal feed so we can build this.

@mitchdenny or @eerhardt are you able to add Testcontainers.Oracle to the internal feed as well to get this compiling again.

cmdkeen avatar May 08 '24 15:05 cmdkeen

@mitchdenny or @eerharor @eerhardt are you able to add Testcontainers.Oracle to the internal feed as well to get this compiling again.

Importing that package now. Should be there in 30 mins.

mitchdenny avatar May 09 '24 02:05 mitchdenny

@eerhardt @sebastienros would you be able to take a look with the failing tests? I've got them working on my machine, and made some progress on working out what's needed on the them running on the build agent (adding the Helix reference helped) but it looks like the ones that use Docker are still all failing. I did adjust the TriggerActivity to try and be obvious about connection issues to the database, but it feels like this is something else.

cmdkeen avatar May 10 '24 16:05 cmdkeen

@radical The Helix tests are failing on things related to this PR, can you sport something wrong with how the new tests are written?

sebastienros avatar May 10 '24 17:05 sebastienros

Looking at the test results the first test fails with: Screenshot 2024-05-10 at 14 39 40

Oracle might be taking a longer time to start up, or failing at some point? I would try to add some waits to confirm that the db is up, and accessible. This might be helpful too.

Also, there is https://github.com/dotnet/aspire/issues/3161 but it doesn't use testcontainers, and instead depends on the AppHost starting the container.

radical avatar May 10 '24 18:05 radical

/azp run

eerhardt avatar May 22 '24 18:05 eerhardt

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar May 22 '24 18:05 azure-pipelines[bot]

It looks like we aren't getting the database icon to show up in the dashboard:

image

compare that to SqlServerEndToEnd: image

The reason appears to be because there is no peer.service attribute/tag on the span:

image

vs.

image

@JamesNK - is there a reason why the span has to have peer.service for the DB icon to show up? Can't it having db.system be enough?

eerhardt avatar Jun 03 '24 18:06 eerhardt

peer.service indicates the span goes to an external app. The icon only appears in that situation.

The value of peer.service is used to map to a resource with the same address and the name is then included in the UI, e.g. sql1.

JamesNK avatar Jun 04 '24 03:06 JamesNK

server.address and server.port also cause the icon to be displayed: https://github.com/dotnet/aspire/blob/09d5efb49e258fe54e15578ff3b37cbbabe48e79/src/Aspire.Dashboard/Otlp/Model/OtlpHelpers.cs#L195-L224

They're both recommend attributes in database trace semantic conventions - https://opentelemetry.io/docs/specs/semconv/database/database-spans/

JamesNK avatar Jun 04 '24 03:06 JamesNK

I went through the different non-default settings of the OracleDataProviderInstrumentationOptions, and setting EnableConnectionLevelAttributes = true does provide:

image

That results in the addition of db.name, db.user, server.address and server.port being set on the spans. Goes back to the earlier discussion about whether we should be setting non-default values or not. If they're recommended values then that is perhaps a case for enabling that setting. Alternatively the approach I've got at the moment is allowing the user to inject settings and we mention this in the documentation, happy to take a steer either way.

Those conventions are useful in clarifying the issue around the SQL statement (SetDbStatementForText) and that it probably should be opt in if users are confident their SQL isn't exposing sensitive values.

Meanwhile back to trying to work out why TestContainers isn't working...

cmdkeen avatar Jun 04 '24 08:06 cmdkeen

Looking at the test results the first test fails with: Screenshot 2024-05-10 at 14 39 40

Oracle might be taking a longer time to start up, or failing at some point? I would try to add some waits to confirm that the db is up, and accessible. This might be helpful too.

Also, there is #3161 but it doesn't use testcontainers, and instead depends on the AppHost starting the container.

Hi @radical after much bashing away at this I managed to unearth that the tests are failing with:

Oracle.ManagedDataAccess.Client.OracleException (0x80004005): ORA-50201: Oracle Communication: Failed to connect to server or failed to parse connect string
 ---> OracleInternal.Network.NetworkException (0x80004005): ORA-50201: Oracle Communication: Failed to connect to server or failed to parse connect string
 ---> OracleInternal.Network.NetworkException (0x80004005): ORA-12514: Cannot connect to database. Service XEPDB1 is not registered with the listener at host 127.0.0.1/127.0.0.1 port 5432. (CONNECTION_ID=c4G0SvajqUmwqX6okJZmQA==)

That ORA-12514 code led me to https://github.com/gvenzl/oci-oracle-xe/issues/225 where it seems there might be issues with different networking setups. I don't know anything about Helix - is it potentially doing something along the lines of network=host?

The wait strategy in place on the container and the output logs at the point of the exception show that the container db has reached a point where it should work. In previous runs I've tried adding infinite waits on fail that went all the way up to the test timeout with no luck either, so it shouldn't be a case of the database still coming up.

cmdkeen avatar Jun 04 '24 08:06 cmdkeen

That results in the addition of db.name, db.user, server.address and server.port being set on the spans.

db.name and db.user aren't standard parameters. Making them opt-in makes sense.

server.address and server.port are part of the standard and is important information. I think the underlying Oracle library should include them by default. Until that happens, updating Aspire component to set configuration to include them is the next best option.

JamesNK avatar Jun 04 '24 08:06 JamesNK

That results in the addition of db.name, db.user, server.address and server.port being set on the spans.

db.name and db.user aren't standard parameters. Making them opt-in makes sense.

server.address and server.port are part of the standard and is important information. I think the underlying Oracle library should include them by default. Until that happens, updating Aspire component to set configuration to include them is the next best option.

@alexkeh is adding the two server properties something you're able to raise?

cmdkeen avatar Jun 04 '24 09:06 cmdkeen

@cmdkeen - Is setting EnableConnectionLevelAttributes = true enough to get all the OTel semantic convention attributes added to the trace? If yes, and if there are no security concerns with what information this setting enables, then setting that by default for now makes sense to me. We can remove it when the Oracle provider adds the OTel semantic convention attributes by default.

eerhardt avatar Jun 04 '24 13:06 eerhardt

@eerhardt looks to be, from the playground with that property turned on results in this for the sendExecuteRequest span: image

image

I'll make that the default for now, with a comment that it can be removed if/when the Oracle library supports.

In other happy news it looks like the Helix Docker tests are finally passing!

cmdkeen avatar Jun 04 '24 14:06 cmdkeen

That results in the addition of db.name, db.user, server.address and server.port being set on the spans.

db.name and db.user aren't standard parameters. Making them opt-in makes sense. server.address and server.port are part of the standard and is important information. I think the underlying Oracle library should include them by default. Until that happens, updating Aspire component to set configuration to include them is the next best option.

@alexkeh is adding the two server properties something you're able to raise?

@cmdkeen Let me check with my dev team.

alexkeh avatar Jun 04 '24 23:06 alexkeh

@mitchdenny and @eerhardt anything more you want doing on this?

cmdkeen avatar Jun 05 '24 12:06 cmdkeen

That results in the addition of db.name, db.user, server.address and server.port being set on the spans.

db.name and db.user aren't standard parameters. Making them opt-in makes sense. server.address and server.port are part of the standard and is important information. I think the underlying Oracle library should include them by default. Until that happens, updating Aspire component to set configuration to include them is the next best option.

@alexkeh is adding the two server properties something you're able to raise?

@cmdkeen Let me check with my dev team.

@cmdkeen I got feedback from my dev team. There are a couple reasons ODP.NET OTel does not expose server.address and server.port by default.

  1. Security - Organizations can be sensitive to have these two property values written to traces inadvertently and unknowingly. Having the user set EnableConnectionLevelAttributes makes it unambiguous.

  2. Other OTel DB provider behavior - We observed that Microsoft SqlClient provider doesn’t trace the equivalent server.address and server.port tags by default either. It requires EnableConnectionLevelAttributes set to true as well. Its default value is false. That applies to the MySQL provider as well.

alexkeh avatar Jun 05 '24 14:06 alexkeh

Other OTel DB provider behavior - We observed that Microsoft SqlClient provider doesn’t trace the equivalent server.address and server.port tags by default either. It requires EnableConnectionLevelAttributes set to true as well. Its default value is false.

When EnableConnectionLevelAttributes is not enabled, Microsoft.SqlClient will emit peer.service though. See this code.

    internal void AddConnectionLevelDetailsToActivity(string dataSource, Activity sqlActivity)
    {
        if (!this.EnableConnectionLevelAttributes)
        {
            sqlActivity.SetTag(SemanticConventions.AttributePeerService, dataSource);
        }
        else
        {
...

eerhardt avatar Jun 05 '24 15:06 eerhardt

That applies to the MySQL provider as well.

For the MySqlConnector library (which we have an Aspire component for), the reason the icon shows up is because MySqlConnector adds the net.peer.name and net.peer.port tags to the Activity.

And then the OTLP exporter appends peer.service to the tag list, using the above 2 tags.

eerhardt avatar Jun 05 '24 17:06 eerhardt

That applies to the MySQL provider as well.

For the MySqlConnector library (which we have an Aspire component for), the reason the icon shows up is because MySqlConnector adds the net.peer.name and net.peer.port tags to the Activity.

And then the OTLP exporter appends peer.service to the tag list, using the above 2 tags.

The Oracle provider doesn't have the peer.service attribute. If we support this attribute, we'll plan to emit it by default.

alexkeh avatar Jun 06 '24 18:06 alexkeh

@eerhardt given the clarification above do you have a steer on what needs doing to get this merged?

  1. Back to the default EnableConnectionLevelAttributes - for now devs will have to manually choose whether to enable. In time the Oracle library might add peer.service support.
  2. Stick with EnableConnectionLevelAttributes=true by default and document that it is configurable noting the security concern

Happy either way - suspect given the security element that 1 is the sensible choice.

cmdkeen avatar Jun 19 '24 14:06 cmdkeen

suspect given the security element that 1 is the sensible choice.

Yeah, that is where I'm leaning. There are reasons this is the default from Oracle, following suite is the better choice.

eerhardt avatar Jun 19 '24 17:06 eerhardt

Oracle might be taking a longer time to start up, or failing at some point? I would try to add some waits to confirm that the db is up, and accessible. This might be helpful too. Also, there is #3161 but it doesn't use testcontainers, and instead depends on the AppHost starting the container.

Hi @radical after much bashing away at this I managed to unearth that the tests are failing with:

Oracle.ManagedDataAccess.Client.OracleException (0x80004005): ORA-50201: Oracle Communication: Failed to connect to server or failed to parse connect string
 ---> OracleInternal.Network.NetworkException (0x80004005): ORA-50201: Oracle Communication: Failed to connect to server or failed to parse connect string
 ---> OracleInternal.Network.NetworkException (0x80004005): ORA-12514: Cannot connect to database. Service XEPDB1 is not registered with the listener at host 127.0.0.1/127.0.0.1 port 5432. (CONNECTION_ID=c4G0SvajqUmwqX6okJZmQA==)

That ORA-12514 code led me to gvenzl/oci-oracle-xe#225 where it seems there might be issues with different networking setups. I don't know anything about Helix - is it potentially doing something along the lines of network=host?

Sorry, I missed your message earlier. Do IIUC that the issue is fixed for the testcontainers case, but you are still seeing the above with EndToEnd tests which depend on the test app starting the containers?

Is it potentially doing something along the lines of network=host?

And is this something that would hinder it? I'm not sure if this is used when starting the container. https://docs.docker.com/network/drivers/host/ says that network=host would mean the container's network is not isolated from the host itself. I would guess that it shouldn't be using this, but I'm not sure. Is there some way to figure that out from within the container?

radical avatar Jun 19 '24 22:06 radical

Hi @radical - thanks for getting back. That networking issue never popped up for the EndToEnd tests in playground as they run locally. I managed to fix it for the Helix tests by using .WithHostname("localhost") so the tests are working. Without a detailed understanding of how Helix is configuring the networking settings I didn't want to dig further into what the problem might be. TBH if it's working now perhaps best to let it lie, at least as long as it seems to be a problem specific to Oracle TestContainer.

cmdkeen avatar Jun 21 '24 15:06 cmdkeen

Hi @sebastienros and @eerhardt - apologies for leaving this, it's holiday season in Europe, though no-one has told the weather here... Have reverted back to the default ODP.Net tracing options, while providing the option to specify if users do want the extra detail and are happy with the security trade-off in their environment. Assuming the builds all go green this should be in a position for a code review.

cmdkeen avatar Jul 12 '24 16:07 cmdkeen