solr icon indicating copy to clipboard operation
solr copied to clipboard

SOLR-17432: Allow OTel Java Agent

Open dsmiley opened this issue 1 year ago • 8 comments

https://issues.apache.org/jira/browse/SOLR-17432

Admittedly there's a lot going on here; the PR is more of a refactoring / minor improvements that also happens to support the Otel agent too, as it's a small aspect. I can separate this PR. A number of things were gnawing at me, which I started to deal with.

API Change: TracerConfigurator subclasses now create an OpenTelemetry not a Tracer. Since we depend on GlobalOpenTelemetry, if a configurator were to return some custom Tracer thing (independent of an OpenTelemetry), it's not clear how we would install it into a GlobalOpenTelemetry. Maybe possible (?) but doesn't seem a good direction. We could stop our use of GlobalOpenTelemetry easily as it's only used in one spot that could actually get the Tracer from elsewhere. Maybe we should do that anyway, but nonetheless installing it is something we should continue to do.

Refactorings / changes:

  • removed all tracing config from MiniSolrCloudCluster
  • SolrTestCase resets tracing now
  • Only TracerConfigurator should call GlobalOpenTelemetry.set (removed from SimplePropagator & OtelTracerConfigurator) and it does so in exactly one spot.
  • Tests examining spans now use OpenTelemetryRule. Removed CustomTestOtelTracerConfigurator.java

todo: test Custom TracerConfigurator ? Do we even have a test actually using OtelTracerConfigurator.java?
Admittedly there's no test here for the Agent approach. Would need a bats test to do right, and it'd need to go download a 20MB jar. I tested it manually when running a test (to debug a test) and I have used it in Solr 9.

dsmiley avatar Sep 02 '24 05:09 dsmiley

When I run the agent, I provide an Otel properties file with: (an excerpt)

otel.metrics.exporter=none
otel.logs.exporter=none
otel.instrumentation.common.default-enabled=false
otel.instrumentation.opentelemetry-api.enabled=true
otel.instrumentation.opentelemetry-instrumentation-annotations.enabled=true
otel.instrumentation.aws-sdk.enabled=true
otel.instrumentation.java-http-client.enabled=true

Notice I disable auto-instrumentation by default and selectively enable the specific instrumentations useful to me. This works with Solr's manual instrumentation of itself as well because of enabling "opentelemetry-api". All this is documented quite well in Otel.

I don't see how this PR would have unintentional changes on existing traces/spans since they are tested and the PR doesn't change them. Any way, I'm happy to share a screenshot later.

This PR misses documentation to let users know they don't have to enable the opentelemetry module just to use opentelemetry :-). (an aside -- nor do we tell users they don't have to have an HDFS cluster to make use of the HDFS Directory). Also I'd like to somewhere document a tip on how to use tracing during development / debugging of a unit test.

dsmiley avatar Sep 15 '24 05:09 dsmiley

I'm a little unhappy with keeping TracerConfigurator named this while having it return an OpenTelemetry. Really; we're embracing OpenTelemetry here. I would love to use OpenTelemetry for metrics eventually, to be honest.

dsmiley avatar Sep 15 '24 05:09 dsmiley

I just saw SOLR-17455 coming in. While mentioning OpenTracing is correct for 9.x, for 10.0 it is gone, so perhaps when you add refGuide docs for this PR, you can also s/OpenTracing/OpenTelemetry/ in the guide?

janhoy avatar Sep 17 '24 09:09 janhoy

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

github-actions[bot] avatar Nov 17 '24 00:11 github-actions[bot]

I very much want to get back to this for Solr 10... need to maybe reformulate what's happening into multiple things. CC @mlbiscoc

dsmiley avatar Apr 04 '25 15:04 dsmiley

@dsmiley is this something you are still looking at? I will try to find time to review over the next weeks.

stillalex avatar Apr 29 '25 13:04 stillalex

+1 this looks good to me. when you have some time, could you rebase? curious to see what errors are left. I think the open question is the issue of multiple init calls causing that OTEL exception during tests.

stillalex avatar May 01 '25 12:05 stillalex

I'll come back to this today/tomorrow. I really appreciate the close look @stillalex :-) My biggest question is really scoping. This PR is a bit all over the place; it will take some work to break it into more meaningful chunks. I'll think about this further when I look closer again.

dsmiley avatar May 01 '25 12:05 dsmiley