SOLR-17785: Integrate foundational OTEL meter instruments
Integrate OTEL into SolrMetricManager and SolrMetricsContext and update the wt=prometheus endpoint to output Prometheus metrics from OTEL. Dropwizard is still kept in parallel for now, because removing many of its functions breaks many tests and would make this PR too large.
To summarize all changes below:
- SolrMetricManager and SolrMetricsContext can be used to create OTEL meter instruments such as a LongCounter to record metrics.
- SolrMetricProducer#initializeMetrics takes a set of Attributes that can be used when initializing OTEL metrics, eventually replacing scope.
- Created AttributedLongCounter, AttributedDoubleCounter, etc. These bind a set of attributes to a metric, similar to how Dropwizard initializes metrics. This also avoids having to rebuild Attributes every time a metric is captured.
- Added the OTEL equivalent of the Dropwizard metric capturing to RequestHandlerBase by updating its initializeMetrics and its corresponding RequestHandlerBaseTest.
- PrometheusResponseWriter and PrometheusFormatter code completely removed, as an OTEL->Prometheus exporter already exists with the OTEL SDK.
- A number of TODO comments noting what work is still needed for the OTEL migration, and where.
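A minimal sketch of the attribute-binding idea behind AttributedLongCounter, for readers who have not seen the pattern. It is not the code in this PR and does not use the OpenTelemetry API: SketchLongCounter and SketchAttributedLongCounter are hypothetical stdlib-only stand-ins, and plain string maps stand in for OTEL Attributes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for an OTEL-style counter (NOT the OpenTelemetry API):
// every distinct attribute set is recorded as its own series.
class SketchLongCounter {
  private final Map<Map<String, String>, LongAdder> series = new ConcurrentHashMap<>();

  void add(long value, Map<String, String> attributes) {
    series.computeIfAbsent(attributes, k -> new LongAdder()).add(value);
  }

  long valueFor(Map<String, String> attributes) {
    LongAdder adder = series.get(attributes);
    return adder == null ? 0L : adder.sum();
  }
}

// Binds one attribute set to a counter at construction time, so call sites
// only pass the value and never rebuild the attributes.
class SketchAttributedLongCounter {
  private final SketchLongCounter counter;
  private final Map<String, String> attributes;

  SketchAttributedLongCounter(SketchLongCounter counter, Map<String, String> attributes) {
    this.counter = counter;
    this.attributes = Map.copyOf(attributes); // immutable copy, built once
  }

  void inc() {
    add(1L);
  }

  void add(long value) {
    counter.add(value, attributes);
  }
}

class SketchAttributedCounterDemo {
  public static void main(String[] args) {
    SketchLongCounter requests = new SketchLongCounter();
    Map<String, String> attrs =
        Map.of("category", "QUERY", "scope", "/select", "type", "requests");
    SketchAttributedLongCounter selectRequests =
        new SketchAttributedLongCounter(requests, attrs);
    selectRequests.inc(); // hot path: no attribute construction here
    System.out.println(requests.valueFor(attrs));
  }
}
```

The point of the wrapper is that the attribute set is built once, where the metric is initialized, which mirrors how a Dropwizard metric is fully named at registration time.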
Sample Prometheus output from admin/metrics?wt=prometheus (OTEL -> Prometheus):
solr_metrics_core_requests_total{category="QUERY",collection="foobar",core="foobar_shard1_replica_n1",internal="true",otel_scope_name="solr.core.foobar.shard1.replica_n1",replica="replica_n1",scope="/select",shard="shard1",type="requests"} 0.0
solr_metrics_core_requests_total{category="QUERY",collection="foobar",core="foobar_shard1_replica_n1",internal="true",otel_scope_name="solr.core.foobar.shard1.replica_n1",replica="replica_n1",scope="/select",shard="shard1",type="serverErrors"} 0.0
solr_metrics_core_requests_total{category="QUERY",collection="foobar",core="foobar_shard1_replica_n1",internal="true",otel_scope_name="solr.core.foobar.shard1.replica_n1",replica="replica_n1",scope="/select",shard="shard1",type="timeouts"} 0.0
solr_metrics_core_requests_total{category="QUERY",collection="foobar",core="foobar_shard1_replica_n1",internal="true",otel_scope_name="solr.core.foobar.shard1.replica_n1",replica="replica_n1",scope="/terms",shard="shard1",type="clientErrors"} 0.0
solr_metrics_core_requests_total{category="QUERY",collection="foobar",core="foobar_shard1_replica_n1",internal="true",otel_scope_name="solr.core.foobar.shard1.replica_n1",replica="replica_n1",scope="/terms",shard="shard1",type="errors"} 0.0
Introducing OTEL's "scope" will be very interesting and foundational to Solr's adoption of OTEL metrics. I think that's the next step after this PR.
This PR already has "scope" in the metrics. All metrics have a tag called otel_scope_name. When an instrument is created from SolrMetricsContext, it has to be created through SolrMetricManager from a scope, and that scope is the registry name the SolrMetricsContext holds. Unless we want scope to be something else, I am keeping it the same as the registries, so "scope" is solr.node, solr.core.*, solr.jetty...
Actually, the more I work on this, the more I think I should rename SolrMetricsContext to SolrMetricsScope. That is outside the scope of this PR, so I will do it in another PR.
For otel_scope_name, we should set the scope the way the OTEL instrumentation scope spec recommends:
Developers can decide what denotes a reasonable instrumentation scope. For example, they can select a module, a package, or a class as the instrumentation scope.
So for RequestHandlerBase, the scopeName in SolrMetricsScope should maybe be the package name, org.apache.solr.handler. If we really want to keep registries, then we can add a label called registry, but I don't know if we still need the registry tag in an attribute-based framework. Core metrics, for example, already have the core name as a label, which makes them unique.
"Context" and "Scope" are almost synonymous... "Context" just potentially holds more than a scope. I'm not sure a rename is warranted; let's table that. Better to leave big renames until the end so the change is easier to review beforehand, and then do any rote renames that we think are warranted, perhaps in a completely separate PR/commit.
I read about instrumentation scope; thanks. (I edited my previous response.) I don't think there's much point in using a Java-package-like name here. I think scope is intended to basically refer to all of Solr (e.g. be something like "org.apache.solr"), possibly with module consideration added (e.g. add ".llm").
Remember AB's comment. To add to and clarify what he's getting at: we need to be able to unregister the metrics related to a core. The Dropwizard MetricRegistry per core was a way to do that. I chatted with Google Gemini about this matter, and it recommended creating a dedicated SdkMeterProvider in place of a Dropwizard MetricRegistry.
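To make the unregistration concern concrete, here is a rough sketch of the "one provider per core" lifecycle. It is only a stdlib model under stated assumptions: SketchCoreMeterProvider and SketchProviderManager are hypothetical names, not the OTEL SdkMeterProvider or anything in Solr, and a real implementation would presumably close or shut down the core's SdkMeterProvider at the step where this sketch calls close().

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a per-core meter provider (NOT the OTEL SDK class).
class SketchCoreMeterProvider implements AutoCloseable {
  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
  private volatile boolean closed;

  LongAdder counter(String name) {
    if (closed) {
      throw new IllegalStateException("provider closed");
    }
    return counters.computeIfAbsent(name, k -> new LongAdder());
  }

  Set<String> instrumentNames() {
    return Set.copyOf(counters.keySet());
  }

  @Override
  public void close() {
    closed = true;
    counters.clear(); // every instrument of this core goes away together
  }
}

// One provider per registry name, mirroring one Dropwizard MetricRegistry per core.
class SketchProviderManager {
  private final Map<String, SketchCoreMeterProvider> providers = new ConcurrentHashMap<>();

  SketchCoreMeterProvider providerFor(String registry) {
    return providers.computeIfAbsent(registry, k -> new SketchCoreMeterProvider());
  }

  // Intended to be called on core unload: drops all of that core's metrics in one step.
  void removeRegistry(String registry) {
    SketchCoreMeterProvider provider = providers.remove(registry);
    if (provider != null) {
      provider.close();
    }
  }

  Set<String> registries() {
    return Set.copyOf(providers.keySet());
  }
}
```

With this shape, unloading a core is a single removeRegistry call and node-level metrics are untouched, which is the property the per-core Dropwizard registry gave us.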
I think scope is intended to basically refer to all of Solr (e.g. be something like "org.apache.solr"), possibly with module consideration added (e.g. add ".llm").
Actually, this makes sense. If you have a single place where many different metrics are stored, the scope name might be the differentiator, especially if all your systems have some general metric like http_requests_total. So technically every scope should then be org.apache.solr. Do you still find value in keeping the concept of registry as a label on all metrics then?
If not, I don't see the need to keep SolrMetricsContext and we should just create metrics through SolrMetricManager because that is currently all I am using it for. I don't see much value in aggregating on a label like registry except for solr.core.* but we have core as a label already.
We don't need an attribute for the registry so long as our metrics are reasonably differentiated / clear. We can change our mind in the future with basically no disturbance, I suppose.
I wouldn't get rid of SolrMetricsContext yet. AB also kind of recommended keeping it to reduce the review/impact.
So I didn't move attributes into SolrMetricsContext. It ended up being a bigger refactor, because we are temporarily keeping Dropwizard in parallel to migrate incrementally. I may do it closer to the end of this migration, depending on what happens to SolrMetricsContext.
I have another PR pretty close to ready and would like to put that up for review.
It adds dynamic Solr core metric creation and deletion with the dedicated SdkMeterProvider and the correct OTEL scope name, and adds a bunch of tests around these foundational components. There are some larger changes needed to make this work that you'd probably be interested in.
I also have a WIP for some basic filtering, but there are some "gotchas" that we need to discuss, so I will put it aside for a separate discussion.
If you have nothing else here, I'd like to merge this and put up the new PR.