SOLR-10654: Introduce output of Prometheus metrics directly from Solr
https://issues.apache.org/jira/browse/SOLR-10654
Description
Introduce a new Prometheus response writer /solr/admin/metrics?wt=prometheus to output Dropwizards Solr Core registry metrics directly from Solr. Output looks similar to the prometheus exporter. This exports solr.core.* solr.node solr.jvm solr.jetty metric registries
Solution
Currently, Solr metrics are collected by Dropwizard and differentiated by metric names. This is an anti-pattern to Prometheus. Manually export Dropwizard Solr core registry metrics to a new SolrPrometheusExporter base class to parse out the labels from the metric name and output in a Prometheus readable format for the Prometheus Response writer. The exporter is just temporarily created to export a snapshot of what the current Dropwizard registry is at. These are not held in memory.
The exporter will categorize these metrics and export to prometheus format with labels depending on the category and metric names.
In standalone, all labels for the solr.core registry includescore. In cloud mode, all metrics include core, collection, shard, replica
NOTE: Not all metrics are possible to convert to Prometheus such as Dropwizard string metrics. This is because Prometheus does not support non-numeric types
Tests
Added MetricsHandlerTest, SolrPrometheusExporterTest, TestPrometheusReponseWriter
Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the
mainbranch. - [x] I have run
./gradlew check. - [x] I have added tests for my changes.
- [x] I have added documentation for the Reference Guide
Moving this out of draft for review
There are enough additional dependencies that I feel strongly this should be another module.
WDYT?
There are enough additional dependencies that I feel strongly this should be another module.
WDYT?
Sorry, I somehow completely missed this comment. When you say be another module, do you mean move it to solr/modules that needs to be enabled? If so, I think that prometheus metrics is valuable enough to be part of core and enabled by default but I also only added 2 additional dependencies. Is that enough to be its own module?
io.prometheus:prometheus-metrics-config:1.1.0 (1 constraints: 3e164ed8)
io.prometheus:prometheus-metrics-exposition-formats:1.1.0 (1 constraints: 0405f335)
io.prometheus:prometheus-metrics-model:1.1.0 (2 constraints: 411b133b)
io.prometheus:prometheus-metrics-shaded-protobuf:1.1.0 (1 constraints:
No, there are 4 dependencies. Only the last one concerns me. Why is protobuf needed at all? Can it be excluded? Can any of the others be excluded as well, while you're at it?
io.prometheus:prometheus-metrics-config:1.1.0 (1 constraints: 3e164ed8) io.prometheus:prometheus-metrics-exposition-formats:1.1.0 (1 constraints: 0405f335) io.prometheus:prometheus-metrics-model:1.1.0 (2 constraints: 411b133b) io.prometheus:prometheus-metrics-shaded-protobuf:1.1.0 (1 constraints:No, there are 4 dependencies. Only the last one concerns me. Why is protobuf needed at all? Can it be excluded? Can any of the others be excluded as well, while you're at it?
Ah ok you were correct. I went ahead and removed the transitive dependencies prometheus-metrics-shaded-protobuf and prometheus-metrics-config so there should now only be 2
@dsmiley or anyone else interested in reviewing. Was curious for more feedback on approach. Is this going in the correct direction or do some disagree if this is correct implementation path towards prometheus in Solr.
Much better. I know I've been a bit picky/detailed but I'm trying to achieve a level of clarity & conciseness here that will end up being repeated for the other metrics.
What do you think the next step should be? Totally completing the core metrics so that you have parity with the default Prometheus Exporter config?
No worries, I'd prefer you be picky than someone in the future seeing this and it makes no sense to them. I would say next steps are to export the other 3 registries solr.jetty, solr.jvm, solr.node and yes getting as close to the prometheus exporters config as possible. Luckily, in my opinion this was the hardest registry (hopefully I was right) and the ground work from this would be in place to do the rest.
What's here looks good to me. Just need to add more; lots more metrics. Maybe it'll evolve a little once you add another category. My interest on this at-work is not the core metrics but all the other ones (Solr, JVM, Jetty). I'd then be inclined to take this to our Solr fork to kick the tires for real world usage.
What's here looks good to me. Just need to add more; lots more metrics. Maybe it'll evolve a little once you add another category. My interest on this at-work is not the core metrics but all the other ones (Solr, JVM, Jetty). I'd then be inclined to take this to our Solr fork to kick the tires for real world usage.
Thanks. I've been caught up in some other things recently but plan on continuing on the other registries somewhat soon and it shouldn't be as bad to export. Would it be better for me to continue and commit them to this branch, or was this planned to be merged and export the other 3 registries in a separate PR?
It doesn't feel right to merge this PR right now. Needs docs. Also the choice of what metrics are done so far is inverted from the value that would make most sense for this whole approach -- OS/JVM/node issues are highest value.
It doesn't feel right to merge this PR right now. Needs docs. Also the choice of what metrics are done so far is inverted from the value that would make most sense for this whole approach -- OS/JVM/node issues are highest value.
@dsmiley Just finished up exporting solr.node solr.jvm and solr.jetty metrics to prometheus and is ready for another round of review whenever you are. I updated the description above and attached files if you want to see what the prometheus output looks like. Last thing I did not finish writing up is docs. I'm assuming the 2 places to update would be the metrics-api section and possibly Monitoring with Prometheus and Grafana?
Added ref-guide documentation on Prometheus endpoint
I'd like to suggest an integration test showing all the output (say to a file) stripped of numbers (to remove brittle/changing stuff) to show that we export what we mean to export. I could imagine someone renaming some metric and then forgetting to update the code here.
Thanks for reviewing again. Added in an integration test with a file stripped of values. Also the unit tests I wrote originally should catch if any changes that happen to the naming of the prometheus metric or labels.
Also thanks for catching missing javadocs. Made sure to add them in for each class and added sample values for what is being exported to prometheus.
I tried this PR locally using the "techproducts" example. I thing I noticed is that the metrics filtering appears to not work, which is a big problem. A test should be added for this, and it should work.
http://localhost:8983/solr/admin/metrics?category=solr.jetty&wt=prometheus
Filtering does work I think the parameter you are looking for is group. Try localhost:8983/solr/admin/metrics?group=jvm&wt=prometheus and you can change it with jetty node core for the corresponding registries
But still you are correct, I should have a test for this. Will add one in.
The only other thing I remember now about this is that regex, key, expr parameters for filtering will not work with this prometheus wt. I did that on purpose because those 3 parameters are for filtering based on the name of the metric while the others are kind of more generic such as registry or type filtering. I didn't think it made sense because prometheus naming is different from the dropwizard naming after exporting. For example: solr\.core\..*:QUERY\..*\.requestTimes:max_ms is not an output of prometheus but to allow expr filtering around that felt odd.
Even then if you need extensive filtering or trying to get some specific metric, maybe the Prometheus Exporter process with a custom config might be the better solution.
Oops; maybe I was hurried. I'm glad the category works at least! The other filtering options could be deferred to a future JIRA by matching the prometheus name. It's a very useful capability; some metrics implemented as gauges are expensive to compute, and others are just excessive output that aren't needed for O(N) for N cores if N is for some users quite high.
Oops; maybe I was hurried. I'm glad the category works at least! The other filtering options could be deferred to a future JIRA by matching the prometheus name. It's a very useful capability; some metrics implemented as gauges are expensive to compute, and others are just excessive output that aren't needed for O(N) for N cores if N is for some users quite high.
Added in a test for the filtering. I'd be happy expand on this and add further filtering feature specific to prometheus after this is ready to be merged.
This is super close now. Can you propose a brief CHANGES.txt under 9.7 "New Features" (I think it's closer to this than an improvement).
Added entry into CHANGES.txt