datahub
datahub copied to clipboard
Remove codahale add micrometer
Checklist
- [ ] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- [ ] Links to related issues (if applicable)
- [ ] Tests for the changes have been added/updated (if applicable)
- [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
Unit Test Results (build & test)
359 tests - 262 359 :heavy_check_mark: - 258 1m 3s :stopwatch: - 14m 56s 98 suites - 59 0 :zzz: - 4 98 files - 59 0 :x: ± 0
Results for commit 1ec35458. ± Comparison against base commit 250f7ce1.
This pull request removes 262 tests.
com.datahub.authentication.authenticator.AuthenticatorChainTest ‑ testAuthenticateFailure
com.datahub.authentication.authenticator.AuthenticatorChainTest ‑ testAuthenticateSuccess
com.datahub.authentication.authenticator.AuthenticatorChainTest ‑ testAuthenticateThrows
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateFailureMismatchingCredentials
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateFailureMissingAuthorizationHeader
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateFailureMissingBasicCredentials
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateSuccessDelegatedActor
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateSuccessNoDelegatedActor
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testInit
com.datahub.authentication.authenticator.DataHubTokenAuthenticatorTest ‑ testAuthenticateFailureInvalidToken
…
hey @djordje-mijatovic,
Thanks for the contribution.
- Any change that changes exposed metrics at this level, needs to be feature flagged, so teams relaying on the metrics for alerting and dashboarding can rollout whenever they want.
- We agree that dropwizzard is not ideal, and planning to add proper prometheus support with labels, histograms.
I'm keen know what are the problems you are solving with this change.
Hey @djordje-mijatovic have you had a chance to take a look at Peter's feedback?
Hey @djordje-mijatovic have you had a chance to take a look at Peter's feedback?
Who is Peter?
If you mean the comment above, what this PR solves... I have removed CodaHale and instead of it, I have added Micrometer. @mattmatravers I suppose you can provide more details why we did it.
Who is Peter?
I think that's @szalai1.
If you mean the comment above, what this PR solves... I have removed CodaHale and instead of it, I have added Micrometer. @mattmatravers I suppose you can provide more details why we did it.
My understanding here is that CodeHale requires an extra JMX component to expose its metrics, and we'd like to reduce the overall API/vulnerability surface of DataHub. Micrometer is also the standard for metrics in Spring Boot so we'd rather not be using two standards where possible.
- Any change that changes exposed metrics at this level, needs to be feature flagged, so teams relaying on the metrics for alerting and dashboarding can rollout whenever they want.
A feature flag would be ideal for backwards compatibility. How much more work would this involve, @djordje-mijatovic?
A feature flag would be ideal for backwards compatibility. How much more work would this involve, @djordje-mijatovic?
We can't do that as the whole CodaHale code is already removed. Reverting CodaHale code is an option, but it will have a very negative effect on the performance so I wouldn't go this way.
@djordje-mijatovic I think you'll need to merge changes from master to fix the red builds.
Going over this again, I have some suspicion that we might not have an exactly like for like replacement, e.g. LongTaskTimer has a very special way of working if I recall correctly. We could add unit tests to prove we're recording it correctly or add feature flags, but this would add a lot of extra effort.
@szalai1 do you have any suggestions for a way forward here? Ideally this would be a drop-in replacement I don't think there are any existing unit tests.
@mattmatravers @djordje-mijatovic yes, I'm Peter.
My understanding here is that CodeHale requires an extra JMX component to expose its metrics, and we'd like to reduce the overall API/vulnerability surface of DataHub. Micrometer is also the standard for metrics in Spring Boot so we'd rather not be using two standards where possible.
This makes sense. JMX also provides extra JVM metrics, which are (can be) used for alerting, monitoring. Do you use these metrics? If not you can disable monitoring in helm and the jmx exporter won't be loaded in runtime (it's not even needed in the container), if yes with this change they will disappear.
This change is ok, if
- it produces identical metrics (same metric names and labels) as before
- or change is behind a feature flag
Metrics diffs I observed:
- missing jmx metrics (jvm_info)
- names are structured differently (commented above)
- labels are missing (e.g type="histogram" )
Notes:
- It's on our backlog to fully rewrite how we measure things in datahub, to fully utilise Prometheus. (e.g the current metrics calculate latency percentiles in the app, which should be done in Prometheus. ), it's more work than just replacing a lib and will break/fix metrics.
- Changes on metric endpoint (port, path), needs to be updated in
datahub-helm
too ( I can do that) - KafkaHealthChecker.java L40 needs to be updated too, builds fails
- These dashboards supposed to work after the change: https://github.com/datahub-project/datahub/tree/master/docker/monitoring/grafana/dashboards
- Docs: https://github.com/datahub-project/datahub/tree/master/docker/monitoring/grafana/dashboards
We talked about this with the team. Sharing our plans to get aligned. These are on our backlog, but not the highest priority at the moment:
- we would like to use metrics native to Prometheus in the long-run with labels and distributed histograms. Probably we are going to use: https://github.com/prometheus/client_java , with/or docs, this requires reimplemented all metrics to assign appropriate labels and having naming conventions.
- we would like to roll the change with flag to turn the new monitoring on and off, and not breaking the current way for those who use it in production.
- keeping JMX exporter optional, because it's not required for internal metics, but provides manny valuable metrics about the JVM.
@szalai1
Hello. Thank you for your comments. In the meantime we talk that it would be much safer to do it step by step so I have create another PR: https://github.com/datahub-project/datahub/pull/6889 This one should be more aligned and it should be testable. The Codahale is not removed but it is encapsulate. Only in the MetricUtils class you can see Codahale imports. In the same file I have added the Micrometer register and wrote MetricUtilsTest to comfirm both metric works the same way. Unfortunately, metric are a bit different so I had to introduce some modifications - everythin is documents in JavaDoc comments in MetricUtilsTest file. In theory this PR should not effect on CodaHale metrics, it should just add Micrometer metrics too. Once you are satisfied with the results we could remove Codahale or add a feature switch and leave it.