datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Remove codahale add micrometer

Open djordje-mijatovic opened this issue 2 years ago • 2 comments

Checklist

  • [ ] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [ ] Links to related issues (if applicable)
  • [ ] Tests for the changes have been added/updated (if applicable)
  • [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

djordje-mijatovic avatar Nov 18 '22 15:11 djordje-mijatovic

Unit Test Results (build & test)

359 tests   - 262   359 :heavy_check_mark:  - 258   1m 3s :stopwatch: - 14m 56s   98 suites  -   59       0 :zzz:  -     4    98 files    -   59       0 :x: ±    0 

Results for commit 1ec35458. ± Comparison against base commit 250f7ce1.

This pull request removes 262 tests.
com.datahub.authentication.authenticator.AuthenticatorChainTest ‑ testAuthenticateFailure
com.datahub.authentication.authenticator.AuthenticatorChainTest ‑ testAuthenticateSuccess
com.datahub.authentication.authenticator.AuthenticatorChainTest ‑ testAuthenticateThrows
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateFailureMismatchingCredentials
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateFailureMissingAuthorizationHeader
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateFailureMissingBasicCredentials
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateSuccessDelegatedActor
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testAuthenticateSuccessNoDelegatedActor
com.datahub.authentication.authenticator.DataHubSystemAuthenticatorTest ‑ testInit
com.datahub.authentication.authenticator.DataHubTokenAuthenticatorTest ‑ testAuthenticateFailureInvalidToken
…

github-actions[bot] avatar Nov 18 '22 18:11 github-actions[bot]

hey @djordje-mijatovic,

Thanks for the contribution.

  • Any change that changes exposed metrics at this level, needs to be feature flagged, so teams relaying on the metrics for alerting and dashboarding can rollout whenever they want.
  • We agree that dropwizzard is not ideal, and planning to add proper prometheus support with labels, histograms.

I'm keen know what are the problems you are solving with this change.

szalai1 avatar Nov 29 '22 16:11 szalai1

Hey @djordje-mijatovic have you had a chance to take a look at Peter's feedback?

aditya-radhakrishnan avatar Dec 19 '22 17:12 aditya-radhakrishnan

Hey @djordje-mijatovic have you had a chance to take a look at Peter's feedback?

Who is Peter?

If you mean the comment above, what this PR solves... I have removed CodaHale and instead of it, I have added Micrometer. @mattmatravers I suppose you can provide more details why we did it.

djordje-mijatovic avatar Dec 20 '22 10:12 djordje-mijatovic

Who is Peter?

I think that's @szalai1.

If you mean the comment above, what this PR solves... I have removed CodaHale and instead of it, I have added Micrometer. @mattmatravers I suppose you can provide more details why we did it.

My understanding here is that CodeHale requires an extra JMX component to expose its metrics, and we'd like to reduce the overall API/vulnerability surface of DataHub. Micrometer is also the standard for metrics in Spring Boot so we'd rather not be using two standards where possible.

mattmatravers avatar Dec 20 '22 10:12 mattmatravers

  • Any change that changes exposed metrics at this level, needs to be feature flagged, so teams relaying on the metrics for alerting and dashboarding can rollout whenever they want.

A feature flag would be ideal for backwards compatibility. How much more work would this involve, @djordje-mijatovic?

mattmatravers avatar Dec 20 '22 10:12 mattmatravers

A feature flag would be ideal for backwards compatibility. How much more work would this involve, @djordje-mijatovic?

We can't do that as the whole CodaHale code is already removed. Reverting CodaHale code is an option, but it will have a very negative effect on the performance so I wouldn't go this way.

djordje-mijatovic avatar Dec 20 '22 11:12 djordje-mijatovic

@djordje-mijatovic I think you'll need to merge changes from master to fix the red builds.

Going over this again, I have some suspicion that we might not have an exactly like for like replacement, e.g. LongTaskTimer has a very special way of working if I recall correctly. We could add unit tests to prove we're recording it correctly or add feature flags, but this would add a lot of extra effort.

@szalai1 do you have any suggestions for a way forward here? Ideally this would be a drop-in replacement I don't think there are any existing unit tests.

mattmatravers avatar Dec 22 '22 22:12 mattmatravers

@mattmatravers @djordje-mijatovic yes, I'm Peter.

My understanding here is that CodeHale requires an extra JMX component to expose its metrics, and we'd like to reduce the overall API/vulnerability surface of DataHub. Micrometer is also the standard for metrics in Spring Boot so we'd rather not be using two standards where possible.

This makes sense. JMX also provides extra JVM metrics, which are (can be) used for alerting, monitoring. Do you use these metrics? If not you can disable monitoring in helm and the jmx exporter won't be loaded in runtime (it's not even needed in the container), if yes with this change they will disappear.

This change is ok, if

  • it produces identical metrics (same metric names and labels) as before
  • or change is behind a feature flag

Metrics diffs I observed:

  • missing jmx metrics (jvm_info)
  • names are structured differently (commented above)
  • labels are missing (e.g type="histogram" )

Notes:

  • It's on our backlog to fully rewrite how we measure things in datahub, to fully utilise Prometheus. (e.g the current metrics calculate latency percentiles in the app, which should be done in Prometheus. ), it's more work than just replacing a lib and will break/fix metrics.
  • Changes on metric endpoint (port, path), needs to be updated in datahub-helm too ( I can do that)
  • KafkaHealthChecker.java L40 needs to be updated too, builds fails
  • These dashboards supposed to work after the change: https://github.com/datahub-project/datahub/tree/master/docker/monitoring/grafana/dashboards
  • Docs: https://github.com/datahub-project/datahub/tree/master/docker/monitoring/grafana/dashboards

szalai1 avatar Dec 27 '22 10:12 szalai1

We talked about this with the team. Sharing our plans to get aligned. These are on our backlog, but not the highest priority at the moment:

  • we would like to use metrics native to Prometheus in the long-run with labels and distributed histograms. Probably we are going to use: https://github.com/prometheus/client_java , with/or docs, this requires reimplemented all metrics to assign appropriate labels and having naming conventions.
  • we would like to roll the change with flag to turn the new monitoring on and off, and not breaking the current way for those who use it in production.
  • keeping JMX exporter optional, because it's not required for internal metics, but provides manny valuable metrics about the JVM.

szalai1 avatar Dec 28 '22 08:12 szalai1

@szalai1

Hello. Thank you for your comments. In the meantime we talk that it would be much safer to do it step by step so I have create another PR: https://github.com/datahub-project/datahub/pull/6889 This one should be more aligned and it should be testable. The Codahale is not removed but it is encapsulate. Only in the MetricUtils class you can see Codahale imports. In the same file I have added the Micrometer register and wrote MetricUtilsTest to comfirm both metric works the same way. Unfortunately, metric are a bit different so I had to introduce some modifications - everythin is documents in JavaDoc comments in MetricUtilsTest file. In theory this PR should not effect on CodaHale metrics, it should just add Micrometer metrics too. Once you are satisfied with the results we could remove Codahale or add a feature switch and leave it.

djordje-mijatovic avatar Dec 29 '22 14:12 djordje-mijatovic