Provide missing user event metrics from aerogear/keycloak-metrics-spi with a key…
…cloak mircometer event listener
inspired by https://github.com/aerogear/keycloak-metrics-spi https://github.com/please-openit/keycloak-native-metrics https://github.com/thomasdarimont/keycloak-project-example/tree/main/keycloak/extensions/src/main/java/com/github/thomasdarimont/keycloak/custom/metrics
Preview of the docs: https://github.com/bohmber/keycloak/blob/issue-33043/docs/guides/server/event-metrics.adoc
fixes: #33043
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.registration.passwordless.PwdLessResidentKeyRegTest#residentKeyNotRequiredPresent
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=f892fe75-87a3-4861-8834-4585a812b5db&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
TODOs: Metrics names prefix vendor.kc or vendor or keycloak or vendor.keycloak? @ahus1 suggested the prefix keycloak. Which metrics should be part of the basic configuration?
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.registration.passwordless.PwdLessOtherSettingsTest#defaultValues
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=3ef29f94-72df-44b9-94e0-d7bcc7e568f4&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
org.keycloak.testsuite.webauthn.registration.passwordless.PwdLessPolicyJsInjectionTest#relyingPartyEntityName
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=63187ca7-8d4c-4901-ab1b-f31b287b839d&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
@bohmber - thank you for this PR. Please provide an example output of the metrics you created.
For the prefix, I'd go with keycloak.
@bohmber - another note: We would need some docs for this similar to what the Aerogear extension provides. If you re-use documentation or code from that repo, you need to attribute the changes to the original authors. For now, use the co-author in the commits. I might need to figure out how to do this properly to comply with our DCO.
@ahus1 documentation is a good topic. I need to figure out what makes sense. For the legal stuff I think I have to recreate the pr/commits with all co-authors
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.registration.passwordless.PwdLessPolicyJsInjectionTest#relyingPartyEntityName
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=6f2f477b-1cde-4b31-b6f8-306d544f70d6&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
org.keycloak.testsuite.webauthn.registration.passwordless.PwdLessResidentKeyRegTest#residentKeyNotRequiredPresent
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=b0687469-e672-4db0-b1e5-c654102f8cf8&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
@bohmber - I see you rebased or merged the main branch into the PR. Please avoid this unless you see conflicts or are being asked by a maintainer to do so as it avoid unnecessary rebuilds of the branch and notifications to the maintainers. Thanks!
@ahus1 That is fine for me. I added some basic documentation and you can disable AdminEvents and Events and the metrics enabled will be checked by the factory
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.WebAuthnIdlessTest#testWebAuthnIDLessWithNoWebAuthnPasswordlessCredentialLogin
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError:
Expected: a string containing "Failed to authenticate by the Passkey."
but: was null
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
...
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.WebAuthnPropertyTest#residentKey
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected LoginPage but was AUTH_RESPONSE (https://localhost:8543/auth/realms/master/app/auth?state=aa29ab71-9ffb-4999-90ac-252b45635131&session_state=62399a51-fcdd-4379-9599-d3ad1c3214a1&iss=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Ftest&code=dbe211be-9589-4837-8319-86cd60dd3ad8.62399a51-fcdd-4379-9599-d3ad1c3214a1.510362cc-92bd-4fd4-938b-bc38bf52f1d8)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at org.keycloak.testsuite.pages.LoginPage.open(LoginPage.java:262)
...
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.actions.RequiredActionUpdateProfileTest#updateProfile
Keycloak CI - Forms IT (chrome)
java.lang.RuntimeException: Could not create statement
at org.jboss.arquillian.junit.Arquillian.methodBlock(Arquillian.java:307)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
...
The content looks like this:
# TYPE keycloak_admin_event counter
# HELP keycloak_admin_event
keycloak_admin_event_total{operation="CREATE",realm="master",resource="CLIENT"} 1.0
keycloak_admin_event_total{operation="DELETE",realm="master",resource="CLIENT"} 1.0
# TYPE keycloak_event_logout counter
# HELP keycloak_event_logout
keycloak_event_logout_total{realm="master",result="success"} 1.0
# TYPE keycloak_event_code_to_token counter
# HELP keycloak_event_code_to_token
keycloak_event_code_to_token_total{client_id="security-admin-console",error="",provider="keycloak",realm="master",result="success"} 1.0
# TYPE keycloak_event_login counter
# HELP keycloak_event_login
keycloak_event_login_total{client_id="security-admin-console",error="",provider="keycloak",realm="master",result="success"} 1.0
keycloak_event_login_total{client_id="client_not_found",error="client_not_found",provider="keycloak",realm="master",result="error"} 1.0
# TYPE keycloak_event_refresh_token counter
# HELP keycloak_event_refresh_token
keycloak_event_refresh_token_total{client_id="security-admin-console",error="",provider="keycloak",realm="master",result="success"} 2.0
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.registration.PubKeySignRegisterTest#publicKeySignaturesEmpty
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=63bada11-1d01-4614-94b1-dcc983e9b9f5&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.webauthn.WebAuthnPropertyTest#requiredActionRegistration
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected LoginPage but was AUTH_RESPONSE (https://localhost:8543/auth/realms/master/app/auth?state=e6f24e12-6924-4c6a-b621-802cc10b40c5&session_state=33a75bd6-ef21-49fa-b4aa-65fce03ddaaf&iss=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Ftest&code=c6032f6e-c651-47c8-9de3-130064af791c.33a75bd6-ef21-49fa-b4aa-65fce03ddaaf.e877ca0b-4bce-468b-8864-49eeb8bf098f)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at org.keycloak.testsuite.pages.LoginPage.open(LoginPage.java:262)
...
org.keycloak.testsuite.webauthn.registration.PolicyJsInjectionTest#requireResidentKey
Keycloak CI - WebAuthn IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=37fc64da-9a07-4e7a-a5fd-a0a71493e9a9&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
org.keycloak.testsuite.forms.LevelOfAssuranceFlowTest#testWithOTPAndRecoveryCodesAtLevel2
Keycloak CI - Forms IT (chrome)
java.lang.AssertionError: Event expected
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertNotNull(Assert.java:713)
at org.keycloak.testsuite.AssertEvents.poll(AssertEvents.java:91)
...
org.keycloak.testsuite.forms.RegisterTest#registerUserUmlats
Keycloak CI - Forms IT (chrome)
java.lang.AssertionError: Expected RegisterPage but was Sign in to test (https://localhost:8543/auth/realms/test/protocol/openid-connect/auth?response_type=code&client_id=test-app&redirect_uri=https%3A%2F%2Flocalhost%3A8543%2Fauth%2Frealms%2Fmaster%2Fapp%2Fauth&state=b3ab3925-0ae4-4ddb-832e-f6a8164804d9&scope=openid)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
...
I had a look at this today with some more time, and here are some ideas/comments:
- I'd keep the provider tag, but probably rename it "idp", with the default value being an empty string, and not "keycloak"
- When I compare the admin events with the events around login, I'd say the are less useful. Could we drop the admin events for this first release?
- Instead of "buildCounterName" appending the name of the event to the metric name, I'd like to se the event type as a label. This would allow simpler handling in aggregate queries and counting events.
- Looking at some real examples, I see that "result" is probably redundant, as all successful entries will have an "error" set to an empty string. I know I asked for that earlier. What do you think about dropping the "result" again, and just keeping "error"? At the same time there might be an error without an error message, so in such a case we should set the error label to be "unknown".
- Looking at the metrics in the metrics overview, they don't have a description. AFAIK adding a description is only possible when using the
Counter.builder(). Not sure what the best practice is here ... maybe using a (concurrent) hash map with compute-if-absent to not call the builder every time? Still, it might be cleaner to call the builder every time - let's hope it is fast enough.
The listener needs to be added in each realm. I wonder if we can evolve this to be a CLI option, so you don't need to add it to each realm. Let me hear your thoughts on this.
- I'd keep the provider tag, but probably rename it "idp", with the default value being an empty string, and not "keycloak"
Done
- When I compare the admin events with the events around login, I'd say the are less useful. Could we drop the admin events for this first release?
Yes, we can skip it for the first release
- Instead of "buildCounterName" appending the name of the event to the metric name, I'd like to se the event type as a label. This would allow simpler handling in aggregate queries and counting events.
Event as label/tag could be the better option. But then we need different Meter names for the event with or without more details. There is a TODO in the code about the Meter names. In the PR there is the Meter name keycloak_events_total and keycloak_simple_events_total
- Looking at some real examples, I see that "result" is probably redundant, as all successful entries will have an "error" set to an empty string. I know I asked for that earlier. What do you think about dropping the "result" again, and just keeping "error"? At the same time there might be an error without an error message, so in such a case we should set the error label to be "unknown".
With the event as tag/label result is not needed anymore.
- Looking at the metrics in the metrics overview, they don't have a description. AFAIK adding a description is only possible when using the
Counter.builder(). Not sure what the best practice is here ... maybe using a (concurrent) hash map with compute-if-absent to not call the builder every time? Still, it might be cleaner to call the builder every time - let's hope it is fast enough.
The current preference is using the builder. What should be in the description?
The listener needs to be added in each realm. I wonder if we can evolve this to be a CLI option, so you don't need to add it to each realm. Let me hear your thoughts on this.
We could do something in EventBuilder to include the metrics eventlistener when enabled. But our configuration is automated there is not strong demand for the CLI option on our side. If you agree we could try to modify the EventBuilder to support a cli option.
Here is the example output with the latest changes:
# TYPE keycloak_simple_events counter
# HELP keycloak_simple_events The total number of Keycloak events
keycloak_simple_events_total{event="update_credential",realm="master"} 1.0
# TYPE keycloak_events counter
# HELP keycloak_events The total number of Keycloak events
keycloak_events_total{client_id="security-admin-console",error="",event="code_to_token",idp="",realm="master"} 6.0
keycloak_events_total{client_id="security-admin-console",error="",event="logout",idp="",realm="master"} 2.0
keycloak_events_total{client_id="security-admin-console",error="",event="login",idp="",realm="master"} 6.0
keycloak_events_total{client_id="security-admin-console",error="",event="refresh_token",idp="",realm="master"} 2.0
keycloak_events_total{client_id="client_not_found",error="client_not_found",event="login_error",idp="",realm="master"} 1.0
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
org.keycloak.testsuite.federation.ldap.LDAPUserLoginTest#loginLDAPUserCredentialVaultAuthenticationNoneEncryptionSSL
Keycloak CI - Java Distribution IT (windows-latest - temurin - 17)
java.lang.AssertionError: Expected AppPage but was Sign in to test (https://localhost:8543/auth/realms/test/login-actions/authenticate?execution=2f9fb67a-32cb-4b48-9039-9fae4cfe8286&client_id=test-app&tab_id=_AHOlMYrETQ&client_data=eyJydSI6Imh0dHBzOi8vbG9jYWxob3N0Ojg1NDMvYXV0aC9yZWFsbXMvbWFzdGVyL2FwcC9hdXRoIiwicnQiOiJjb2RlIiwic3QiOiI4OTAzNGRkNS05MzFkLTRhM2YtOTM2ZC03YzIwMGYyNjdiYTQifQ)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.keycloak.testsuite.pages.AbstractPage.assertCurrent(AbstractPage.java:47)
at jdk.internal.reflect.GeneratedMethodAccessor657.invoke(Unknown Source)
...
I did some experiments today, and created a PR https://github.com/bohmber/keycloak/pull/1 on top of your changes. The following thoughts drove this:
- If we add it to Keycloak we'll need to wrap it as a preview or experimental feature until we make this fully supported. Otherwise it will be hard to change the functionality if we need to.
- The error events should have the same event label as the non-error events, so you can sum up everything to get the total of events, and sum up everything without an error to see the successes.
- I prepared a new GlobalEventListener which doesn't need registration per realm. I think this is what admin want to see.
- The global event listener is configured by CLI options
- The configuration by event type might IMHO not be needed, as the number of events is reasonable, so I removed the code. If there is later a good reason to add it, we can still re-add it.
- All events should have the same labels to make it simpler to aggregate, so I removed the code for that
- If you have a lot of clients of IDPs, you will be swamped with metrics. So I added an option to skip those the tags.
The metrics tests are failing, some test would need to be added that the metrics actually show.
I also need to ask some other maintainer(s) to approve adding a new SPI, I'll reach out to them.
Besides that, what so you think about these changes?
Thanks Alexander,
I like the idea of having same labels for all events and allow skipping client and idp label. I'm still not sure that non-error and error event should have the same event label. We have one graph with logins per client and one graph with login_errors per client. The case that we need to sum logins und login_errors was never used. But I will accept your pr and we are waiting for feedback
We have one graph with logins per client and one graph with login_errors per client. The case that we need to sum logins und login_errors was never used.
I could see people calculating how many logins fail as a percentage of all login attempts.
Looking at https://prometheus.io/docs/practices/naming/, I read:
Use labels to differentiate the characteristics of the thing that is being measured
So to me, those are all logins, and they have the characteristics to be either successful (no error) or unsuccessful (with error).
Thank you for offering to merge it. I'll wait for more feedback from the core maintainers about the new SPI.
Rebased, as I saw some strange build errors.
For deployments with large number of realms and large numbers of clients (can be 100K+) would this not generate a huge amount of metrics that would impact performance of metric endpoints, use more memory, etc?
Not quite sure I see the use-cases for this.
While those events usually end up in logs, they are not a good source for metrics.
The answers it should provide are:
- number of logins
- number of failed logins
- refresh tokens total and failed
- client login
- client login failures
With a lower priority:
- logout and logout failures
For some installations with a limited number of realm and clients, those would be broken down by realm/client, for others where there are too many not.
The number above are also a base for the sizing guide, so Keycloak should provide those out of the box.
To add to the previous:
The audience would be admins monitoring the system.
- If any one this numbers would go outside of the normal (too many logins, to few logins, to many errors in logins), this would trigger an alert.
- You would analyze trends in the behavior or weekly and monthly cycles patterns, and then scale your setup up and down and plan ahead for extra capacity.
Thanks, my head is a bit clearer around this now.
@bohmber - can you please squash your commits and add a Signoff for your commits as outlined in the CONTRIBUTING.md guide? This would be required to merge this PR. Thanks!
Should I add the list of 'Co-authored-by' from aerogear/keycloak-metrics-spi as well?
@bohmber - you could mention in free-form that you were inspired by that code, and then add your regular user's sign-off. IMHO the code changed a lot during this PR, and there is not much left-over after those iterations. And it is OK to use that code as it is Apache 2 licensed, the same license we use in this project.
@ahus1 are the commits okay like this?