
KEYCLOAK-8288 Observability #2

Open slaskawi opened this issue 5 years ago • 10 comments

https://issues.redhat.com/browse/KEYCLOAK-8288

This Pull Request contains the Observability design proposal. If you'd like to view it with inlined images, please use this link: https://github.com/slaskawi/keycloak-community/blob/KEYCLOAK-8288-Observability/design/observerability.md

slaskawi avatar Sep 24 '20 09:09 slaskawi

I really like this new proposal.

One thing we do might be interesting here too.

We separate the view at the API level, i.e. how healthy is the account API vs. how healthy is the admin API vs. the authentication/authorization APIs.

Why? We judge different APIs to be of different criticality to the customer. E.g. if the admin API fails, it mostly does not affect the end-user, while an issue in the login flows and authorization endpoints will affect the whole userbase.

Does this make sense? A simple view of this would be additional series displayed in the error graphs, split by API.
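
A minimal sketch of what such per-API error series could look like in PromQL, assuming a Micrometer-style `http_server_requests_seconds_count` metric with `uri` and `outcome` tags is exposed (the metric and tag names are assumptions, not something the proposal defines):

```promql
# Per-endpoint server-error rate; a dashboard could then group the uri values
# into account / admin / login panels.
sum by (uri) (
  rate(http_server_requests_seconds_count{outcome="SERVER_ERROR"}[5m])
)
```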

bs-matil avatar Sep 24 '20 10:09 bs-matil

This is a great start, but there's still quite a bit that needs to be done before we can agree on what exactly it should look like. One very important thing is listing failure scenarios, and how dashboards, logging, etc. would help SRE staff to identify and resolve those scenarios. Without it I feel we are kinda shooting in the dark.

stianst avatar Sep 25 '20 07:09 stianst

This is a great start, but there's still quite a bit that needs to be done before we can agree on what exactly it should look like. One very important thing is listing failure scenarios, and how dashboards, logging, etc. would help SRE staff to identify and resolve those scenarios. Without it I feel we are kinda shooting in the dark.

As part of the design of observability I want a list of failure scenarios, and a report analysing how the proposed design would allow SRE staff to detect and resolve the issues covered in the failure scenarios. That is the only way to verify that what we are building is what is needed.

@stianst I agree. Adding the failure scenarios and showing how to resolve them is a good idea. I'll add this in the next iteration.

Another thing that is missing in this design is more details around logging. What logging do we have that is useful if any? Is there logging that is missing (related to failure scenarios)? What about log entry numbers?

@stianst I had some discussions around this with @david-martin. The main conclusion was that log messages are quite useful when investigating failures. Detecting whether a service is up and healthy should happen based on metrics. So in this sense, log messages are a secondary priority for this design.

Perhaps it would satisfy your requirements if I showed the number of error logs emitted for each failure scenario? This way we would have at least some idea about the logging, but I wouldn't need to go into the details. If this doesn't work for you, please suggest how you would like me to explore this area.

Also, what about Keycloak events? I can see they are collecting a lot of useful information, but this is not covered at all in this design.

@stianst After having a conversation with @david-martin, @maleck13 and @pb82, it seems the SRE Team is not interested (at least in the first iteration of this feature) in things like the number of successful/failed logins etc. Those metrics were not included in the SRE Dashboard, thus there's no need to expose them. In other words, once this feature gets implemented and the Keycloak Operator produces an updated SRE Dashboard, we'll drop the Aerogear extension (if that's what you're asking between the lines). Perhaps this conversation will be resumed when considering Observability v2 in the future.

@david-martin, @maleck13 and @pb82 Please let me know if I didn't interpret our previous conversations correctly and this is something we need.

I really like this new proposal.

@bs-matil Thank you!

We separate the view at the API level, i.e. how healthy is the account API vs. how healthy is the admin API vs. the authentication/authorization APIs.

@bs-matil We'd like to divide Keycloak into coarse-grained services, like Account, Admin, OIDC etc. (see the conversation in one of the comments). Then, we'll track the health of each of those coarse-grained components.

Why? We judge different APIs to be of different criticality to the customer. E.g. if the admin API fails, it mostly does not affect the end-user, while an issue in the login flows and authorization endpoints will affect the whole userbase.

@bs-matil Individual endpoints might not be the best way to do it. Imagine the OAuth Authorization endpoint being up (`/auth`) but, for some reason, the Token endpoint being down (`/token`). In this case OIDC (technically OAuth, but I believe we'll put this in the OIDC component) would report being down. I believe that's much more valuable information than "the Token endpoint is down and the Authorization endpoint is up". The end result is that the OAuth/OIDC service doesn't work.
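
Purely as an illustration of this grouping (not part of the proposal itself), a Prometheus recording rule could collapse both endpoints into a single OIDC-level error series; the metric name and URI patterns below are assumptions:

```yaml
groups:
  - name: keycloak-coarse-grained-services
    rules:
      # Hypothetical rule: fold the authorization and token endpoints into one
      # service-level error-rate series that a dashboard can plot directly.
      - record: keycloak:oidc_errors:rate5m
        expr: >
          sum(rate(http_server_requests_seconds_count{outcome="SERVER_ERROR",
            uri=~"/auth/realms/.*/protocol/openid-connect/(auth|token)"}[5m]))
```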

Does this make sense? A simple view of this would be additional series displayed in the error graphs, split by API.

@bs-matil We prefer to track it via Keycloak Service. See my explanation above.

slaskawi avatar Sep 25 '20 13:09 slaskawi

Dear all,

I've just uploaded the second iteration of the Observability proposal.

Differences from the previous version:

  • Proposed the new split for Keycloak Services: Admin, Account, OIDC, SAML, Token Exchange and User Federation
  • Updated the SRE Dashboard:
    • Included the new "Session Counts" in the "Detailed metrics" section. There was an RFE for this
    • Introduced "Operator metrics"
    • Added "User Federation" in "Experimental metrics"
  • Added the "Investigating failures" paragraph
  • Added more information about Logging Improvements (now in a separate paragraph)
  • Switched the implementation to Micrometer (a rough sketch of what this could look like is shown below)
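
To make the Micrometer direction a bit more concrete, here is a minimal sketch of how a per-service timer could be registered and exposed in the Prometheus format; the metric name, the tag and the class are illustrative only, not part of the proposal:

```java
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class MetricsSketch {

    public static void main(String[] args) {
        // Registry that renders metrics in the Prometheus exposition format.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Hypothetical per-service timer; "service" would be one of the
        // coarse-grained Keycloak services proposed in the document.
        Timer requests = Timer.builder("keycloak.request")
                .tag("service", "oidc")
                .publishPercentileHistogram() // buckets for latency histograms
                .register(registry);

        requests.record(() -> { /* handle a request */ });

        // This text is what the /metrics endpoint would serve to Prometheus.
        System.out.println(registry.scrape());
    }
}
```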

Feedback is more than welcome!

Thanks, Sebastian

slaskawi avatar Oct 05 '20 11:10 slaskawi

@slaskawi Would it be possible to add how we plan to protect those endpoints? I can think of mTLS, but it would be nice to make it clear in the document.

Asking because this may have an impact on the Healthcheck proposal https://github.com/keycloak/keycloak-community/pull/55, and also on the implementation details.

abstractj avatar Oct 07 '20 13:10 abstractj

@slaskawi I also think that Keycloak Events are an important source of information and very useful when monitoring the environment. It is not only about the number of (un)successful login attempts, but also about monitoring system usage and detecting possible anomalies. I can give more examples of how they are important, but even a simple EMAIL_SEND_FAILED may indicate critical problems that may be impacting the business and the overall availability of the deployment.

I am also missing information about:

  • Response times. Mainly for authentication and/or any other service that impacts end-users. It should help to monitor usage and track anomalies once a baseline is created based on history.
  • Datasource
  • Thread Pool
  • Downstream Services which the server relies on. For instance, SMTP Server, LDAP Server, etc (Datasource too, as mentioned above)
  • Session Counts. In addition to what you have, I would say that some other metrics are important like those related to eviction and removal. Possibly, separate graphs for the most critical caches.
  • Cluster. Not sure if this is possible, but somehow monitor cluster partitions. This is usually related to unavailability issues and may indicate problems in other areas.
  • The servers are going to be behind an LB/proxy, right? I think metrics we can obtain from LB/proxy are important too to understand how much load we are getting. In conjunction with the Thread Pool metrics, we can track anomalies such as those related to thread contention.

I also think that Coarse-grained Keycloak Services are too coarse-grained. For instance, what does OIDC really mean, and what is the Aggregated Latency Graph really showing? Is it related to authorization requests, or token requests, or refresh token requests? The same goes for SAML: is it related to authentication requests or responses?

pedroigor avatar Oct 08 '20 10:10 pedroigor

Would it be possible to add how we plan to protect those endpoints? I can think of mTLS, but it would be nice to make it clear in the document.

@abstractj That's a good point. I'll include this in the next iteration.
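
For illustration only (the exact mechanism is still to be decided), scraping an mTLS-protected /metrics endpoint would roughly look like this on the Prometheus side; the job name, host and certificate paths are placeholders:

```yaml
scrape_configs:
  - job_name: keycloak                # placeholder job name
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt        # CA that signed Keycloak's server cert
      cert_file: /etc/prometheus/certs/client.crt  # client cert presented to Keycloak
      key_file: /etc/prometheus/certs/client.key
    static_configs:
      - targets: ["keycloak.example.com:8443"]     # placeholder target
```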

I also think that Keycloak Events are an important source of information and very useful when monitoring the environment. It is not only about the number of (un)successful login attempts, but also about monitoring system usage and detecting possible anomalies. I can give more examples of how they are important, but even a simple EMAIL_SEND_FAILED may indicate critical problems that may be impacting the business and the overall availability of the deployment.

@pedroigor I totally agree. However, as I outlined in my previous comment, the SRE Team is mostly interested in keeping Keycloak alive. Checking if Keycloak is under attack and other similar things (@stianst called it "Security as a Service" at some point, which is quite a good term) is definitely something we should think about for the future.

I will create a separate paragraph, called "Future Work" and collect all those ideas in a single place. Does it work for you?

I also think that Coarse-grained Keycloak Services are too coarse-grained

@pedroigor This has been agreed with (and actually proposed by) @stianst. Perhaps you two could have a quick discussion and come up with a single proposal. I'm open to suggestions here.

For instance, what does OIDC really mean, and what is the Aggregated Latency Graph really showing

@pedroigor This would be an aggregate latency (a histogram, to be precise) of the response times over the authorization and token endpoints. This is something to start with. The main idea is to get some average latency and observe if it goes up very quickly. Such behavior might indicate performance problems coming your way.
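
A sketch of what such an aggregated latency panel could query, assuming latency histogram buckets are published under a Micrometer-style metric name (the name and URI patterns are assumptions):

```promql
# Approximate p95 latency across the authorization and token endpoints,
# aggregated into a single OIDC series.
histogram_quantile(
  0.95,
  sum by (le) (
    rate(http_server_requests_seconds_bucket{uri=~"/auth/realms/.*/protocol/openid-connect/(auth|token)"}[5m])
  )
)
```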

slaskawi avatar Oct 12 '20 10:10 slaskawi

This morning, @stianst and I had a discussion on the future plans for both the Healthcheck API and the Observability designs.

We agreed that our main focus will be the Quarkus-based distribution. For the time being, we'll use what we have for the Wildfly-based distribution (which translates to the embedded metrics, the custom Aerogear extension for additional metrics, and the Healthcheck as they are now). Both designs will be repurposed and updated in the following days.

One of the crucial differences is that we agreed to verify whether we can use a CDI extension in Quarkus (just to provide additional context: CDI has been removed from the Wildfly-based distribution as it introduced additional latency and increased the bootstrap time). Both the MicroProfile Health and Micrometer implementations provide CDI extensions to easily wire everything together.
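
As a rough illustration of that CDI wiring, a MicroProfile Health check discovered via CDI could look like the sketch below; the database check itself is purely hypothetical:

```java
import javax.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.HealthCheckResponseBuilder;
import org.eclipse.microprofile.health.Readiness;

// Picked up automatically through CDI and served from the health endpoint.
@Readiness
@ApplicationScoped
public class DatabaseHealthCheck implements HealthCheck {

    @Override
    public HealthCheckResponse call() {
        boolean databaseReachable = ping(); // hypothetical connectivity probe
        HealthCheckResponseBuilder builder = HealthCheckResponse.named("keycloak-database");
        return (databaseReachable ? builder.up() : builder.down()).build();
    }

    private boolean ping() {
        // Placeholder for a real datasource check.
        return true;
    }
}
```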

//cc @stianst @ASzc @miquelsi @abstractj @vmuzikar @pedroigor

slaskawi avatar Oct 14 '20 12:10 slaskawi

@pedroigor I totally agree. However, as I outlined in my previous comment, the SRE Team is mostly interested in keeping Keycloak alive. Checking if Keycloak is under attack and other similar things (@stianst called it "Security as a Service" at some point, which is quite a good term) is definitely something we should think about for the future.

Now I see your comment. Sorry. But still, events should give useful information about attacks. But I understand it is out of scope for now.

I will create a separate paragraph, called "Future Work" and collect all those ideas in a single place. Does it work for you?

Don't worry if it does not make sense to include it in this document.

I also think that Coarse-grained Keycloak Services are too coarse-grained

@pedroigor This has been agreed with (and actually proposed by) @stianst. Perhaps you two could have a quick discussion and come up with a single proposal. I'm open to suggestions here.

Take OIDC for instance. IMO, both the authorization and token endpoints should be considered a service. These are the main ones involved in the different flows when users are authenticating. Depending on how people use Keycloak, I would also have both token introspection and userinfo separated. Grouping all those into a single service adds little value because it won't be easy to visualize what is going on.

But again, I'm late here :) So maybe this is another thing that we'll look at later?

But yeah, other things that I mentioned such as the Thread and Database Pools, as well as caches, are quite crucial, IMO.

Maybe, SRE should talk with some people using Keycloak and Grafana (we can talk about which ones in private as they are customers) to get some inspiration.

For instance, what does OIDC really mean, and what is the Aggregated Latency Graph really showing

@pedroigor This would be an aggregate latency (a histogram, to be precise) of the response times over the authorization and token endpoints. This is something to start with. The main idea is to get some average latency and observe if it goes up very quickly. Such behavior might indicate performance problems coming your way.

Yeah, but still too high level. But it is OK if SRE is fine with it for now.

pedroigor avatar Oct 14 '20 20:10 pedroigor

Similarly to #55, I would like to see an example of a successful and unsuccessful response in the design. Could it be added?

@hmlnarik In the case of metrics, we don't take failure scenarios into account. The fact that Prometheus cannot scrape metrics (e.g. the /metrics endpoint returns HTTP 500) needs to be reported at the Platform Health level. I imagine there are separate dashboards to do that.
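
For completeness, detecting a failed scrape at that platform level is typically done from Prometheus' own `up` metric; a sketch of such an alerting rule (the job name and threshold are illustrative):

```yaml
groups:
  - name: keycloak-platform-health
    rules:
      # Prometheus sets up{} to 0 whenever a scrape of the /metrics endpoint
      # fails (timeout, HTTP 500, connection refused, ...).
      - alert: KeycloakMetricsScrapeFailing
        expr: up{job="keycloak"} == 0   # "keycloak" job name is a placeholder
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape the Keycloak /metrics endpoint"
```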

Albeit I understand this is currently out of scope for the reasons stated above, I completely agree with @pedroigor that events are an important source of attack detection and should eventually find their way to the observability endpoint, at least as a count per event type. Implementation of these should be pretty straightforward once the observability endpoint is up and running.

Now I see your comment. Sorry. But still, events should give useful information about attacks. But I understand it is out of scope for now.

@pedroigor @hmlnarik Ok, so what type of events would you like to see on the dashboard and why? Perhaps for the first iteration we should just have a histogram of what events were emitted? It's generic enough and lets you build other, more complicated dashboards based on this information.
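
A minimal sketch of how such a per-event-type count could be collected through the existing Keycloak event SPI; the Micrometer counter names and the way the registry is obtained are assumptions, not part of this proposal:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.keycloak.events.Event;
import org.keycloak.events.EventListenerProvider;
import org.keycloak.events.admin.AdminEvent;

// Hypothetical listener: increments one counter per Keycloak event type,
// which is enough to build a "what events were emitted" panel later.
public class MetricsEventListener implements EventListenerProvider {

    private final MeterRegistry registry;

    public MetricsEventListener(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void onEvent(Event event) {
        Counter.builder("keycloak.events")            // assumed metric name
                .tag("type", event.getType().name())  // e.g. LOGIN, EMAIL_SEND_FAILED
                .register(registry)
                .increment();
    }

    @Override
    public void onEvent(AdminEvent event, boolean includeRepresentation) {
        Counter.builder("keycloak.admin.events")      // assumed metric name
                .tag("operation", event.getOperationType().name())
                .register(registry)
                .increment();
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}
```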

Also, would there be a single observability endpoint, or would the metrics be somehow split into categories? From a performance perspective, the latter makes sense. The spec seems to leverage tags for this purpose. If we move in this direction, I believe the tags or any other means of metric separation should be part of this proposal.

@hmlnarik No, there's only one /metrics endpoint, which is scraped by Prometheus.

Take OIDC for instance. IMO, both the authorization and token endpoints should be considered a service. These are the main ones involved in the different flows when users are authenticating. Depending on how people use Keycloak, I would also have both token introspection and userinfo separated. Grouping all those into a single service adds little value because it won't be easy to visualize what is going on.

@pedroigor I think putting both of them into the same category makes more sense. Imagine the authorization endpoint being down and the token endpoint being up. What does it mean? Some parts of the OIDC spec might work and some might not. But from the user perspective, it's mostly down (you cannot log in, right?). This is why I think it makes more sense to put them into a single category.

But yeah, other things that I mentioned such as the Thread and Database Pools, as well as caches, are quite crucial, IMO.

@pedroigor I wouldn't go that far to be honest. At least not for an initial implementation. To me it's too fine-grained.

slaskawi avatar Nov 24 '20 10:11 slaskawi