cluster-api-provider-aws

Consistent Logging

Open pydctw opened this issue 3 years ago • 37 comments

I've found that the use of log formats and logging levels is inconsistent in the repo.

A few examples:

  1. Different log levels are used across reconcilers.
s.scope.V(2).Info("Reconciling network for cluster", "cluster-name", s.scope.Name(), "cluster-namespace", s.scope.Namespace())
s.scope.V(2).Info("Reconciling VPC")
s.scope.Info("Reconciling subnets")
s.scope.V(2).Info("Reconciling security groups")
  2. Log formats are different for the main controllers.
I0203 23:10:34.968187    1165 awsmachine_controller.go:414]  "msg"="Reconciling AWSMachine" 

I0203 23:12:40.884859    1165 awscluster_controller.go:202] controller/awscluster "msg"="Reconciling AWSCluster" "cluster"="quick-start-cghvfu" "name"="quick-start-cghvfu" "namespace"="quick-start-d0tfno" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
  3. Logging for AWS resource creation/deletion is inconsistent. The reason for using V(0) for "Deleted association between route table and subnet" while using V(2) for "Deleted security group" is not clear to me.
s.scope.Info("Deleted association between route table and subnet", "route-table-id", *rt.RouteTableId, "subnet-id", *as.SubnetId)
s.scope.V(2).Info("Deleted security group", "security-group-id", sg.ID, "kind", typ)

See https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3143.

What do you expect?

  • We should have guidelines/documentation about logging formats and levels to improve consistency.
  • Create follow-up issues if there are improvements that can be implemented.

Related Issues

pydctw avatar Feb 04 '22 18:02 pydctw

@pydctw: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 04 '22 18:02 k8s-ci-robot

@pydctw - we also have #3101 which is related. I know @Skarlso was looking at standardizing log levels as part of that work. It would be worth you both chatting.

richardcase avatar Feb 07 '22 10:02 richardcase

Oh nice! I have been working on a PoC to replace that stuff. It's a bit difficult to find out, though, which was supposed to do what. And it will be a rather large change, I'm afraid.

Skarlso avatar Feb 07 '22 10:02 Skarlso

@Skarlso, I would love to hear what you are thinking. Let's sync up.

pydctw avatar Feb 07 '22 20:02 pydctw

@pydctw Sure, will write up a detailed POC and link a WIP PR tomorrow. :)

Skarlso avatar Feb 07 '22 20:02 Skarlso

We should also consider adding JSON support during this refactoring, which is being discussed in CAPI: https://github.com/kubernetes-sigs/cluster-api/issues/5571#issuecomment-1034000389

sedefsavas avatar Feb 09 '22 18:02 sedefsavas

@sedefsavas I would rather do a single thing per PR. This will be a bit of a pain as it is anyway. :D I worked out a clean API for logging, but I have to replace a LOT of log lines. :))))) We can do it in a follow-up PR? :)

Skarlso avatar Feb 10 '22 08:02 Skarlso

Hmm, I read the whole thing. I guess I can use a different logger, like zerolog?

Skarlso avatar Feb 10 '22 08:02 Skarlso

@pydctw added the WIP here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3191 with a description and an API I'm suggesting. :)

Skarlso avatar Feb 10 '22 19:02 Skarlso

Thanks for making progress on this @Skarlso

Since we are doing refactoring, it might be a good time to think about having consistency with cluster-api so that users won't have to grep differently to collect logs from different managers in the same management cluster.

It might be worth discussing this further, including with other providers, to see if a common approach could be adopted instead of having custom logic here. cc @sbueringer given that you already started a discussion during the community meeting.

sedefsavas avatar Feb 10 '22 19:02 sedefsavas

For sure, I'm open to suggestions as to which logger to adopt. :) I can use whatever we want as long as it's consistent with everything else, I guess.

Skarlso avatar Feb 10 '22 19:02 Skarlso

To be clear, the reason for the clear interface is, of course, abstraction. I think it's sane to keep some kind of common interface and use whatever we want in the background. I would like to avoid having to change the logger again in so many places if that's okay. We can change the implementation underneath it easily if we keep a sane common interface.

Skarlso avatar Feb 10 '22 19:02 Skarlso

Hey, just a short info for now. In core CAPI we are only (or at least in almost all cases) logging with the standard log interface used in controller-runtime (which is logr.Logger).

So we don't really have our own log interface/implementation in core CAPI like CAPA has (as far as I know).

Core CAPI currently uses klog as the underlying implementation of that interface (here we set klogr as the controller-runtime logger).
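
For reference, that setup amounts to roughly the following (a minimal sketch; the exact wiring in core CAPI's main.go may differ):

package main

import (
	"k8s.io/klog/v2/klogr"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Use klog (through its logr adapter) as the controller-runtime logger,
	// so every controller logs through the same logr.Logger interface.
	ctrl.SetLogger(klogr.New())
}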

We are currently thinking about changing that underlying implementation to component-base/logs (which is used in kube-apiserver, kube-scheduler, ...). The effect is that the underlying implementation (behind logr.Logger) would still log via klog, but when the JSON logging format is enabled we would use the JSON logger implementation from component-base/logs. We would also inherit the log flags from component-base/logs.

I think the best summary from a core CAPI perspective is this comment: https://github.com/kubernetes-sigs/cluster-api/issues/5571#issuecomment-1034000389

To be honest, I don't know why CAPA currently has its own log interface, or the requirements that (probably) make it necessary.

sbueringer avatar Feb 10 '22 20:02 sbueringer

@sbueringer Hey Stefan. We didn't have one until now. :) CAPA is also using klog, the context logger, and logr.

However, it's not very user friendly. It's difficult to grep for keywords, and the notion of writing log.V(5).Info is just not very developer friendly. And then you'll have to remember which log level you used for which output.

I was in the process of making a unified interface that still uses logr in the background, but, at least from a logging perspective, we use things like log.Debug, log.Info, and log.Warning, and then just add a greppable key/value pair to it.
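
A minimal sketch of what such a wrapper could look like (the names and the mapping of Info/Warning/Debug/Trace onto V levels are assumptions for illustration, not the PoC's actual API):

package logger

import "github.com/go-logr/logr"

// Wrapper is a thin layer over logr that exposes named levels and tags
// each entry with a greppable "level" key/value pair.
type Wrapper struct {
	logr.Logger
}

func (w Wrapper) Info(msg string, kv ...interface{}) {
	w.Logger.V(0).Info(msg, append(kv, "level", "info")...)
}

func (w Wrapper) Warning(msg string, kv ...interface{}) {
	w.Logger.V(1).Info(msg, append(kv, "level", "warning")...)
}

func (w Wrapper) Debug(msg string, kv ...interface{}) {
	w.Logger.V(4).Info(msg, append(kv, "level", "debug")...)
}

func (w Wrapper) Trace(msg string, kv ...interface{}) {
	w.Logger.V(5).Info(msg, append(kv, "level", "trace")...)
}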

Skarlso avatar Feb 10 '22 20:02 Skarlso

Ah, got your point.

So it's about having a thin layer on top of go-logr which essentially provides the usual info/debug/... log levels instead of V(i) (in the code, as well as an additional "level": "x" k/v pair). I think I need more time to think this over :). We were kind of happy that we could potentially get rid of our own thin layer on top of go-logr that we've used in the ClusterClass implementation.

cc @fabriziopandini

P.S. I think this is orthogonal to replacing the logger "below" the interface, but I think it's relevant to some other log improvements we're thinking about making in core CAPI.

sbueringer avatar Feb 11 '22 08:02 sbueringer

I don't have strong opinions about logging interfaces, because all of them have both pros and cons. However, I'm strongly in favor of consistency across the ecosystem, and in CAPI we are using the following documents as a north star:

  • https://kubernetes.io/docs/concepts/cluster-administration/system-logs/
  • https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md

fabriziopandini avatar Feb 11 '22 14:02 fabriziopandini

Regarding this bit:

  • klog.InfoS() has multiple levels:
    • klog.V(0) - Generally useful for this to ALWAYS be visible to an operator
      • Programmer errors
      • Logging extra info about a panic
      • CLI argument handling
    • klog.V(1) - A reasonable default log level if you don't want verbosity.
      • Information about config (listening on X, watching Y)
      • Errors that repeat frequently that relate to conditions that can be corrected (pod detected as unhealthy)
    • klog.V(2) - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems.
      • Logging HTTP requests and their exit code
      • System state changing (killing pod)
      • Controller state change events (starting pods)
      • Scheduler log messages
    • klog.V(3) - Extended information about changes
      • More info about system state changes
    • klog.V(4) - Debug level verbosity
      • Logging in particularly thorny parts of code where you may want to come back later and check it
    • klog.V(5) - Trace level verbosity
      • Context to understand the steps leading up to errors and warnings
      • More information for troubleshooting reported issues

I'm struggling to see the benefit of this. :) You trade readability for what, exactly? Switches like --v 4 or --v 3 ask the user to remember at what level a given message is logged, as opposed to something like --v debug or --v trace, which is easier to remember. And in the code, you'll have to explicitly remember this mapping. Why put this burden on the user AND the developer? I've read Dave's post, and I understand what he means by proliferation of information in the logs. But in that case there is no point in using these levels either. :) Also, these log entries are anything but greppable or searchable for information such as names, actions, or filtering for errors and other relevant information. But maybe there is a point in not doing that at all, or not caring at all, which would be more in line with Dave's post.

And I've read the other point about why klog disagrees with Dave, but I don't really like that direction because it's not very human friendly, I believe.

In any case, I bow my head and conform, but I felt I had to share my opinion on the matter and why I wrote a wrapper interface in the first place.

Skarlso avatar Feb 11 '22 14:02 Skarlso

@Skarlso I personally appreciate, and am truly thankful for, your opinion, and I have no problem saying that there are a couple of points I agree with. My intent above was just to provide some context on how we are tackling this problem in CAPI. The final call here is clearly up to the CAPA community, not to me.

fabriziopandini avatar Feb 11 '22 16:02 fabriziopandini

@Skarlso I have some concerns about diverging from the cluster-api ecosystem, but I think it'd be worth discussing this during our community meeting on Feb 21st at 9 am PT.

sedefsavas avatar Feb 11 '22 17:02 sedefsavas

Thanks all for chiming in and sharing your opinions on this.

sedefsavas avatar Feb 11 '22 17:02 sedefsavas

Cool, and thanks to all! 😊👍

Skarlso avatar Feb 11 '22 18:02 Skarlso

I'm with Fabrizio on that one. I also see some benefit in other log levels, but I prefer to stay consistent with the Kubernetes ecosystem.

We also have to take into account that we are using a lot of upstream libraries like controller-runtime and client-go. They are all logging with the .V(x) option. I'm not sure if client-go is doing it already, but they at least want to use the logger we pass into calls via ctx (https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/3077-contextual-logging). Of course we could pass in a logger that translates the ~V(0)-V(10) messages to our log levels, but I personally would really prefer not to do that, in favor of conformance.
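
For example, with contextual logging a reconciler just pulls the logger from the context and keeps using the V(x) levels. A sketch only (the logger injected by controller-runtime already carries the controller/namespace/name key/value pairs):

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

func (r *AWSMachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Retrieve the request-scoped logger set up by controller-runtime.
	log := ctrl.LoggerFrom(ctx)
	log.V(2).Info("Reconciling AWSMachine")
	return ctrl.Result{}, nil
}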

But I also agree with Fabrizio, it's a good discussion to have, it's just my opinion and I think it's something that the CAPA community should decide.

sbueringer avatar Feb 17 '22 10:02 sbueringer

@Skarlso, we wanted to discuss this issue during yesterday's CAPA office hours, but since you were not there, I'm following up on the issue here.

This is my understanding from reading the comments above, but to confirm: are you OK with CAPA following the k8s ecosystem's logging convention of using V(0)-V(10) messages instead of the logging interface from the PoC?

pydctw avatar Mar 08 '22 20:03 pydctw

Thanks, sorry, I was out on vacation. :) Will take a look.

Skarlso avatar Mar 10 '22 09:03 Skarlso

A related issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2944

sedefsavas avatar Mar 11 '22 19:03 sedefsavas

@sedefsavas Okay, so... I'm okay with following this convention, but it will result in log entries that are not very searchable, which is what the initial problem also concluded. So I will have to wrap it in something that sets up searchable entries anyway. :)

Skarlso avatar Mar 13 '22 19:03 Skarlso

And even if we follow the verbosity levels, we still need guidance and constants in the code that help people know which verbosity level they should use for their situation.
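
Something along these lines, perhaps (the constant names and level assignments are hypothetical, just to illustrate the idea; they mirror the klog guidance quoted above):

package logger

// Verbosity levels to pass to logr's V(), so call sites don't have to
// remember raw numbers.
const (
	// LogLevelWarning: messages that should always be visible to an operator.
	LogLevelWarning = 0
	// LogLevelInfo: useful steady-state information and significant state changes.
	LogLevelInfo = 2
	// LogLevelDebug: extended information about changes.
	LogLevelDebug = 4
	// LogLevelTrace: steps leading up to errors and warnings.
	LogLevelTrace = 5
)

A call site would then read s.scope.V(logger.LogLevelInfo).Info("Reconciling subnets") instead of using a bare number.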

richardcase avatar Mar 15 '22 14:03 richardcase

Absolutely.

Skarlso avatar Mar 15 '22 19:03 Skarlso

@Skarlso - will ping you on slack.

richardcase avatar Mar 16 '22 08:03 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 14 '22 09:06 k8s-triage-robot