[RFC] Structured Error type
/kind feature
Describe the solution you'd like
Sometimes I want to be able to add structured logging key-value pairs where an error is generated, even though it isn't logged until much further up the stack. It would be nice if structured logging information could be attached to the error itself and logged together with the error.
Rough proposal:
+type StructuredError struct {
+	error
+	KeysAndValues []any
+}
+
+// RecursiveKeysAndValues returns the key/value pairs attached to this error
+// and to any StructuredError that it wraps.
+func (err StructuredError) RecursiveKeysAndValues() []any {
+	kvs := err.KeysAndValues
+	var inner *StructuredError
+	if errors.As(err.error, &inner) {
+		// When duplicate keys are present, should the innermost or outermost error win?
+		kvs = serialize.MergeKVs(inner.RecursiveKeysAndValues(), kvs)
+	}
+	return kvs
+}
func (l *loggingT) errorS(err error, logger *logr.Logger, filter LogFilter, depth int, msg string, keysAndValues ...interface{}) {
+	var structuredErr *StructuredError
+	if errors.As(err, &structuredErr) {
+		keysAndValues = append(keysAndValues, structuredErr.RecursiveKeysAndValues()...)
+	}
	if filter != nil {
		msg, keysAndValues = filter.FilterS(msg, keysAndValues)
	}
	if logger != nil {
		logger.WithCallDepth(depth+2).Error(err, msg, keysAndValues...)
		return
	}
	l.printS(err, severity.ErrorLog, depth+1, msg, keysAndValues...)
}
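For illustration, here is a minimal usage sketch under this proposal. StructuredError is the hypothetical type proposed above (copied into the example so it is self-contained), and loadConfig, the path, and the key names are made up; klog.ErrorS is the existing klog API. With today's klog only the Error() text would be logged; with the proposed errorS change the attached key/value pairs would be appended automatically.

package main

import (
	"os"

	"k8s.io/klog/v2"
)

// Hypothetical type from the proposal above; in practice it would live in
// klog or logr rather than in the application.
type StructuredError struct {
	error
	KeysAndValues []any
}

// loadConfig attaches context to the error where it occurs, without logging
// anything itself.
func loadConfig(path string) error {
	if _, err := os.Stat(path); err != nil {
		return &StructuredError{
			error:         err,
			KeysAndValues: []any{"path", path, "component", "config-loader"},
		}
	}
	return nil
}

func main() {
	defer klog.Flush()
	// Much further up the stack the error is finally logged, without the
	// caller having to know which key/value pairs were attached below.
	if err := loadConfig("/no/such/config.yaml"); err != nil {
		klog.ErrorS(err, "failed to load configuration")
	}
}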
Open Question: should the message returned by structuredErr.Error() include the key/value pairs? In that case they should be omitted from the message when logged with klog, but that gets complicated when a non-structured error wraps a structured error that in turn wraps a non-structured error. My inclination is not to include them in the default Error() message for this reason.
Open Question: should the StructuredError type be hoisted to logr? cc @thockin
logr supports types that define a MarshalLog() method, but I don't know if that is plumbed all the way to klog, or which modes support it properly.
The klog text output currently checks for error first and then logs the message from Error. Changing that is not hard:
diff --git a/internal/serialize/keyvalues.go b/internal/serialize/keyvalues.go
index ad6bf11..33813b4 100644
--- a/internal/serialize/keyvalues.go
+++ b/internal/serialize/keyvalues.go
@@ -127,8 +127,6 @@ func KVListFormat(b *bytes.Buffer, keysAndValues ...interface{}) {
 			writeStringValue(b, true, StringerToString(v))
 		case string:
 			writeStringValue(b, true, v)
-		case error:
-			writeStringValue(b, true, ErrorToString(v))
 		case logr.Marshaler:
 			value := MarshalerToValue(v)
 			// A marshaler that returns a string is useful for
@@ -147,6 +145,8 @@ func KVListFormat(b *bytes.Buffer, keysAndValues ...interface{}) {
 			default:
 				writeStringValue(b, false, fmt.Sprintf("%+v", value))
 			}
+		case error:
+			writeStringValue(b, true, ErrorToString(v))
 		case []byte:
 			// In https://github.com/kubernetes/klog/pull/237 it was decided
 			// to format byte slices with "%+q". The advantages of that are:
However, that change only means that the structured error gets formatted with fmt.Sprintf("%+v"), like any other struct:
// Structured error with additional values that are not part of the message
// returned by Error().
type structuredError struct {
	error
	Attributes map[string]interface{}
}

// Overrides the Error function as the preferred way for logging
// this error.
func (err structuredError) MarshalLog() interface{} {
	return struct {
		Message    string
		Attributes map[string]interface{}
	}{
		Message:    err.Error(),
		Attributes: err.Attributes,
	}
}
structuredError{error: errors.New("fake error"), Attributes: map[string]interface{}{"x": 1, "y": 2}}
=>
E output.go:422] "structured error" err={Message:fake error Attributes:map[x:1 y:2]}
The extra information is there, but it's not getting formatted well. Improving that would be more complicated.
zapr also uses Error(). A patch to change that:
diff --git a/zapr.go b/zapr.go
index 8bb7fce..eb9d44b 100644
--- a/zapr.go
+++ b/zapr.go
@@ -213,7 +213,18 @@ func (zl *zapLogger) Info(lvl int, msg string, keysAndVals ...interface{}) {
 func (zl *zapLogger) Error(err error, msg string, keysAndVals ...interface{}) {
 	if checkedEntry := zl.l.Check(zap.ErrorLevel, msg); checkedEntry != nil {
-		checkedEntry.Write(zl.handleFields(noLevel, keysAndVals, zap.NamedError(zl.errorKey, err))...)
+		field := zap.NamedError(zl.errorKey, err)
+		if logMarshaler, ok := err.(logr.Marshaler); ok {
+			func() {
+				defer func() {
+					// Ignore internal error in MarshalLog call.
+					// err will get logged as normal error field.
+					recover()
+				}()
+				field = zap.Any(zl.errorKey, logMarshaler.MarshalLog())
+			}()
+		}
+		checkedEntry.Write(zl.handleFields(noLevel, keysAndVals, field)...)
 	}
 }
The output then becomes:
{"caller":"test/output.go:422","msg":"structured error","err":{"Message":"fake error","Attributes":{"x":1,"y":2}}}
@tallclair: is that good enough?
I'm a bit worried about "err" changing its schema depending on the type of error. I've been told that some log processing tools get confused when that happens. We don't guarantee it for other keys either, but "err" is a pretty common one.
It's a good point that perhaps the err argument should be formatted with Error(). Having it be a string in some cases and a struct in others may confuse systems.
funcr, for example, goes out of its way to call Error() and then log the result, rather than treating it as any other key-value pair.
Perhaps we can log both with separate keys:
- "err":
<Error()> - "errObject":
<MarshalLog>
To achieve what @tallclair originally asked for (providing additional key/value pairs) we could add a new type to logr:
type KeyValuePairs map[string]interface{}
If, and only if, MarshalLog returns an object of that type, then a logger that supports this could ignore the key that normally would be used for the object (in the example above, "errObject") and instead log the key/value pairs as if they had been provided directly.
I'm undecided whether that special support should be limited to MarshalLog. It may also be useful to capture multiple key/value pairs in a single value without having to flatten them into a slice with other key/value pairs. This may even be recursive.
Key name conflicts could become a problem:
logger.Info("hello", "a": 1, "b": KeyValuePairs{"a": 2})
->
{"msg": "hello", "a":1, "a":2} ?!
I think I would instead concatenate the keys while expanding:
{"msg": "hello", "a":1, "b.a":2}
"structured error" a=1 b.a=2
While conceptually simple, using a map makes the output non-deterministic unless the logger sorts by key. A better type might be a slice:
type KeyValuePairs []KeyValuePair

type KeyValuePair struct {
	Key   string
	Value interface{}
}

I'm choosing to limit keys to strings because it simplifies the logging; we are not forced to accept interface{} keys as in the WithValues and Info/Error parameters.
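For illustration, a rough sketch (not an agreed logr API) of how a sink could expand such a slice, concatenating keys as described above; the KeyValuePair/KeyValuePairs types mirror the proposal, while the flatten helper is purely an assumption:

package main

import "fmt"

// Proposed types from the discussion above.
type KeyValuePair struct {
	Key   string
	Value interface{}
}

type KeyValuePairs []KeyValuePair

// flatten expands KeyValuePairs values recursively, joining nested keys
// with "." to avoid conflicts.
func flatten(prefix string, kvs []interface{}) []interface{} {
	var out []interface{}
	for i := 0; i+1 < len(kvs); i += 2 {
		key, _ := kvs[i].(string)
		if prefix != "" {
			key = prefix + "." + key
		}
		if nested, ok := kvs[i+1].(KeyValuePairs); ok {
			inner := make([]interface{}, 0, 2*len(nested))
			for _, kv := range nested {
				inner = append(inner, kv.Key, kv.Value)
			}
			out = append(out, flatten(key, inner)...)
			continue
		}
		out = append(out, key, kvs[i+1])
	}
	return out
}

func main() {
	// logger.Info("hello", "a", 1, "b", KeyValuePairs{{Key: "a", Value: 2}})
	// would then effectively become:
	fmt.Println(flatten("", []interface{}{"a", 1, "b", KeyValuePairs{{Key: "a", Value: 2}}}))
	// Output: [a 1 b.a 2]
}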
funcr has PseudoStruct which is similar to what you are describing. I did it in funcr because it felt kind of dirty to try to push that into every sink.
Maybe a construction-time flag to klog/etc like AlsoErrorMarshalLog, meaning to log both err=string and errRaw=? when err supports MarshalLog?
> funcr has PseudoStruct which is similar to what you are describing. I did it in funcr because it felt kind of dirty to try to push that into every sink.

Not every sink has to support it. If they don't, they'll just log the KeyValuePairs like they normally do (map or list, depending on what we choose), which still exposes the same information.

> Maybe a construction-time flag to klog/etc like AlsoErrorMarshalLog, meaning to log both err=string and errRaw=? when err supports MarshalLog?

Are there situations where we would want such an AlsoErrorMarshalLog to be false?
I implemented logging a "structured error" as "err" + "errDetails":
- https://github.com/go-logr/zapr/pull/56
- https://github.com/kubernetes/klog/pull/359
- https://github.com/go-logr/logr/pull/164
This is work in progress (it lacks some unit tests and depends on a logr release with those changes first), but it should be complete enough to discuss whether this makes sense conceptually.
The name of "errDetails" is configurable in zapr. Conceptually this is similar to how zap handles an error that implements fmt.Formatter: for those it adds "errVerbose".
Thanks for following up on this @pohly! I want to make sure I understand how all these pieces connect. My understanding is that with your changes above, I could do something like this:
type StructuredError struct {
	err    error
	detail logr.KeysAndValues
}

func (e *StructuredError) Error() string {
	return e.err.Error()
}

func (e *StructuredError) MarshalLog() interface{} {
	return e.detail
}
And then something like &StructuredError{errors.New("my error"), logr.KeysAndValues{{"extra": "foo-bar"}}} would get logged like:
err="my error" errDetails={extra="foo-bar"}
Does that look right? If I understood correctly, I think this works for my use case.
That's correct. One caveat is that there's no consensus yet on the logr.KeysAndValues type.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
Update: waiting for log/slog in Go 1.21 was IMHO worthwhile. Instead of adding a new special logr.KeysAndValues type, we can use slog.GroupValue and the LogValue() slog.Value method. Support for those is needed anyway if we want to support slog in a LogSink, and it was already implemented (PRs pending for klog and zapr).
I've updated https://github.com/go-logr/zapr/pull/56 and could do the same for klog.
@tallclair: is depending on slog and thus Go 1.21 okay?
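For illustration, a minimal sketch of that slog-based approach. The structuredError type, its fields, and the log message are assumptions; slog.LogValuer, slog.GroupValue, and the JSON handler are standard library APIs in Go 1.21+. How a LogSink combines this with separate err/errDetails keys is what the pending zapr/klog PRs implement.

package main

import (
	"errors"
	"log/slog"
	"os"
)

// structuredError carries extra attributes alongside the wrapped error.
type structuredError struct {
	err   error
	attrs []slog.Attr
}

func (e *structuredError) Error() string { return e.err.Error() }
func (e *structuredError) Unwrap() error { return e.err }

// LogValue implements slog.LogValuer: the attributes are exposed as a group,
// which sinks can emit as a nested object.
func (e *structuredError) LogValue() slog.Value {
	return slog.GroupValue(e.attrs...)
}

func main() {
	err := &structuredError{
		err:   errors.New("fake error"),
		attrs: []slog.Attr{slog.Int("x", 1), slog.Int("y", 2)},
	}
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	// Plain slog resolves LogValue() and logs the group under "err";
	// the zapr/klog integration discussed above would additionally keep
	// the Error() text under "err" and put the group under "errDetails".
	logger.Error("structured error", "err", err)
}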
/unassign thockin
/assign
The following PRs are ready for merging, if we agree on the approach:
- https://github.com/go-logr/zapr/pull/56
- https://github.com/kubernetes/kubernetes/pull/121984
TODO: klog text format
/lifecycle rotten
/remove-lifecycle rotten
ping @tallclair: still interested?
Yes. Nothing (that I'm aware of) is blocked on this, but I've run into a few places where this would be useful, either enabling more detailed error logging, or reducing the amount of context that needs to be plumbed through.
/lifecycle stale
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".