[RFC] Structured Error type
/kind feature
Describe the solution you'd like
Sometimes I want to be able to add structured logging key-value pairs where an error is generated, even though it isn't logged until much further up the stack. It would be nice if structured logging information could be attached to the error itself and logged together with the error.
Rough proposal:
+type StructuredError struct {
+	error
+	KeysAndValues []any
+}
+
+// RecursiveKeysAndValues returns the key/value pairs attached to this error
+// and to any StructuredError that it wraps.
+func (err StructuredError) RecursiveKeysAndValues() []any {
+	kvs := err.KeysAndValues
+	var inner *StructuredError
+	if errors.As(err.error, &inner) {
+		// When duplicate keys are present, should the innermost or outermost error win?
+		kvs = serialize.MergeKVs(inner.RecursiveKeysAndValues(), kvs)
+	}
+	return kvs
+}
func (l *loggingT) errorS(err error, logger *logr.Logger, filter LogFilter, depth int, msg string, keysAndValues ...interface{}) {
+	var structuredErr *StructuredError
+	if errors.As(err, &structuredErr) {
+		keysAndValues = append(keysAndValues, structuredErr.RecursiveKeysAndValues()...)
+	}
	if filter != nil {
		msg, keysAndValues = filter.FilterS(msg, keysAndValues)
	}
	if logger != nil {
		logger.WithCallDepth(depth+2).Error(err, msg, keysAndValues...)
		return
	}
	l.printS(err, severity.ErrorLog, depth+1, msg, keysAndValues...)
}
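For illustration, here is a minimal usage sketch under this proposal. StructuredError is the hypothetical type proposed above (copied into the example so it is self-contained), and loadConfig, the path, and the key names are made up; klog.ErrorS is the existing klog API. With today's klog only the Error() text would be logged; with the proposed errorS change the attached key/value pairs would be appended automatically.

package main

import (
	"os"

	"k8s.io/klog/v2"
)

// Hypothetical type from the proposal above; in practice it would live in
// klog or logr rather than in the application.
type StructuredError struct {
	error
	KeysAndValues []any
}

// loadConfig attaches context to the error where it occurs, without logging
// anything itself.
func loadConfig(path string) error {
	if _, err := os.Stat(path); err != nil {
		return &StructuredError{
			error:         err,
			KeysAndValues: []any{"path", path, "component", "config-loader"},
		}
	}
	return nil
}

func main() {
	defer klog.Flush()
	// Much further up the stack the error is finally logged, without the
	// caller having to know which key/value pairs were attached below.
	if err := loadConfig("/no/such/config.yaml"); err != nil {
		klog.ErrorS(err, "failed to load configuration")
	}
}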
Open Question: should the message returned by structuredErr.Error() include the key/value pairs? In that case they should be omitted from the message when logged with klog, but that gets complicated when a non-structured error wraps a structured error that in turn wraps a non-structured error. My inclination is not to include them in the default Error() message for this reason.
Open Question: should the StructuredError type be hoisted to logr? cc @thockin
logr supports types that define a MarshalLog() method, but I don't know if that is plumbed all the way to klog, or which modes support it properly.
The klog text output currently checks for error first and then logs the message from Error. Changing that is not hard:
diff --git a/internal/serialize/keyvalues.go b/internal/serialize/keyvalues.go
index ad6bf11..33813b4 100644
--- a/internal/serialize/keyvalues.go
+++ b/internal/serialize/keyvalues.go
@@ -127,8 +127,6 @@ func KVListFormat(b *bytes.Buffer, keysAndValues ...interface{}) {
 			writeStringValue(b, true, StringerToString(v))
 		case string:
 			writeStringValue(b, true, v)
-		case error:
-			writeStringValue(b, true, ErrorToString(v))
 		case logr.Marshaler:
 			value := MarshalerToValue(v)
 			// A marshaler that returns a string is useful for
@@ -147,6 +145,8 @@ func KVListFormat(b *bytes.Buffer, keysAndValues ...interface{}) {
 			default:
 				writeStringValue(b, false, fmt.Sprintf("%+v", value))
 			}
+		case error:
+			writeStringValue(b, true, ErrorToString(v))
 		case []byte:
 			// In https://github.com/kubernetes/klog/pull/237 it was decided
 			// to format byte slices with "%+q". The advantages of that are:
However, that change only means that the structured error gets formatted with fmt.Sprintf("%+v"), like any other struct:
// Structured error with additional values that are not part of the message
// returned by Error().
type structuredError struct {
	error
	Attributes map[string]interface{}
}

// Overrides the Error function as the preferred way for logging
// this error.
func (err structuredError) MarshalLog() interface{} {
	return struct {
		Message    string
		Attributes map[string]interface{}
	}{
		Message:    err.Error(),
		Attributes: err.Attributes,
	}
}
structuredError{error: errors.New("fake error"), Attributes: map[string]interface{}{"x": 1, "y": 2}}
=>
E output.go:422] "structured error" err={Message:fake error Attributes:map[x:1 y:2]}
The extra information is there, but it's not getting formatted well. Improving that would be more complicated.
zapr also uses Error(). A patch to change that:
diff --git a/zapr.go b/zapr.go
index 8bb7fce..eb9d44b 100644
--- a/zapr.go
+++ b/zapr.go
@@ -213,7 +213,18 @@ func (zl *zapLogger) Info(lvl int, msg string, keysAndVals ...interface{}) {
 func (zl *zapLogger) Error(err error, msg string, keysAndVals ...interface{}) {
 	if checkedEntry := zl.l.Check(zap.ErrorLevel, msg); checkedEntry != nil {
-		checkedEntry.Write(zl.handleFields(noLevel, keysAndVals, zap.NamedError(zl.errorKey, err))...)
+		field := zap.NamedError(zl.errorKey, err)
+		if logMarshaler, ok := err.(logr.Marshaler); ok {
+			func() {
+				defer func() {
+					// Ignore internal error in MarshalLog call.
+					// err will get logged as normal error field.
+					recover()
+				}()
+				field = zap.Any(zl.errorKey, logMarshaler.MarshalLog())
+			}()
+		}
+		checkedEntry.Write(zl.handleFields(noLevel, keysAndVals, field)...)
 	}
 }
The output then becomes:
{"caller":"test/output.go:422","msg":"structured error","err":{"Message":"fake error","Attributes":{"x":1,"y":2}}}
@tallclair: is that good enough?
I'm a bit worried about "err" changing its schema depending on the type of error. I've been told that some log processing tools get confused when that happens. We don't guarantee it for other keys either, but "err" is a pretty common one.
It's a good point that perhaps the err argument should be formatted with Error(). Having it be a string in some cases and a struct in others may confuse systems.
funcr, for example, goes out of its way to call Error() and then log the result, rather than treating it as any other key-value pair.
Perhaps we can log both with separate keys:
- "err":
<Error()> - "errObject":
<MarshalLog>
To achieve what @tallclair originally asked for (providing additional key/value pairs) we could add a new type to logr:
type KeyValuePairs map[string]interface{}
If, and only if, MarshalLog returns an object of that type, then a logger that supports this could ignore the key that normally would be used for the object (in the example above, "errObject") and instead log the key/value pairs as if they had been provided directly.
I'm undecided whether that special support should be limited to MarshalLog. It may also be useful to capture multiple key/value pairs in a single value without having to flatten them into a slice with other key/value pairs. This may even be recursive.
Key name conflicts could become a problem:
logger.Info("hello", "a": 1, "b": KeyValuePairs{"a": 2})
->
{"msg": "hello", "a":1, "a":2} ?!
I think I would instead concatenate the keys while expanding:
{"msg": "hello", "a":1, "b.a":2}
"structured error" a=1 b.a=2
While conceptually simple, using a map makes the output non-deterministic unless the logger sorts by key. A better type might be a slice:
type KeyValuePairs []KeyValuePair

type KeyValuePair struct {
	Key   string
	Value interface{}
}

I'm choosing to limit keys to strings because it simplifies the logging; we are not forced to accept interface{} keys as in the WithValues and Info/Error parameters.
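For illustration, a rough sketch (not an agreed logr API) of how a sink could expand such a slice, concatenating keys as described above; the KeyValuePair/KeyValuePairs types mirror the proposal, while the flatten helper is purely an assumption:

package main

import "fmt"

// Proposed types from the discussion above.
type KeyValuePair struct {
	Key   string
	Value interface{}
}

type KeyValuePairs []KeyValuePair

// flatten expands KeyValuePairs values recursively, joining nested keys
// with "." to avoid conflicts.
func flatten(prefix string, kvs []interface{}) []interface{} {
	var out []interface{}
	for i := 0; i+1 < len(kvs); i += 2 {
		key, _ := kvs[i].(string)
		if prefix != "" {
			key = prefix + "." + key
		}
		if nested, ok := kvs[i+1].(KeyValuePairs); ok {
			inner := make([]interface{}, 0, 2*len(nested))
			for _, kv := range nested {
				inner = append(inner, kv.Key, kv.Value)
			}
			out = append(out, flatten(key, inner)...)
			continue
		}
		out = append(out, key, kvs[i+1])
	}
	return out
}

func main() {
	// logger.Info("hello", "a", 1, "b", KeyValuePairs{{Key: "a", Value: 2}})
	// would then effectively become:
	fmt.Println(flatten("", []interface{}{"a", 1, "b", KeyValuePairs{{Key: "a", Value: 2}}}))
	// Output: [a 1 b.a 2]
}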
funcr has PseudoStruct which is similar to what you are describing. I did it in funcr because it felt kind of dirty to try to push that into every sink.
Maybe a construction-time flag to klog/etc like AlsoErrorMarshalLog, meaning to log both err=string and errRaw=? when err supports MarshalLog?
> funcr has PseudoStruct which is similar to what you are describing. I did it in funcr because it felt kind of dirty to try to push that into every sink.

Not every sink has to support it. If they don't, they'll just log the KeyValuePairs like they normally do (map or list, depending on what we choose), which still exposes the same information.

> Maybe a construction-time flag to klog/etc like AlsoErrorMarshalLog, meaning to log both err=string and errRaw=? when err supports MarshalLog?

Are there situations where we would want such an AlsoErrorMarshalLog to be false?
I implemented logging a "structured error" as "err" + "errDetails":
- https://github.com/go-logr/zapr/pull/56
- https://github.com/kubernetes/klog/pull/359
- https://github.com/go-logr/logr/pull/164
This is work in progress (it lacks some unit tests and depends on a logr release with those changes first), but it should be complete enough to discuss whether this makes sense conceptually.
The name of "errDetails" is configurable in zapr. Conceptually this is similar to how zap handles an error that implements fmt.Formatter: for those it adds "errVerbose".
Thanks for following up on this @pohly! I want to make sure I understand how all these pieces connect. My understanding is that with your changes above, I could do something like this:
type StructuredError struct {
	err    error
	detail logr.KeysAndValues
}

func (e *StructuredError) Error() string {
	return e.err.Error()
}

func (e *StructuredError) MarshalLog() interface{} {
	return e.detail
}
And then something like &StructuredError{errors.New("my error"), logr.KeysAndValues{{"extra": "foo-bar"}}} would get logged like:
err="my error" errDetails={extra="foo-bar"}
Does that look right? If I understood correctly, I think this works for my use case.
That's correct. One caveat is that there's no consensus yet on the logr.KeysAndValues type.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
Update: waiting for log/slog in Go 1.21 was IMHO worthwhile. Instead of adding a new special logr.KeysAndValues type, we can use slog.GroupValue and the LogValue() slog.Value method. Support for those is needed anyway if we want to support slog in a LogSink, and it was already implemented (PRs pending for klog and zapr).
I've updated https://github.com/go-logr/zapr/pull/56 and could do the same for klog.
@tallclair: is depending on slog and thus Go 1.21 okay?
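For illustration, a minimal sketch of that slog-based approach. The structuredError type, its fields, and the log message are assumptions; slog.LogValuer, slog.GroupValue, and the JSON handler are standard library APIs in Go 1.21+. How a LogSink combines this with separate err/errDetails keys is what the pending zapr/klog PRs implement.

package main

import (
	"errors"
	"log/slog"
	"os"
)

// structuredError carries extra attributes alongside the wrapped error.
type structuredError struct {
	err   error
	attrs []slog.Attr
}

func (e *structuredError) Error() string { return e.err.Error() }
func (e *structuredError) Unwrap() error { return e.err }

// LogValue implements slog.LogValuer: the attributes are exposed as a group,
// which sinks can emit as a nested object.
func (e *structuredError) LogValue() slog.Value {
	return slog.GroupValue(e.attrs...)
}

func main() {
	err := &structuredError{
		err:   errors.New("fake error"),
		attrs: []slog.Attr{slog.Int("x", 1), slog.Int("y", 2)},
	}
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	// Plain slog resolves LogValue() and logs the group under "err";
	// the zapr/klog integration discussed above would additionally keep
	// the Error() text under "err" and put the group under "errDetails".
	logger.Error("structured error", "err", err)
}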
/unassign thockin
/assign
The following PRs are ready for merging, if we agree on the approach:
- https://github.com/go-logr/zapr/pull/56
- https://github.com/kubernetes/kubernetes/pull/121984
TODO: klog text format
/lifecycle rotten
/remove-lifecycle rotten
ping @tallclair: still interested?
Yes. Nothing (that I'm aware of) is blocked on this, but I've run into a few places where this would be useful, either enabling more detailed error logging, or reducing the amount of context that needs to be plumbed through.
/lifecycle stale
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".