Include "remote_addr" in Zed lake log messages about requests/responses

Open philrz opened this issue 2 years ago • 1 comments

Repro is with GA Zed tagged v1.8.0.

Consider the attached zed serve log zlake.txt that was generated during a repro of #4446 by connecting to the service as a remote lake from Zui and opening a new Query Session as described in https://github.com/brimdata/zui/issues/2775.

A community zync user recently reported seeing the #4446 log message in their environment and assumed it was severe due to it being an HTTP 500 and having a stack trace. The intent of #4446 is to turn that into an HTTP 400, so that will help ensure the message is not misinterpreted in the future. However, when trying to explain it to the user in the moment, one of the shortcomings in the message is that there's a lack of context about what may have triggered it.

$ zq -Z 'status==500' zlake.txt 
{
    level: "warn",
    ts: 1.6855728930280771e+09,
    logger: "core",
    msg: "Error",
    request_id: "2QZtNmA0Bwzkq63F5Ct51KXK1Bl",
    status: 500,
    error: "pool name missing",
    stacktrace: "github.com/brimdata/zed/service.(*ResponseWriter).Error\n\t/home/runner/work/zed/zed/service/request.go:258\ngithub.com/brimdata/zed/service.handleQuery\n\t/home/runner/work/zed/zed/service/handlers.go:61\ngithub.com/brimdata/zed/service.(*Core).handler.func1\n\t/home/runner/work/zed/zed/service/core.go:188\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/rs/cors.(*Cors).Handler.func1\n\t/home/runner/go/pkg/mod/github.com/rs/[email protected]/cors.go:219\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/brimdata/zed/service.panicCatchMiddleware.func1.1\n\t/home/runner/work/zed/zed/service/middleware.go:92\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/brimdata/zed/service.accessLogMiddleware.func1.1\n\t/home/runner/work/zed/zed/service/middleware.go:72\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/brimdata/zed/service.requestIDMiddleware.func1.1\n\t/home/runner/work/zed/zed/service/middleware.go:28\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/home/runner/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\ngithub.com/brimdata/zed/service.(*Core).ServeHTTP\n\t/home/runner/work/zed/zed/service/core.go:234\nnet/http.serverHandler.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2947\nnet/http.(*conn).serve\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:1991"
}

For instance, if that log message had included the IP address of the client that sent the request, it would have helped us divide & conquer, i.e., we could have seen the IP of their zync server or of a known Zui desktop client and then look for corresonding logs at that location.

As it turns out, it's possible to piece this together today in an environment with default log settings, in that we could search for the request_id in this message, which pairs it up with the Request completed message that does include the remote_addr field for the client IP (199.83.220.169 in this case).

$ zq -Z '"2QZtNmA0Bwzkq63F5Ct51KXK1Bl"' zlake.txt 
{
    level: "warn",
    ts: 1.6855728930280771e+09,
    logger: "core",
    msg: "Error",
    request_id: "2QZtNmA0Bwzkq63F5Ct51KXK1Bl",
    status: 500,
    error: "pool name missing",
    stacktrace: "github.com/brimdata/zed/service.(*ResponseWriter).Error\n\t/home/runner/work/zed/zed/service/request.go:258\ngithub.com/brimdata/zed/service.handleQuery\n\t/home/runner/work/zed/zed/service/handlers.go:61\ngithub.com/brimdata/zed/service.(*Core).handler.func1\n\t/home/runner/work/zed/zed/service/core.go:188\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/rs/cors.(*Cors).Handler.func1\n\t/home/runner/go/pkg/mod/github.com/rs/[email protected]/cors.go:219\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/brimdata/zed/service.panicCatchMiddleware.func1.1\n\t/home/runner/work/zed/zed/service/middleware.go:92\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/brimdata/zed/service.accessLogMiddleware.func1.1\n\t/home/runner/work/zed/zed/service/middleware.go:72\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/brimdata/zed/service.requestIDMiddleware.func1.1\n\t/home/runner/work/zed/zed/service/middleware.go:28\nnet/http.HandlerFunc.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2109\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/home/runner/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\ngithub.com/brimdata/zed/service.(*Core).ServeHTTP\n\t/home/runner/work/zed/zed/service/core.go:234\nnet/http.serverHandler.ServeHTTP\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:2947\nnet/http.(*conn).serve\n\t/opt/hostedtoolcache/go/1.19.9/x64/src/net/http/server.go:1991"
}
{
    level: "info",
    ts: 1.6855728930282407e+09,
    logger: "http.access",
    msg: "Request completed",
    request_id: "2QZtNmA0Bwzkq63F5Ct51KXK1Bl",
    host: "18.222.122.101:9867",
    method: "POST",
    proto: "HTTP/1.1",
    remote_addr: "199.83.220.169:53588",
    request_content_length: 79,
    url: "/query?ctrl=true",
    elapsed: 0.00536952,
    response_content_length: 55,
    status_code: 500
}

However, I see two limitations with this:

It requires the user to know how to do this pairing on their end, or have them send us their entire log so we do the pairing on our end. Yes, we could augment our docs to teach users how to do these things. However, since many users will often send just the one "bad-looking" log message as the first step in their debug, packing the extra context into the more severe error reduces the number of debug round-trips with us (and increases the likelihood that they could figure it out on their end without reaching out to us.)
In the case of this community zync user, they have their log level set to warn rather than the default info, so they'd never have had the Request completed message to pair with. They changed from the default log level precisely because their zync workload caused such an excessively high number of these chatty messages like Request completed that are benign for the myriad healthy executed queries, so once again, rather than having them go through a debug iteration where they turn the info logging back on just to get this detail, it'd be great if it could be provided up front.

May 31 '23 23:05 philrz

I don't think remote_addr belongs in the log error message here and the access log is the appropriate place. If we're going to add remote_addr why not include proto or host or url to every log message? The reason is that if you have too many fields in your log message it becomes increasingly difficult to visually parse.

In the case of this particular zync user worried about the chattiness of their logs (which I totally get), I'd suggest that they use a custom logger configuration and write the access logs to a separate file and keep the warn level logs in their console output.

EDIT I just checked a we don't currently have a way of writing separate loggers to different paths using a logger yaml config. IIRC we used to have this but the feature seems to have disappeared in one of the refactors. I think we should add this feature back.

Mar 01 '24 16:03 mattnibs