m3 icon indicating copy to clipboard operation
m3 copied to clipboard

M3Query: User errors in Query should not be logged as error level logs

Open asafm opened this issue 4 years ago • 7 comments

Today in M3Query, user can submit queries with faults:

  • Parse errors: error":"1:110: parse error: missing unit character in duration
  • PromQL.Walk errors: promql.Walk: unhandled node type *parser.SubqueryExpr, ...
  • Deadline issues: context deadline exceeded
  • Connection dropped by client: failed to render results...write tcp 1.1.1.1:7201->2.2.2.2:46908: write: connection reset by peer

Since they are user-input based errors it makes sense to return the error back to the user as a response - as M3Query does now - but avoid logging this as ERROR level log, since it's not an error of m3query - not an actionable by the operator - but an error of the user - legitimate request so to speak.

I suggest to change the logging of those errors to debug and I wanted committer opinion on this before working on this.

asafm avatar Jul 14 '21 07:07 asafm

@asafm yeah this seems like a reasonable thing to do. Before you dive in, curious on what log / where you plan on implementing this change. Mostly want to make sure we're in agreement that all logs going through that point would actually merit being INFO (or DEBUG) rather than ERROR.

wesleyk avatar Jul 21 '21 17:07 wesleyk

Thanks for responding @wesleyk . In the following comments I'll list the positions I wish logging level to change:

https://github.com/m3db/m3/blob/82b342908e4cf3a280e9898457381e1cfdc7634d/src/query/api/v1/handler/prom/read.go#L122-L129

Example error:

"msg":"error creating query","error":"1:54: parse error: missing unit character in duration"
context deadline exceeded

This should change from error to debug, since as as an operator you don't need to know about every fault in query composition your user has made

asafm avatar Jul 25 '21 14:07 asafm

https://github.com/m3db/m3/blob/82b342908e4cf3a280e9898457381e1cfdc7634d/src/query/api/v1/handler/prometheus/native/read.go#L136-L150

Example error:

"msg":"m3 query error","error":"promql.Walk: unhandled node type *parser.SubqueryExpr, avg(net_bytes_recv{host=~\"vkapp1\",interface=~\"(all|eth0|eth1)\"})[5m:10s]"

Here I think we can just straight out change log level to debug, since there might be errors which are unrelated to parsing or execution? We can refactor to differentiate between error emanating from the query it self, to unexpected errors not related to the query at all. Here I need a bit of your help to understand the course of action.

asafm avatar Jul 25 '21 14:07 asafm

https://github.com/m3db/m3/blob/82b342908e4cf3a280e9898457381e1cfdc7634d/src/query/api/v1/handler/prometheus/native/read.go#L213-L219

Example error:

"msg":"failed to render results","trace_id":"1bb297dd289e23c8","span_id":"1bb297dd289e23c8","rqID":"857da2be-ee0a-4dea-a46e-1f45c4317294","error":"write tcp 10.9.70.214:7201->10.9.25.73:46908: write: connection reset by peer"

Here it depends on the error. The error I've seen for example, which shouldn't be logged is "connection reset by peer", since it's perfectly find for a client to get disconnected. Here I would make sure the exception is of the network error type and make sure it's a connection reset by peer, and if so, only log it as debug. I would actually apply it to any place in the code which serves HTTP and is doing logging, but we can start with this location.

WDYT @wesleyk ?

asafm avatar Jul 25 '21 14:07 asafm

@wesleyk

asafm avatar Aug 04 '21 10:08 asafm

hey sorry for the delay, do you want to go ahead and put out a PR for this? I would prefer we keep these as INFO though, rather than DEBUG, as they can be helpful to reference

wesleyk avatar Aug 19 '21 15:08 wesleyk

INFO sounds fair for those types. I will write the PR for this

asafm avatar Oct 19 '21 07:10 asafm