
Unnecessary debug/error logged during idle connection teardown

strawgate opened this issue 1 year ago

When an idle connection is torn down by the beat, Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection is logged by github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read

Here: https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54

This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

When the connection is torn down, this receives an errNetClosing instead of an EOF (likely because we are the ones closing the connection, not the server). This code path only checks for EOF, so the "use of closed network connection" error gets debug-logged during the read.
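
For reference, the error in question is net.ErrClosed ("use of closed network connection", exported by the net package since Go 1.16), so the check could in principle distinguish it from a genuine network failure. A minimal sketch of that distinction, not a proposed library change:

// Inside loggingConn.Read: skip logging both an orderly EOF and a close we
// performed ourselves (idle teardown); anything else is still debug-logged.
if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, net.ErrClosed) {
	l.logger.Debugf("Error reading from connection: %v", err)
}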


This error message “pollutes” our debug logs and leads customers to believe there is a network issue.

strawgate avatar Sep 13 '24 19:09 strawgate

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Sep 13 '24 19:09 elasticmachine

This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

This error does not happen only in this situation; it can also happen when the connection is legitimately interrupted by either side. I don't think it's a good idea to fully conceal this error: it would make it difficult to find some networking issues (proxy?) on the customer's side.

So, we cannot simply change https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54 to ignore the error. If we want to address this issue properly, we need to find a way to handle this error in those "keep alive session maintenance" requests instead. The error is simply returned from the Read function, so it can be handled a level higher.

rdner avatar Oct 23 '24 09:10 rdner

I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

For example, we could add a new function Connection() net.Conn to the logging wrapper that returns the underlying connection, which could then be used without logging. That would require some ugly type casting though, so I'm not sure.
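
A rough sketch of that idea (Connection() is hypothetical and does not exist in elastic-agent-libs today); the caller-side type assertion is the ugly casting mentioned above:

// Connection returns the wrapped net.Conn so callers that don't want
// per-read debug logging (e.g. keep-alive traffic) can bypass the wrapper.
func (l *loggingConn) Connection() net.Conn {
	return l.Conn // loggingConn embeds net.Conn, so this is the raw connection
}

// Caller side: unwrap only if the connection really is the logging wrapper.
if lc, ok := conn.(interface{ Connection() net.Conn }); ok {
	conn = lc.Connection()
}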

rdner avatar Oct 23 '24 09:10 rdner

I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

The keepalives are most probably on the connections to the Elasticsearch _bulk endpoint in the Beats. That is pretty much the only network connection they regularly make.

The idle connection timeout is managed within the net/http implementation: https://cs.opensource.google/go/go/+/master:src/net/http/transport.go;drc=77e42fdeaf98d241dc09f4eb92e09225aea8f4c3;l=1090
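
For context, that timeout is the IdleConnTimeout field on net/http's Transport; http.DefaultTransport sets it to 90 seconds, after which the pool closes the connection itself. A sketch only to show where the knob lives, not code from Beats:

tr := &http.Transport{
	// After this long without traffic, net/http closes the pooled connection
	// on its own; the next read through the logging wrapper then sees
	// "use of closed network connection".
	IdleConnTimeout: 90 * time.Second, // same value http.DefaultTransport uses
}
client := &http.Client{Transport: tr}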

cmacknz avatar Oct 23 '24 21:10 cmacknz

This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

@strawgate sounds like you spent some time debugging it, could you give us some pointers? Where did you see this HTTP request and how did you read the buffer?

rdner avatar Oct 24 '24 07:10 rdner

If you set a breakpoint at https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50, wait for it to hit, and then step out once, you'll land in net/http/transport.go at Read, where you can read the buffer in the VS Code debugger under p (see the attached screenshot).

Example: "HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\nContent-Type: application/json\r\nX-Cloud-Request-Id: hx4DEVDVTJettHh_3xDChw\r\nX-Elastic-Product: Elasticsearch\r\nX-Found-Handling-Cluster: 856e0d3694624bf18dffce04dd4505f0\r\nX-Found-Handling-Instance: instance-0000000002\r\nDate: Thu, 24 Oct 2024 14:24:19 GMT\r\nContent-Length: 124\r\n\r\n\x1f\x8b\b\x00\x00\x00\x00\x00\x00\x00\xec\x971\n\x80@\f\x04\xff\x92\xfa\n\xb5\xbc\xaf\x88\xc5!\x11\x04EHbu\xe4\xef\xde'\xec\xa6L҅dw\xa7\x8b\x9a=\xe6R\x8fv\xb9\x169C\xefQ\xad]v\xd3\x16*\xb5\x8bG\x8bw4\x97i\xce,L\xd8\x01w\xc0/\xa0\ah\"\xbe\x807\x92\x0f\xc8H\xe4D\xb22\xbc\x00\x19AF\x90\x11d\xf4\v\x19m\xf9\x01\x00\x00\xff\xff\x03\x00PY\xab8\xac\x11\x00\x00\xcd1f-\x0e\f\x7f\x1e\xd8\xd1\x04\xc5\xfa\xbaTNXoN6\xac\xc1\t\xffvˋ\xb9\xae\xf0\xd7\xc7\x04\xaa\xfe\x12r\xeb(\xf3\x1e\xd8\xf7\xf4\xc7\xe7\x04\xfadt\xd2*\xa1Fǎϩd\xc15\x92k\x94J\xd4?\x8d\x13...+3584 more"

strawgate avatar Oct 24 '24 14:10 strawgate

When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero n indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.

https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50-L52

func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	if err != nil && !errors.Is(err, io.EOF) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}
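
For illustration, the variant described above (only log when the read returned no data at all) might look roughly like this; as noted further down in the thread, it did not turn out to fix the issue:

func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	// Skip logging when some bytes were read: the connection closing right
	// after a successful read isn't interesting yet.
	if err != nil && n == 0 && !errors.Is(err, io.EOF) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}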

cmacknz avatar Oct 24 '24 20:10 cmacknz

When an idle connection is torn down by the beat, Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection is logged by github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read

Hello @strawgate, could you list the steps to reproduce this error? This is needed to test the fix, thanks.

khushijain21 avatar Nov 04 '24 10:11 khushijain21

I believe all you need to do is start a relatively new metricbeat (I was using main) with default settings (enable the system module) and with debug logging enabled (logging.level: debug), and this will get logged continuously.

strawgate avatar Nov 04 '24 16:11 strawgate

When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero n indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.

I tried this locally, but it still throws an error. The buffer may or may not be empty when the keep-alive session attempts to read; it is not guaranteed.

I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

The logger wrapper is attached to connections made to ES: https://github.com/khushijain21/beats/blob/main/libbeat/esleg/eslegclient/connection.go#L164. I think we risk losing "all" connection details if we remove it.

Any suggestion is appreciated.

khushijain21 avatar Nov 05 '24 15:11 khushijain21

Just to summarize my thoughts on this issue:

  1. We should definitely keep the log wrapper; we should not consider removing it.
  2. The log message must stay because it's generic: if the connection is closed because of a server crash, we still need to log it.
  3. The problem here is not the error message but that the connection now has an idle timeout that closes the connection, and a subsequent read then tries to use a closed connection. See where it was introduced for more details; this is what we need to find a fix for.

rdner avatar Nov 07 '24 15:11 rdner

After looking into the internals, @khushijain21 and I decided to just remove the log line and let the consumer of the Read function handle the error (which would probably be logged where necessary).
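
Roughly, that reduces Read to a thin pass-through; a sketch of the intent, not necessarily the exact change that landed:

func (l *loggingConn) Read(b []byte) (int, error) {
	// No debug logging here anymore; the caller decides whether the error
	// (io.EOF, net.ErrClosed, or anything else) is worth reporting.
	return l.Conn.Read(b)
}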

rdner avatar Nov 13 '24 15:11 rdner