elastic icon indicating copy to clipboard operation
elastic copied to clipboard

tlsHandshakeTimeoutError potentially invoking unwanted IsContextErr flow?

Open CarloDeSutter opened this issue 5 years ago • 2 comments

Please use the following questions as a guideline to help me answer your issue/question without further inquiry. Thank you.

Which version of Elastic are you using?

[ ] elastic.v7 (for Elasticsearch 7.x) [X ] elastic.v6 (for Elasticsearch 6.x) [ ] elastic.v5 (for Elasticsearch 5.x) [ ] elastic.v3 (for Elasticsearch 2.x) [ ] elastic.v2 (for Elasticsearch 1.x)

This might be important for this issue: sniffing is disabled.

Please describe the expected behavior

The tlsHandshakeTimeoutError might have to be reassessed, it might be better in this case to treat is an error that results in retrying followed by another connection selection, so it doesn't immediately give up.

Please describe the actual behavior

The client is performing regular search requests by selecting one of the connected nodes as known by the ES client. However, when one of the nodes drops and returns, there might be a small time window where the following happens:

  1. A client performs another request and selects that returned node
  2. A TLS handshake timeout error occurs.
  3. The following code snippet is activated when the following error occurs:

client.go, method PerformRequest(ctx context.Context, opt PerformRequestOptions) invokes

func IsContextErr(err error) bool {
	if err == context.Canceled || err == context.DeadlineExceeded {
		return true
	}
	// This happens e.g. on redirect errors, see https://golang.org/src/net/http/client_test.go#L329
	if ue, ok := err.(*url.Error); ok {
		if ue.Temporary() {
			return true
		}
		// Use of an AWS Signing Transport can result in a wrapped url.Error
		return IsContextErr(ue.Err)
	}
	return false
}

More detailed: the error is evaluated as *url.Error and the return true path is executed because the tlsHandshakeTimeoutError struct implements Temporary() as always returning true.

The issue here is that we believe these lines of code were not intended for this type of error (tlsHandshakeTimeoutError). It will now immediately give up the request and not retry, let alone retry followed by selecting another connection.

Any steps to reproduce the behavior?

What I have done to mimic this:

  1. Add https://localhost with a self-signed certificate as one of the configured ES nodes, implementing a mocked version of one of the POST search query involved (could be any), plus a HEAD so health checks can be enabled.
  2. Spin an es client with one valid ES node, one localhost node URL configured.
  3. Let it perform its health checks
  4. Perform the POST call a few times, this should use each node at will and both should work.
  5. Shutdown and reboot the localhost process
  6. Perform the POST call, the client will now initiate another TLS handshake session for the revived localhost node, break on line if trace != nil && trace.TLSHandshakeStart != nil { for at least 10s exceeding the default Transport TLShandshake timeout.
  7. The error tlsHandshakeTimeoutError will be sent back by the code all the way up to the es client code.
  8. The IsContextErr() flow as documented above will be executed.
  9. No other node will be selected and this request is given up.

Another way (but likely a bit harder) is to restart a real node in the ES cluster added by the scenario that a TLS handshake timeout error occurs. We have not yet tried this approach though we think this scenario should come very close if not equal to the actual scenario we see happening in our live environments.

EDIT: a more simple way (this will only show the code flow though, not actually reproduce end-to-end) is to set a very low timeout before performing the request:

c.c.Transport = &http.Transport{
	TLSHandshakeTimeout: 1 * time.Microsecond,
}

Just adding it in for the sake of being complete.

CarloDeSutter avatar Mar 03 '20 09:03 CarloDeSutter

Any news or info on this ?

bm-carlodesutter avatar Jul 16 '20 13:07 bm-carlodesutter

Any updates on this?

akr-rzp avatar May 27 '22 12:05 akr-rzp