clickhouse-go
Concurrency bug in read for cancel
From https://github.com/ClickHouse/clickhouse-go/issues/727#issuecomment-1239665912
We had an interesting scenario as more background for your investigation. It may or may not lead to the problem we had above. Here is the observation for one request (with the v1 driver):
- We set the timeout limit to 90s and sent the query to the clickhouse-go client.
- From the CH server query_log, we found the query arrived at the server (QueryStart event) about 89s later. We assume the clickhouse-go driver was waiting for an open connection (during a relatively busy period, all connections were in use).
- The clickhouse-go client timed out after 90s and went away (the CH server noticed that event).
- The CH server finished the query in 6+s, tried to send the results back, found the client had already left, and logged error code 210.

The reason I'd like to describe this scenario is that there might be a possibility in the v2 driver that the timed-out connection could be reused by a future request, while the connection didn't get "reset" when the client timed it out earlier. Is that possible? Thanks.
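For reference, a minimal sketch, using the v2 native interface, of how such a 90s budget can be expressed with a context deadline. Whether the same deadline also bounds the wait for a free pooled connection, as the timeline above suggests, is an assumption here, not something verified:

// conn obtained from clickhouse.Open(&clickhouse.Options{...}); error handling omitted.
ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
defer cancel()

var one uint8
// If the pool is saturated for ~89s, only ~1s of the budget remains for the
// query itself; the context then expires mid-query and the server later
// notices the client has gone (the 210 error code mentioned above).
_ = conn.QueryRow(ctx, "SELECT 1").Scan(&one)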
https://github.com/ClickHouse/clickhouse-go/issues/727#issuecomment-1251261325
@ns-gzhang I'm not clear how the connection can be reused, since on cancel we don't release it. Re https://github.com/ClickHouse/clickhouse-go/issues/727#issuecomment-1239665912
The connection is not released back into the idle pool since it has an error. Debugging the following shows a new connection is created.
func TestQueryCancel(t *testing.T) {
    env, err := GetNativeTestEnvironment()
    require.NoError(t, err)
    useSSL, err := strconv.ParseBool(GetEnv("CLICKHOUSE_USE_SSL", "false"))
    require.NoError(t, err)
    port := env.Port
    var tlsConfig *tls.Config
    if useSSL {
        port = env.SslPort
        tlsConfig = &tls.Config{}
    }
    conn, err := GetConnectionWithOptions(&clickhouse.Options{
        Addr: []string{fmt.Sprintf("%s:%d", env.Host, port)},
        Auth: clickhouse.Auth{
            Database: "default",
            Username: env.Username,
            Password: env.Password,
        },
        Compression: &clickhouse.Compression{
            Method: clickhouse.CompressionLZ4,
        },
        TLS:          tlsConfig,
        MaxOpenConns: 1,
        MaxIdleConns: 1,
        Debug:        true,
    })
    require.NoError(t, err)
    // Issue a query which will take 3 secs, cancel it via a short context deadline (100ms),
    // then reissue another 3-sec query on a fresh context - check the response belongs to
    // query 2, not query 1.
    ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(time.Millisecond*time.Duration(100)))
    defer cancel()
    var queryId uint16
    // query 1 - expected to fail with a context deadline error
    err = conn.QueryRow(ctx, "SELECT sleep(3) + 1 as query_id").Scan(&queryId)
    require.Error(t, err)
    // query 2 - must return its own result, not a stale result from query 1
    err = conn.QueryRow(context.Background(), "SELECT sleep(3) + 2 as query_id").Scan(&queryId)
    require.NoError(t, err)
    require.Equal(t, uint16(2), queryId)
}
OK, release is called when the deadline is exceeded, but it is not pushing the connection back into the pool because of
    if err != nil || time.Since(conn.connectedAt) >= ch.opt.ConnMaxLifetime {
i.e. err is context.DeadlineExceeded.
Maybe I can simulate a situation where this occurs.
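A simplified, illustrative sketch of the release path being described - the method and field names are paraphrased from the snippet above, not the driver's actual code:

// Illustrative only: how a release check like the one quoted above keeps an
// errored connection out of the idle pool.
func (ch *clickhouse) release(conn *connect, err error) {
    // err is context.DeadlineExceeded for the cancelled query, so the
    // connection is closed instead of being returned for reuse.
    if err != nil || time.Since(conn.connectedAt) >= ch.opt.ConnMaxLifetime {
        conn.close()
        return
    }
    ch.idle = append(ch.idle, conn) // healthy connection: available for reuse
}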
Same test using the std interface and OpenDB - the connection is successfully marked as Bad since it's a timeout.
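A rough sketch of what that std-interface variant looks like - the test name, address, and pool settings here are placeholders, not the exact test referenced above; clickhouse.OpenDB returns a *sql.DB:

func TestQueryCancelStd(t *testing.T) {
    db := clickhouse.OpenDB(&clickhouse.Options{
        Addr: []string{"127.0.0.1:9000"}, // placeholder address
    })
    db.SetMaxOpenConns(1)
    db.SetMaxIdleConns(1)

    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()

    var queryId uint16
    // Query 1 times out; database/sql should mark the connection bad and discard it.
    err := db.QueryRowContext(ctx, "SELECT sleep(3) + 1 as query_id").Scan(&queryId)
    require.Error(t, err)
    // Query 2 runs on a fresh connection and must see its own result, not query 1's.
    err = db.QueryRowContext(context.Background(), "SELECT sleep(3) + 2 as query_id").Scan(&queryId)
    require.NoError(t, err)
    require.Equal(t, uint16(2), queryId)
}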

@ns-gzhang how are the timeouts occurring? I'm using a context, so maybe this is a server setting?
Thank you Dale for debugging this. It's been a challenge to recreate. We've also been putting a lot of effort into recreating it, both in a similar environment and in a controlled environment, and we have not been successful up to this point. So the trouble spot is only our speculation from some external observations when the problem occurred. In a controlled environment, I have tried a few different timeout combinations, including:
- the context deadline exceeded in the driver sooner than on the server (the main scenario);
- the app caller went away without fetching the rows (due to a context timeout in the caller after the return from the client driver);
- the timeout occurred after rows were returned to the caller, and the caller was trying to fetch the result when the connection timed out.

The problem occurred during heavy load, and the server seemed to finish the query execution after the client had already gone away. A new victim query then came along and got that result back. That's why we suspect a cancellation problem. I don't know under what conditions a connection would be returned to the pool and get reused, and what kind of race conditions are possible. Is there a possible race condition across the standard database/sql and driver boundaries (the order of marking a connection free vs. marking an error vs. reuse)? Just some food for thought.
Are you possibly sharing the same context across two queries? I could see how that could cause an issue on cancel. Specifically:
- query q1 is cancelled on timeout. It tries to write the cancel signal.
- As q1 is writing its cancel signal, q2 modifies the context marking it as not Done() (not sure where this will happen - I need to dig)
- q1 completes cancel and then drops into https://github.com/ClickHouse/clickhouse-go/blob/main/conn_process.go#L68 returning nil (as ctx is no longer finished).
I'll see if I can reproduce but it sounds possible.
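For illustration, the kind of context sharing being asked about would look roughly like this in caller code (a hypothetical pattern using the native conn from the test above, not code from the report):

// Anti-pattern: one context (and one deadline) shared by two queries.
ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
defer cancel()

var a, b uint8
// q1 may hit the shared deadline and write its cancel signal...
_ = conn.QueryRow(ctx, "SELECT sleep(3) + 1").Scan(&a)
// ...while q2 reuses the very same ctx, so the cancellation state of the two
// queries is entangled. A per-query context (context.WithTimeout per call)
// keeps them independent.
_ = conn.QueryRow(ctx, "SELECT sleep(3) + 2").Scan(&b)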
@ns-gzhang I don't think it's reuse of a context - the client guards against this. I'm going to put up a branch that closes the connection on cancel. Can you test it in your environment? If it resolves the issue we'll close this - frustrating if we can't reproduce, however.
@ns-gzhang can you test https://github.com/ClickHouse/clickhouse-go/pull/764 and see if it runs without error? If it does, we will merge.
Thanks Dale. We will keep trying to recreate it and can only test after recreation. Will let you know. Thanks.
Any chance you could try the branch https://github.com/ClickHouse/clickhouse-go/tree/issue_727 @ns-gzhang and see if this solves it? I've added an explicit close on cancel.
@gingerwizard question: can a canceled connection be reused? If yes, then I'm thinking of one scenario: the server sends the result at the same time the client sends the cancel, and they cross over. The client connection buffer would then contain the result. The connection is canceled but later reused for another request without cleaning? (This was the scenario implicitly implied during the discussion of #727.) In that case, your explicit close will definitely fix the problem!
No, a cancelled connection is marked as dirty and would not be reused in the current approach. It would require two queries to effectively take the same connection from the pool; dial and acquire prevent this - a new connection is created if we are below max open connections, otherwise one is taken off the idle pool. All buffers are local to the connection.
The only real difference with the new code is that the buffer is cleared and the underlying connection is explicitly closed before it is returned. I was hoping this might resolve issues at a lower level, where connections might be reused.
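As a generic illustration of that approach - not the actual driver change, just a self-contained sketch using net.Conn and a hypothetical release callback:

// On cancellation, close the socket instead of returning it to an idle pool,
// so a later query can never read a stale, buffered result from it.
func finish(ctx context.Context, netConn net.Conn, release func(net.Conn)) error {
    if err := ctx.Err(); err != nil {
        // Cancelled or deadline exceeded: bytes from the aborted query may
        // still be in flight or buffered on this connection - discard it.
        _ = netConn.Close()
        return err
    }
    release(netConn) // healthy connection: safe to reuse
    return nil
}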
OK, we are using the database/sql standard interface for the most part, and it looks like the driver maintains another level of connection pool. Not sure if two connection pools at two levels could cause problems? Here is what database/sql says:
If the lifecycle of a connection is different at the two levels, could there be some state that does not satisfy the invariants?
If cancel is not the path to the trouble, under what conditions could the database/sql standard library release a Conn and have it reused without the connection being cleaned? (ResetSession is called to reset a connection that was used before, but in the driver code it does nothing if the connection is not marked bad.)
So I was referring to the native interface, @ns-gzhang. It's possible there is something in the indirection of database/sql that makes this possible.
We could close the connection on ResetSession. Are you able to test these changes under your production load? Otherwise we're making changes and speculating without a reproducible test case.
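For context, database/sql exposes two optional hooks a driver can use for exactly this: driver.SessionResetter and driver.Validator. A minimal sketch of "invalidate on reset" with a hypothetical connection type (not the clickhouse-go implementation):

package chdriver // hypothetical package, illustration only

import (
    "context"
    "database/sql/driver"
)

// stdConn is a hypothetical driver connection that remembers it went bad.
type stdConn struct {
    bad bool // set when a query on this connection was cancelled or errored
}

// ResetSession is called by database/sql before a pooled connection is reused.
// Returning driver.ErrBadConn makes database/sql discard it and open a new one.
func (c *stdConn) ResetSession(ctx context.Context) error {
    if c.bad {
        return driver.ErrBadConn
    }
    return nil
}

// IsValid (driver.Validator) lets database/sql avoid putting a broken
// connection back into its pool at all.
func (c *stdConn) IsValid() bool {
    return !c.bad
}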
@ns-gzhang did the branch version with connection cancellation work?
Hi Dale @gingerwizard, I have not been able to recreate it in a controlled environment yet, so I couldn't verify. We will keep trying... It's important to us (and others). Thanks.
@ns-gzhang feel free to reopen this issue if the issue is still reproducible. thank you
Sure. Thanks @gingerwizard @jkaflik and all for your support and changes made. Really appreciate it.
@ns-gzhang note that we did merge the connection reset code in the event of an error. If you update your driver in production, maybe you can tell us whether it still occurs? This hasn't been released yet, but I will ping you when it is.
Thanks again. I saw that merge. We tried hard to simulate the possible scenarios in our controlled environments using exactly the same CH server version and client-driver version, but I couldn't recreate it. I still suspect that concurrency between the standard database/sql interface and the driver played some role. Friends at Altinity also helped look into this.