
[BUG] Error message "concurrent query failed"

Open Steuf opened this issue 3 years ago • 9 comments

Describe the bug I use chproxy with a redis cache (keydb in fact, but it's the same); chproxy sits between ClickHouse and a dataviz tool (Grafana). I randomly get the error "concurrent query failed" (and this error is cached :/).

This message is not very explicit; looking at the code, the error is fired by the cache code. What can be the origin of this error and how can I solve it?

Expected behavior No error, and the request doesn't fail

Environment information chproxy: 1.17.0 Clickhouse: 22.8.3.13 Keydb: v6.3.1

Kubernetes cluster

Steuf avatar Sep 06 '22 15:09 Steuf

Hi, this error means that you tried to run the same query twice at the same time (or ran the second query while the first one was still running). Instead of sending the same query to ClickHouse twice, chproxy runs the first one and makes the second one wait for its result, which is then served from the cache. But if the first query fails (for example because of a timeout), chproxy returns "concurrent query failed" for the second query, and it can keep returning the same answer for a while even if you run query 2 after query 1 has failed (it's a mechanism to prevent users from rerunning the same query too often while ClickHouse is under water).

mga-chka avatar Sep 06 '22 15:09 mga-chka

To follow up on the answer from @mga-chka , we can consider an enhancement to this mechanism: storing the error message of the originally failed query. It would be more intuitive to users and hide the technical detail of how we optimise execution of concurrent queries.

I'm marking this issue as an enhancement and will try to work on it in the near future.

gontarzpawel avatar Sep 06 '22 15:09 gontarzpawel

I also stepped on it. On top of the unclear error message, the result is cached for longer than the configured cache TTL.

For example, I'm setting 60s for the cache TTL and 30s for the max_execution_time setting. Then I'm running a long query and getting the error:

Received exception from server:
[ Id: 1712D2E3DCBFD77D; User "..."(1) proxying as "..."(1) to "..."(6); RemoteAddr: "..."; LocalAddr: ".."; Duration: 30030411 μs]: timeout for user "..." exceeded: 30s; query: "..."

After 60s (cache TTL) i have 2 keys in redis:

redis> keys *
1) "118713aa48f49bd1fa292673691d416d-transaction"
2) "b804d509c7159fa1923eb66f7609da0c-transaction"
redis> get 118713aa48f49bd1fa292673691d416d-transaction                                                                                       
"2"
redis> get b804d509c7159fa1923eb66f7609da0c-transaction                                                                                       
"1"
redis> TTL 118713aa48f49bd1fa292673691d416d-transaction                                                                                       
(integer) 172719
redis> TTL b804d509c7159fa1923eb66f7609da0c-transaction
(integer) 172714
redis>

Running the query again results in the error "concurrent query failed".

"2" means "unsuccessful query", I guess, but somehow this result is stored not for the cache TTL (60s) but for 86400*2 seconds, which is weird. I have max_execution_time set to 86400s for other users in chproxy, but not for the user I'm running the test queries as.

Also, caching the error is not always a good idea, because errors may happen for different reasons: a) the max memory limit for the CH instance is reached, but after a few minutes memory is released; b) execution speed is low due to a load spike, so CH decides the query won't fit into the max_execution_time limit (this behavior is controlled by the timeout_before_checking_execution_speed setting); ... and many others.

In these cases, when a client wants to retry the query after a while, it will constantly receive the error without a chance to retry, which is really bad.

IMO, error caching should be disabled, or at least there should be a setting to explicitly control this behavior.

PhantomPhreak avatar Sep 08 '22 08:09 PhantomPhreak

Hello @PhantomPhreak ,

Thanks for sharing. Keys with the suffix -transaction relate to the concurrent-transactions feature. When a query arrives at a 'cacheable' chproxy, it first starts a so-called transaction. The purpose of this transaction is to prevent the thundering-herd effect: concurrent requests for exactly the same query await the result of the computation triggered by the first request.

When ClickHouse responds to the first query, the key with the -transaction suffix is updated accordingly:

  • if it succeeded, as completed
  • if it failed, as failed

This information is kept in redis for 2 * grace_time or 2 * max_execution_time. It has no correlation with the cache expiration / TTL whatsoever.

Also, caching the error is not always a good idea, because errors may happen for different reasons: a) the max memory limit for the CH instance is reached, but after a few minutes memory is released; b) execution speed is low due to a load spike, so CH decides the query won't fit into the max_execution_time limit (this behavior is controlled by the timeout_before_checking_execution_speed setting); ... and many others.

Very relevant point. We've come to similar reflections, except we thought it is still worth protecting ClickHouse from fatal (unrecoverable) queries. We've introduced this idea in https://github.com/ContentSquare/chproxy/pull/193

The concept is to not mark the transaction as failed if the response received from ClickHouse is recoverable.

var clickhouseRecoverableStatusCodes = map[int]struct{}{http.StatusServiceUnavailable: {}}

Following your feedback, it may indeed be relevant to add HTTP status codes representing timeouts (408). Regarding the memory limit it's trickier to configure, because it isn't carried in a distinct HTTP status code but within a 500.

gontarzpawel avatar Sep 11 '22 12:09 gontarzpawel

to protect Clickhouse from fatal queries (unrecoverable).

This is a good point, but there is a big question of how to separate recoverable and unrecoverable queries. Also, if unrecoverable queries don't do any harm to ClickHouse itself, it may be safe to allow them and control the query execution rate via max requests per minute / max concurrent queries on the chproxy side, or even via quotas and query complexity settings in ClickHouse itself. We're doing that quite successfully with no problems.

Following your feedback, it may be indeed relevant to add http status codes representing timeout (408). Regarding memory limit it's more tricky to configure because it's not curried in a distinct http status code, but within 500.

I have to say that the scope of potential causes may be wider. Right now, for example, a query will be marked as failed if I just press Ctrl+C in clickhouse-cli during query execution, and if I repeat the query "as is" afterwards, it doesn't work, which is very annoying.

IMO, if a query failed and another query came after that, these are not concurrent queries, so the "new" query should not fail because of the concurrent-query protection logic; it simply isn't applicable in this case. These are sequential requests, not concurrent ones.

I may be missing some details of the implementation (sorry for any stupid proposals), but this behavior looks logical to me: all concurrent queries (those waiting during the grace_time) fail if the currently running query fails (with the error/timeout received from ClickHouse, not because the user cancelled the query), but subsequent queries (retries) are allowed to execute.

PhantomPhreak avatar Sep 13 '22 08:09 PhantomPhreak

Thanks for the update, guys. I have tested release 1.17.2; at first I got unreadable logs, which I think was caused by HTTP compression being activated on ClickHouse:

[screenshot 2022-09-19 18:59: garbled log output]

I have deactivated HTTP compression on CH to get "readable" errors, but now I just get "empty" errors :/ :

[screenshot 2022-09-19 19:01: empty error message]

Do you have an idea what could cause this "empty" reporting on concurrency errors?

NB: nothing in the ClickHouse logs :/

Steuf avatar Sep 19 '22 17:09 Steuf

I think I have found the pattern that generates this behavior.

Sometimes in Grafana, queries are immediately cancelled by the client and re-fired.

[screenshot 2022-09-20 14:23]

In this case it always generates the "concurrent query failed" error in chproxy.

Steuf avatar Sep 20 '22 12:09 Steuf

Yep, this is the issue @PhantomPhreak was mentioning (with the Ctrl+C). FYI, we are planning to modify this part (and at least let the chproxy admin customize it via config.yml), cf https://github.com/ContentSquare/chproxy/issues/230

mga-chka avatar Sep 20 '22 18:09 mga-chka

👍

Steuf avatar Sep 21 '22 13:09 Steuf

FYI, v1.18.0 should improve this issue, since a failing call is now cached for only 500 msec. cf https://github.com/ContentSquare/chproxy/pull/235

I'm closing this issue since the new behavior of chproxy (the root-cause error behind the "cached" error is shown, and the error is cached for only 500 msec to avoid the dogpile effect) should solve most cases. Feel free to reopen it if needed.

mga-chka avatar Oct 15 '22 18:10 mga-chka