dbt-clickhouse icon indicating copy to clipboard operation
dbt-clickhouse copied to clipboard

Retries should retry on known retriable database errors in addition to connection errors

Open andaag opened this issue 2 years ago • 4 comments

Describe the bug

Based on this description:

retries: [1] # Number of times to retry a "retriable" database exception (such as a 503 'Service Unavailable' error)

I was kinda expecting retries to also retry failures in executing queries. Such as:

DB::Exception: Timeout: Cannot enqueue query on this replica, most likely because replica is busy with previous queue entries.

But looking at the code it looks like retries is set on retry_connection from SQLConnectionManager, which retries on connection failures, not on query failures.

Is this intended? Am I understanding it wrong? This might be a bit of a mix between either a bug report or a feature request, depending on what the intended behavior is! 😀

andaag avatar Jun 29 '23 15:06 andaag

But how do you know if some part of your data is already inserted or not? Making retries in a database that does not have transactions is pretty dangerous, you can end up with duplicates. We had a lot of problems until we disabled retries in code. So our rule of thumb when we write code - all tasks should be idempotent, if cancel task in the middle and rerun it multiple times we should have correct data

simpl1g avatar Jun 29 '23 16:06 simpl1g

You understand it correctly. As you noted, the documentation should be more explicit in that retries only refers to connection failures in cases where it's guaranteed that a retry will not insert duplicates. If ClickHouse returns a "retriable" HTTP status code (in particular, 429, 502, or 503) the query will be retried up to the "retries" configuration value.

As @simpl1g suggested, we don't retry database errors. In theory we could -- the difficulty is in determining what ClickHouse errors are retriable. First, ClickHouse doesn't currently return a separate error code -- it has to be parsed from the content of the error message. Second, there are something like 1000 ClickHouse error codes and there's no definitive list of what is "retriable" and what is not.

That said, this is a potential feature request -- which is probably most appropriate for the underlying drivers (clickhouse-connect and clickhouse-driver).

genzgd avatar Jun 29 '23 18:06 genzgd

But how do you know if some part of your data is already inserted or not? Making retries in a database that does not have transactions is pretty dangerous, you can end up with duplicates. We had a lot of problems until we disabled retries in code.

By specifically retrying error codes where we know it hasn't been applied. We do this with other services today, and based on the description in the readme that's what I assumed this project did too!

First, ClickHouse doesn't currently return a separate error code -- it has to be parsed from the content of the error message. Second, there are something like 1000 ClickHouse error codes and there's no definitive list of what is "retriable" and what is not.

Probably makes sense to do this on a driver level, and have a whitelist of error codes that are known to be retriable. 👍

andaag avatar Jun 30 '23 06:06 andaag