
Deduplication doesn't seem to work for ReplicatedMergeTree tables

Open malikas05 opened this issue 1 year ago • 0 comments

In issue #1406, I asked how to handle retries. After some discussion, I learnt that data blocks are deduplicated for ReplicatedMergeTree tables. Since we have Distributed tables in front of our local tables, we revised their sharding keys to use table columns so that the same data is always routed to the same replicas. However, I later found that retrying a write against the same replica is not actually required for the data block to be deduplicated, because the block checksums are stored in ZooKeeper (see https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication):

Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application does not know if the data was written to the DB, so the INSERT query can simply be repeated. It does not matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings.
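
To make the expectation concrete, here is a minimal JDBC sketch of the behaviour described in that quote (the table, column, and endpoint names are made up for illustration and are not from our production code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DedupCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:clickhouse://localhost:8123/default"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "default", "")) {
            String sql = "INSERT INTO stats_local (id, value) VALUES (?, ?)";
            // Send the exact same batch twice.
            for (int attempt = 0; attempt < 2; attempt++) {
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    // Same rows, same order, same batch boundaries on both attempts,
                    // so both INSERTs should produce the same block checksum.
                    for (int i = 0; i < 1000; i++) {
                        ps.setInt(1, i);
                        ps.setLong(2, i * 10L);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
            }
            // Expectation per the quoted docs: the table ends up with 1000 rows, not 2000,
            // because the second identical block is ignored by the server.
        }
    }
}
```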

We have again encountered a situation in production that resulted in duplicate data being persisted. The writes failed because of a rolling restart of the ClickHouse servers (we have 3 shards with 2 replicas each); our service, which uses the ClickHouse JDBC client, reported that the write was not successful, so we retried until it succeeded:

09:54:36.562 [***-writer-0] WARN  *** - Failed to write 69177 of 69177 stats due to recoverable error; will retry
***: java.sql.BatchUpdateException: Code: 242, e.displayText() = DB::Exception: Table is in readonly mode (zookeeper path: /clickhouse/tables/***-2/***) (version 21.8.14.5 (official build))
, server ClickHouseNode [uri=https://***:8124/***, options={sslrootcert=***.crt,use_server_time_zone=false,use_time_zone=UTC}]@1814630837
...
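
For reference, our retry logic is roughly like the following sketch (simplified; table, column, and method names are illustrative). The same in-memory rows are re-sent in the same order on every attempt, which is why we expected block-level deduplication to absorb the retries:

```java
import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchWriter {
    public void writeWithRetry(Connection conn, List<long[]> rows, int maxAttempts) throws Exception {
        String sql = "INSERT INTO stats_distributed (id, value) VALUES (?, ?)";
        for (int attempt = 1; ; attempt++) {
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (long[] row : rows) {           // identical rows, identical order, every attempt
                    ps.setLong(1, row[0]);
                    ps.setLong(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
                return;                             // success
            } catch (BatchUpdateException e) {      // e.g. Code 242: table is in readonly mode
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(1_000L * attempt);     // simple backoff before retrying the whole batch
            }
        }
    }
}
```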

We were hoping that, since we retry with the same data, deduplication would safeguard against duplicates; however, this does not seem to be the case. To check whether deduplication kicked in, we reviewed the server logs for entries similar to the following, but found none:

..17:52:45.064974.. Block with ID all_7615936253566048997_747463735222236827 already exists locally as part all_0_0_0; ignoring it.

Is it possible that, because the table on the replica that received the initial write was in readonly mode, the replica accepted and saved the data (but had not replicated it to the other replicas yet) without recording the data block checksum in ZooKeeper? In that case, when we retried, the data would have been written again, even though it was directed to the same replica (and, as I understood it, the same replica is not required for retries), because there was effectively no checksum for this block. If that is not possible, what else could be the culprit here?

Do you have any other suggestions on how to handle retry scenarios to achieve reliable, effectively atomic writes? We cannot use transactions, as I believe they do not support replicated tables, so the only reliable option for us is to retry the whole batch: as we understood from the previous conversation, the ClickHouse JDBC client does not provide strong guarantees about which parts of a batch succeeded or failed. I also stumbled upon the insert_quorum setting (https://clickhouse.com/docs/en/operations/settings/settings#insert_quorum), but I believe it would affect insert performance; see the sketch below.
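
If we did go the insert_quorum route, this is roughly how I imagine passing it (and insert_deduplicate) as server settings through the JDBC driver. This is an assumption on my part: I believe the 0.4.x driver accepts a custom_settings connection property with comma-separated key=value pairs, but I have not verified it in production:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class QuorumInsertExample {
    public static Connection open() throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "default");
        props.setProperty("password", "");
        // Assumption: server-side settings applied to queries on this connection,
        // passed as comma-separated key=value pairs.
        props.setProperty("custom_settings", "insert_quorum=2,insert_deduplicate=1");
        return DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default", props);
    }
}
```

With insert_quorum=2 each INSERT only acknowledges after the data reaches two replicas, which is the performance trade-off I was worried about.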

Environment

  • Client version: 0.4.6
  • Language version: Java 17

ClickHouse server

  • ClickHouse Server version: 21.8.14

malikas05 · Apr 17 '24 18:04