
ClickHouse Operator leaves orphan S3 files when scaling down replicas that use S3-backed MergeTree


When scaling down replicas, clickhouse-operator does not ensure that the S3 files backing MergeTree tables are fully deleted. This results in orphan files in the S3 bucket. This behavior was tested using ClickHouse 24.3.2.3 and clickhouse-operator 0.23.3.

Here's how to reproduce in general, followed by a detailed scenario.

  1. Create a ClickHouse cluster with two replicas (replicaCount=2) with a storage policy that allows data to be stored on S3.
  2. Run DDL to create a replicated table that uses S3 storage (a sketch of such a table appears after this list).
  3. Add data to the table.
  4. Confirm that data is stored in S3.
  5. Change the replicaCount to 1 and update the CHI resource definition.
  6. Drop the replicated table on the remaining replica.
  7. Check data in the S3 bucket. You will see orphan files.
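For steps 2 through 4, the sketch below shows the general shape of the DDL and checks. It is illustrative only: the table name, ZooKeeper path, cluster macro, and the storage policy name (assumed here to be s3) are placeholders; the linked examples define their own.

-- Step 2: replicated table whose parts are written to the S3-backed disk.
-- Assumes a storage policy named 's3' already exists in the server config.
CREATE TABLE test_s3_local ON CLUSTER '{cluster}'
(
    id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/test_s3_local', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3';

-- Step 3: add some data.
INSERT INTO test_s3_local
SELECT number, randomPrintableASCII(100) FROM numbers(1000000);

-- Step 4: confirm that active parts landed on the S3-backed disk.
SELECT disk_name, count() AS parts, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'test_s3_local' AND active
GROUP BY disk_name;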

To reproduce in detail, use the examples in https://github.com/Altinity/clickhouse-sql-examples/tree/main/using-s3-and-clickhouse. A detailed script follows.

# Grab sample code. 
git clone https://github.com/Altinity/clickhouse-sql-examples
cd clickhouse-sql-examples/using-s3-and-clickhouse
# Generate S3 credentials in a secret. (See script header for instructions.)
./generate-s3-secret.sh
# Create the cluster. 
kubectl apply -f demo2-s3-01.yaml
# Wait for both pods to come up, then run the following commands. 
./port-forward-2.sh
alias cc-batch='clickhouse-client -m -n --verbose -t --echo -f Pretty'
cc-batch < sql-11-create-s3-tables.sql
cc-batch < sql-12-insert-data.sql
cc-batch < sql-03-statistics.sql
# Check the data in S3 using a command like the following. Note the number of objects. 
# Run this command until the number of S3 files stops growing; the sample inserts go through
# a distributed table, so parts may continue to arrive for a short while after the INSERT returns.
# In my sample runs I get 3392 files and 4.3 GiB of data stored in S3.
aws s3 ls --recursive --human-readable --summarize s3://<bucket>/clickhouse/mergetree/
# Scale down the replicaCount from 2 to 1 and apply. 
kubectl edit chi demo2
# Check the data in S3 again. It should not have changed.  
aws s3 ls --recursive --human-readable --summarize s3://<bucket>/clickhouse/mergetree/
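You can also cross-check the bucket listing from inside ClickHouse. Recent versions expose the remote objects a server references in system.remote_data_paths; a query along the following lines (a sketch, assuming that table is available in your build) will show a smaller total than the bucket listing once the second replica is gone, and the difference corresponds to the objects nobody references anymore.

-- Count the remote objects this server still references on S3-backed disks.
SELECT disk_name,
       count() AS referenced_objects,
       formatReadableSize(sum(size)) AS referenced_bytes
FROM system.remote_data_paths
GROUP BY disk_name;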

You can now prove that S3 files are orphaned and see which ones they are. One way is as follows.

  1. On the remaining ClickHouse server run truncate table test_s3_direct_local;
  2. Check the S3 files. About half of them remain; these are the objects left behind by the departed replica. In my sample runs there were 1707 files and 2.1 GiB of data remaining. A sketch for listing the keys that are still referenced follows this list.
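One way to enumerate the orphans (a sketch, again relying on system.remote_data_paths) is to dump the object keys the surviving replica still references and diff them against the bucket listing; anything in the aws s3 ls output that is missing from the dump is orphaned.

-- Dump the remote object keys this server still references.
-- Keys present in the bucket but absent from this file are orphans.
SELECT remote_path
FROM system.remote_data_paths
ORDER BY remote_path
INTO OUTFILE 'referenced_keys.tsv'
FORMAT TSV;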

hodgesrm (Apr 07 '24 21:04)

It appears that one workaround for this problem is to drop tables explicitly before decommissioning the replica. For example, you can log in to the departing replica and issue the following command:

DROP TABLE test_s3_direct_local SYNC

It's unclear whether SYNC is strictly required, since it is not covered in the official docs, but the Altinity KB indicates that it drops table data synchronously. In any case, when I run this command before scaling down, the S3 files are properly removed.
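If the departing replica holds many tables, one way to apply this workaround (a sketch; run it on the departing replica and review the generated statements before executing them) is to build the DROP commands from system.tables:

-- Generate DROP ... SYNC statements for every MergeTree-family table,
-- skipping system databases. Review the output before executing it.
SELECT concat('DROP TABLE ', database, '.', name, ' SYNC;') AS stmt
FROM system.tables
WHERE engine LIKE '%MergeTree%'
  AND database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA');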

hodgesrm (Apr 07 '24 22:04)

Final notes:

  1. The reproduction described above did not use zero-copy replication.
  2. This issue also extends to the files backing ordinary MergeTree tables. It appears the operator only drops ReplicatedMergeTree tables, replicated databases, views, and dictionaries when a replica is removed, so plain MergeTree tables are never dropped at all. See https://github.com/Altinity/clickhouse-operator/blob/master/pkg/model/chi/schemer/sql.go#L31 for details. A query sketch for spotting such tables follows this list.
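As a quick check of note 2, a query like the following sketch (based on the engine filter described in that note, not on the operator's actual code) lists the tables on a replica that would apparently be skipped by the operator's cleanup:

-- MergeTree-family tables that are not Replicated*, i.e. the kind of tables
-- note 2 says the operator does not drop when a replica is removed.
SELECT database, name, engine
FROM system.tables
WHERE engine LIKE '%MergeTree%'
  AND engine NOT LIKE 'Replicated%'
  AND database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA');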

hodgesrm (Apr 07 '24 22:04)

Fixed in 0.24.0

alex-zaitsev (Jun 13 '24 08:06)

Released in https://github.com/Altinity/clickhouse-operator/releases/tag/release-0.23.7

alex-zaitsev (Aug 12 '24 19:08)