clickhouse-backup

Upload to GCS performance improvement for large parts

Vytenis-Valutkevicius opened this issue Oct 18 '24

Background: In our ClickHouse cluster we have several large tables with a rather small number of parts, for example a table of approximately 400 GB with around 20 parts. This means that some parts can be around 150 GB in size. For backup uploads to GCS buckets, each of these large parts is currently uploaded as a single stream, and such uploads can take multiple hours to complete (the backup process sometimes takes over 12 hours).

Suggestion: To improve upload performance to GCS, I would suggest implementing parallel multipart uploads to GCS buckets. This should be possible with the Google Cloud Storage XML API: https://cloud.google.com/storage/docs/xml-api/post-object-multipart. However, the Google storage SDK used for uploads has this open issue, https://github.com/googleapis/google-cloud-go/issues/3219, which asks for parallel multipart uploads to be implemented in the SDK's Transfer Manager. That would simplify a parallel multipart implementation for large table parts.
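
For illustration only (this is not existing clickhouse-backup code): a minimal Go sketch of one way to parallelise a single large upload with today's cloud.google.com/go/storage client, using composite uploads rather than the XML multipart API — split the part file into chunks, upload the chunks concurrently, then compose them server-side. The function name, bucket/object names and chunk size are made up for the example.

    package gcsupload

    import (
        "context"
        "fmt"
        "io"
        "os"

        "cloud.google.com/go/storage"
        "golang.org/x/sync/errgroup"
    )

    // uploadComposite uploads localPath to gs://bucket/object by uploading
    // fixed-size chunks in parallel and composing them into the final object.
    func uploadComposite(ctx context.Context, client *storage.Client, bucket, object, localPath string, chunkSize int64) error {
        f, err := os.Open(localPath)
        if err != nil {
            return err
        }
        defer f.Close()

        st, err := f.Stat()
        if err != nil {
            return err
        }
        bkt := client.Bucket(bucket)

        // One temporary object per chunk. GCS compose accepts at most 32 source
        // objects per request, so keep the chunk count <= 32 (or compose iteratively).
        var parts []*storage.ObjectHandle
        g, gctx := errgroup.WithContext(ctx)
        for off, i := int64(0), 0; off < st.Size(); off, i = off+chunkSize, i+1 {
            off, i := off, i // capture loop variables for the goroutine (pre-Go 1.22)
            part := bkt.Object(fmt.Sprintf("%s.part%03d", object, i))
            parts = append(parts, part)
            g.Go(func() error {
                w := part.NewWriter(gctx)
                // SectionReader stops at the chunk boundary; the last chunk simply ends at EOF.
                if _, err := io.Copy(w, io.NewSectionReader(f, off, chunkSize)); err != nil {
                    w.Close()
                    return err
                }
                return w.Close()
            })
        }
        if err := g.Wait(); err != nil {
            return err
        }

        // Stitch the chunks together server-side, then remove the temporary chunk objects.
        if _, err := bkt.Object(object).ComposerFrom(parts...).Run(ctx); err != nil {
            return err
        }
        for _, p := range parts {
            if err := p.Delete(ctx); err != nil {
                return err
            }
        }
        return nil
    }
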

In general, my questions are as follows:

  • Is this something you are aware of?
  • Could this be a future candidate for improvement?
  • If needed, we could provide some possible workarounds we are looking into.

Vytenis-Valutkevicius avatar Oct 18 '24 11:10 Vytenis-Valutkevicius

Also worth mentioning: we started investigating this because we were running out of disk space. For that, we recommend looking at the --delete-source flag if it suits your application. Nonetheless, a big upload performance boost would be very beneficial, so that we don't have to work around this by reorganising the sizing and structure of our table/part strategy.

eidmantas avatar Oct 18 '24 11:10 eidmantas

@Vytenis-Valutkevicius

Is this something you are aware of?

Yes, we are aware of the upload speed issue, and yes, we are watching https://github.com/googleapis/google-cloud-go/issues/3219. Unfortunately, there has been no progress from Google for a long time (or maybe I misunderstand the SDK behavior).

Could this be a future candidate for improvement?

Yes, pull requests are welcome.

If needed, we could provide some possible workarounds we are looking into.

As a possible workaround, you could use remote_storage: custom with gcloud as the transfer utility. Look at the rsync examples in https://github.com/Altinity/clickhouse-backup/tree/master/test/integration/rsync/ and https://github.com/Altinity/clickhouse-backup/blob/master/test/integration/config-custom-rsync.yml

You would need to write list.sh, download.sh, upload.sh, and delete.sh scripts that use gcloud.
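
For illustration, a minimal sketch of what such a configuration might look like, modelled on the config-custom-rsync.yml example above. The script paths are placeholders; the scripts themselves would wrap gcloud storage commands (for example, gcloud storage cp or gcloud storage rsync, which can already split large files into parallel composite uploads).

    general:
      remote_storage: custom
    custom:
      # These scripts are yours to write; see the rsync scripts above for the expected behaviour.
      list_command: /etc/clickhouse-backup/list.sh
      upload_command: /etc/clickhouse-backup/upload.sh
      download_command: /etc/clickhouse-backup/download.sh
      delete_command: /etc/clickhouse-backup/delete.sh
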

Slach avatar Oct 18 '24 13:10 Slach

Hi @Slach,

I've been following this thread and noticed that the GCP Storage SDK has now released the transfermanager package with support for parallel uploads and downloads.

This appears to address the originally requested feature from googleapis/google-cloud-go#3219.

My question: Is anyone currently working on implementing this transfermanager functionality into clickhouse-backup for GCS uploads?

This would significantly help with our large part uploads.

Thanks!

Modestas6 avatar Nov 21 '25 07:11 Modestas6

Nobody is working on this feature; feel free to make a pull request.

Slach avatar Nov 23 '25 19:11 Slach