clickhouse-backup
Improve performance for download or restore remote backup
Maybe it's related to #142, but in any case the download/restore-from-remote speed is much slower than the upload/create_remote speed.
Here is the configuration:
general:
  remote_storage: s3
  max_file_size: 1099511627776
  disable_progress_bar: false
  backups_to_keep_local: 2
  backups_to_keep_remote: 30
  log_level: info
  allow_empty_backups: false
  download_concurrency: 255
  upload_concurrency: 255
clickhouse:
  username: default
  password: ""
  host: ip-10-40-2-21
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  timeout: 5m
  freeze_by_part: false
  secure: false
  skip_verify: false
  sync_replicated_tables: true
  skip_sync_replica_timeouts: true
  log_sql_queries: false
s3:
  access_key: access_key
  secret_key: secret_key
  bucket: bucket
  path: path
  endpoint: ""
  region: us-east-1
  acl: private
  force_path_style: false
  disable_ssl: false
  part_size: 0
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 255
api:
  listen: localhost:7171
  enable_metrics: true
  enable_pprof: false
  username: ""
  password: ""
  secure: false
  certificate_file: ""
  private_key_file: ""
  create_integration_tables: false
The test dataset is almost 14 GB.
create_remote works in parallel, which is cool and fast:
time clickhouse-backup --config /opt/clickhouse/clickhouse-backup/config.yml create_remote full_ch_backup_2021-11-04-TEST
...
2021/11/04 06:10:59 info done backup=full_ch_backup_2021-11-04-TEST duration=17.511s operation=upload size=13.91GiB
real 0m18.537s
user 1m54.922s
sys 0m28.236s
But download/restore from remote takes much more time than uploading to remote. As you can see, it is almost 6 times slower:
time clickhouse-backup --config /opt/clickhouse/clickhouse-backup/config.yml download full_ch_backup_2021-11-04-TEST
...
2021/11/04 06:16:00 info done backup=full_ch_backup_2021-11-04-TEST duration=1m53.003s operation=download size=13.89GiB
real 1m53.025s
user 0m20.756s
sys 0m38.276s
I changed download_concurrency and part_size to different values - no effect.
How can the download speed be increased?
I really appreciate your detailed report and feedback.
Could you provide more context? Which clickhouse-backup version do you use for the benchmark? Which environment do you use for the benchmark? Do you use AWS S3, or other implementations like Minio?
Be careful with big values of S3_CONCURRENCY in combination with DOWNLOAD_CONCURRENCY / UPLOAD_CONCURRENCY. It can allocate a lot of memory for buffers (see the rough estimate sketch below).
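As a back-of-envelope illustration only, a Go one-liner for the worst case. It assumes roughly one part-sized buffer per in-flight S3 stream per worker, which may not match clickhouse-backup's actual allocation pattern; the buffer size is an assumption, not a measured value:

package main

import "fmt"

func main() {
	// Hypothetical worst-case estimate; the formula and buffer size are
	// illustrative assumptions, not clickhouse-backup internals.
	const (
		downloadConcurrency = 255             // download_concurrency from the config above
		s3Concurrency       = 255             // s3.concurrency from the config above
		bufferSize          = 5 * 1024 * 1024 // assumed ~5 MiB buffer per in-flight part
	)
	totalBytes := int64(downloadConcurrency) * int64(s3Concurrency) * int64(bufferSize)
	fmt.Printf("worst-case buffer memory: ~%.1f GiB\n", float64(totalBytes)/(1<<30))
	// Prints ~317.5 GiB -- far beyond a 128 GB machine, which is why much
	// smaller concurrency values are recommended later in this thread.
}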
The difference between the UPLOAD and DOWNLOAD processes, from a parallelization point of view:
During upload:
- for each table, a parallel pool -> for each data part, a parallel pool -> upload into one db/table/disk_name_{num}.tar file, plus a parallel upload of db/table/metadata.json
During download:
- sequentially download metadata.json for each table (a point for optimization in the near future!)
- for each table, a parallel pool -> for each disk_name_X.tar file, download and unpack as a stream (it looks like this could be optimized with github.com/aws/aws-sdk-go/service/s3/s3manager NewDownloader instead of GetObjectRequest); a minimal sketch of this streaming step follows below
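For illustration, a minimal, self-contained sketch of that download path (parallel pool over tables, then a streaming GetObject + tar unpack per disk archive). It uses aws-sdk-go v1 and golang.org/x/sync/errgroup; the bucket, keys, and the tableArchives map are hypothetical, and this is not the project's actual code:

package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"golang.org/x/sync/errgroup"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := s3.New(sess)

	// Hypothetical layout: one tar archive per table/disk, as described above.
	tableArchives := map[string][]string{
		"db/table1": {"backup_name/db/table1/default_1.tar"},
		"db/table2": {"backup_name/db/table2/default_1.tar"},
	}

	g := new(errgroup.Group)
	g.SetLimit(8) // download_concurrency: parallel pool over tables

	for table, keys := range tableArchives {
		table, keys := table, keys
		g.Go(func() error {
			for _, key := range keys {
				// Stream the object body; no temp file and no WriteAt needed,
				// but each object is read sequentially.
				obj, err := svc.GetObject(&s3.GetObjectInput{
					Bucket: aws.String("backup-bucket"),
					Key:    aws.String(key),
				})
				if err != nil {
					return err
				}
				err = untar(obj.Body, filepath.Join("/var/lib/clickhouse/backup", table))
				obj.Body.Close()
				if err != nil {
					return err
				}
			}
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		log.Fatal(err)
	}
}

// untar unpacks a tar stream into dst while it is still being downloaded.
func untar(r io.Reader, dst string) error {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		target := filepath.Join(dst, hdr.Name)
		if hdr.Typeflag == tar.TypeDir {
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
			continue
		}
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		f, err := os.Create(target)
		if err != nil {
			return err
		}
		if _, err := io.Copy(f, tr); err != nil {
			f.Close()
			return err
		}
		f.Close()
	}
}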
Which clickhouse-backup version do you use for the benchmark?
clickhouse-backup -v
Version: 1.2.1
Git Commit: 38cac6b647f46c3e076650d574eb1f2fb8c3ecf0
Build Date: 2021-10-30
Which environment do you use for the benchmark? Do you use AWS S3, or other implementations like Minio?
I use AWS S3.
Be careful with big values of S3_CONCURRENCY in combination with DOWNLOAD_CONCURRENCY / UPLOAD_CONCURRENCY. It can allocate a lot of memory for buffers.
Thanks for that. For example, I have machines with 16 CPUs and 128 GB RAM. What are the recommended values for S3_CONCURRENCY, DOWNLOAD_CONCURRENCY and UPLOAD_CONCURRENCY?
During download:
- sequentially download metadata.json for each table (a point for optimization in the near future!)
- for each table, a parallel pool -> for each disk_name_X.tar file, download and unpack as a stream (it looks like this could be optimized with github.com/aws/aws-sdk-go/service/s3/s3manager NewDownloader instead of GetObjectRequest)
metadata.json files are pretty small, so I don't think downloading them is a problem. What I see is that the file for a big table is downloaded slowly and not in parallel. For example: we have a "big" 7 GB table. When uploading to remote, the progress bar shows it is uploaded in parallel, so it's very fast, but when downloading, the progress bar shows other files being downloaded in parallel while this big file is downloaded sequentially. Hope my explanation is clear for you :)
I have machines with 16 CPUs and 128 GB RAM. What are the recommended values for S3_CONCURRENCY, DOWNLOAD_CONCURRENCY and UPLOAD_CONCURRENCY?
Check how much memory is allocated, then use DOWNLOAD_CONCURRENCY=8, UPLOAD_CONCURRENCY=8, S3_CONCURRENCY=4 and compare the results.
Ok, will check it.
Sorry, I edited my last answer to add the explanation above about the big 7 GB table being downloaded sequentially rather than in parallel.
@malcolm061990 any results with lower concurrency numbers?
I tried to apply a multipart concurrent download implementation; unfortunately, it requires allocating additional disk space during download, and we can't apply in-memory streaming decompression.
You can try to combine clickhouse-backup create and rclone sync (see https://rclone.org for details), or use https://github.com/restic/restic to make incremental backups of the /var/lib/clickhouse/backup/ folder.
Sorry, for now I can't test the speed because our CH is under a load test. Will get back to that soon, thanks. But why does it require additional disk space during download?
But why does it require additional disk space during download?
Currently, we use a pool of parallel goroutines; each goroutine downloads one s3://bucket-name/path/backup_name/db/table/disk_name.tar file.
S3 allows us to use the s3manager.NewDownloader helper for multipart concurrent downloads, but it needs a "writer" that provides a WriteAt() method, which can only be implemented properly with an os.File, and that allocates disk space; otherwise we would have to allocate a lot of memory to hold the entire archive file.
The only method I see here is to change the remote storage file format and upload files directly, or to create archives for each data part instead of each table.
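To illustrate the constraint, an illustrative sketch (not clickhouse-backup's code; bucket, key, and temp-file names are made up): s3manager's Downloader writes byte ranges concurrently through io.WriterAt, which in practice means an os.File on disk, whereas the current streaming path only needs an io.Reader.

package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

	// Multipart concurrent download: Download() takes an io.WriterAt so the
	// parts can land out of order. An *os.File satisfies that, but it means
	// the whole .tar is written to local disk before it can be unpacked --
	// the extra disk space mentioned above.
	f, err := os.Create("/tmp/disk_default.tar") // temporary file = extra disk space
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	downloader := s3manager.NewDownloader(sess, func(d *s3manager.Downloader) {
		d.Concurrency = 4 // parallel ranged GETs for a single object
	})
	if _, err := downloader.Download(f, &s3.GetObjectInput{
		Bucket: aws.String("backup-bucket"),
		Key:    aws.String("backup_name/db/table/disk_default.tar"),
	}); err != nil {
		log.Fatal(err)
	}

	// By contrast, the current streaming path uses a plain GetObject and reads
	// the body sequentially, so it can be piped straight into a tar reader
	// without touching disk -- but with no per-object parallelism.
	obj, err := s3.New(sess).GetObject(&s3.GetObjectInput{
		Bucket: aws.String("backup-bucket"),
		Key:    aws.String("backup_name/db/table/disk_default.tar"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Body.Close()
	// obj.Body is an io.ReadCloser suitable for archive/tar streaming.
}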
Thanks for the explanation. For sure it's not a good idea to allocate additional disk space during download.
we need to change remote storage file format and upload files directly
Good idea, but it's not clear :) What do you mean?
Currently, each table's data creates one archive at s3://backup-bucket/path/backup_name/db/table/disk_name.archive.extension
We can try to create an archive for each data part (each system.parts element) instead of for each table.
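For illustration, a hypothetical before/after layout (the disk is assumed to be default with tar format, and the part names are made-up examples following ClickHouse's partition_minblock_maxblock_level naming, not actual backup contents):

Current, one archive per table:
s3://backup-bucket/path/backup_name/db/table/default.tar

Proposed, one archive per data part:
s3://backup-bucket/path/backup_name/db/table/default_20211101_1_1_0.tar
s3://backup-bucket/path/backup_name/db/table/default_20211101_2_2_0.tar
...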
If it doesn't break anything it will be cool
@malcolm061990 now the 1.5.x version is released, which has upload_by_parts: true and download_by_parts: true in the general section.
Could you try it and compare the download benchmark? I will close this issue after inactivity,
but please comment on the issue if you have any information about the performance comparison.