geesefs
A mounted Cloudflare R2 bucket sooner or later gets completely stuck
Over the last week, I've experienced multiple hangs with a mounted R2 bucket. It always happens during writing, near the end.
The mount command I use is nothing special; I've tried it both with and without the --cheap option:
geesefs --endpoint=https://id.r2.cloudflarestorage.com --shared-config=/home/ubuntu/.r2_auth --memory-limit=550 bucket r2
The shared config file is the following:
[default]
aws_access_key_id = ***
aws_secret_access_key = ***
The rsync process got stuck around 11:22, though I can't say more precisely:
building file list ...
1402 files to consider
cannot delete non-empty directory: deb/dists/stable/main/binary-arm64
cannot delete non-empty directory: deb/dists/stable/main/binary-arm64
cannot delete non-empty directory: deb/dists/stable/main/binary-amd64
cannot delete non-empty directory: deb/dists/stable/main/binary-amd64
cannot delete non-empty directory: deb/dists/stable/main
cannot delete non-empty directory: deb/dists/stable/main
cannot delete non-empty directory: deb/dists/stable
cannot delete non-empty directory: deb/dists/stable
cannot delete non-empty directory: deb/dists/lts/main/binary-arm64
cannot delete non-empty directory: deb/dists/lts/main/binary-arm64
cannot delete non-empty directory: deb/dists/lts/main/binary-amd64
cannot delete non-empty directory: deb/dists/lts/main/binary-amd64
cannot delete non-empty directory: deb/dists/lts/main
cannot delete non-empty directory: deb/dists/lts/main
cannot delete non-empty directory: deb/dists/lts
cannot delete non-empty directory: deb/dists/lts
cannot delete non-empty directory: deb/dists
cannot delete non-empty directory: rpm/lts/repodata
deb/pool/main/c/clickhouse/clickhouse-client_22.7.6.74_amd64.deb
75,152 100% 40.42MB/s 0:00:00 (xfr#1, to-chk=1107/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.7.6.74_arm64.deb
75,146 100% 6.51MB/s 0:00:00 (xfr#2, to-chk=1106/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.8.6.71_amd64.deb
75,274 100% 4.79MB/s 0:00:00 (xfr#3, to-chk=1105/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.8.6.71_arm64.deb
75,274 100% 1.84MB/s 0:00:00 (xfr#4, to-chk=1104/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.9.3.18_amd64.deb
86,612 100% 1.84MB/s 0:00:00 (xfr#5, to-chk=1099/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.9.3.18_arm64.deb
86,622 100% 1.59MB/s 0:00:00 (xfr#6, to-chk=1098/1402)
deb/pool/main/c/clickhouse/clickhouse-common-static-dbg_22.7.6.74_amd64.deb
872,235,938 100% 18.30MB/s 0:00:45 (xfr#7, to-chk=1079/1402)
deb/pool/main/c/clickhouse/clickhouse-common-static-dbg_22.7.6.74_arm64.deb
648,380,416 79% 7.38MB/s 0:00:22 # it's stuck here
Here's the relevant line from geesefs.log:
Sep 30 11:21:00 ip-172-31-90-73 /usr/bin/geesefs[551]: main.ERROR Failed to flush part 3 of object deb/pool/main/c/clickhouse/.clickhouse-common-static-dbg_22.7.6.74_amd64.deb.TdSQcV: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Something's wrong with the signature or with the access key...
Here https://community.cloudflare.com/t/r2-signaturedoesnotmatch-the-request-signature-we-calculated-does-not-match-the-signature-you-provided/383646 people say it may be caused by making more than 5 requests per second O_o
Is there a way to limit the request rate? It would be 1000% OK for us, and much better than the current, completely unusable state.
upd
Hmm, I'm now trying --max-flushers=1 --max-parallel-parts=1 --max-parallel-copy=1
upd2
It works, quite slowly though. Will try to set them to 2.
OK... but it's not a safe way to limit RPS, because if requests finish quicker than 200ms there will still be more than 5 rps :) Cloudflare is doing it wrong; they should return HTTP 429 in that case. That code is specifically intended for rate limiting in S3...
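To illustrate why capping concurrency doesn't bound the request rate: here's a minimal token-bucket sketch of what a hard 5 rps client-side limit would look like. This is purely illustrative; geesefs does not currently expose such an option, and the class name is hypothetical.

```python
import time


class TokenBucket:
    """Minimal rate limiter: at most `rate` requests per second.

    `clock` is injectable so the behaviour can be checked deterministically;
    in real use it defaults to time.monotonic.
    """

    def __init__(self, rate, clock=time.monotonic):
        self.rate = rate
        self.clock = clock
        self.next_free = clock()

    def wait_time(self):
        """Return how long the caller must wait before the next request,
        and reserve a 1/rate slot for it."""
        now = self.clock()
        wait = max(0.0, self.next_free - now)
        self.next_free = max(now, self.next_free) + 1.0 / self.rate
        return wait
```

With a 5 rps limit, 20 requests take at least ~4 seconds no matter how fast each individual request completes. By contrast, `--max-flushers=1` only serializes requests: if each one finishes in 50ms, you still hit ~20 rps.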
Thanks, I've created the topic on their forum! https://community.cloudflare.com/t/r2-returning-error-signaturedoesnotmatch-instead-of-http-code-429/423199
@vitalif Just in case, I want to express my respect to you for GeeseFS and for supporting it! It is a really great product; the performance is unmatched!
Thanks :)
Just as a reference for anybody looking for working settings, this is what works for us:
packages /a/path/to/mounted/r2 fuse.geesefs _netdev,user_id=1000,group_id=1000,--cheap,--file-mode=0666,--dir-mode=0777,--endpoint=https://accountid.r2.cloudflarestorage.com,--shared-config=/a/path/to/.r2_auth,--memory-limit=2050,--gc-interval=100,--max-flushers=2,--max-parallel-parts=3,--max-parallel-copy=2 0 0
This runs on a pretty thin instance with only 4 GiB of RAM. It has been quite solid after the tuning.
Hmm, unfortunately, even with these settings the client got stuck while copying files from another R2 directory:
Oct 27 08:54:33 ip-172-31-87-196 /usr/bin/geesefs[1710]: main.ERROR Failed to flush part 6 of object rpm/lts/clickhouse-common-static-22.8.7.34.x86_64.rpm: InvalidPart: There was a problem with the multipart upload.#012#011status code: 400, request id: , host id:
Can you suggest some other tunings?
I use the latest version BTW
Hmm, could it be related to a "solved" incident? https://www.cloudflarestatus.com/incidents/894125897dd7
If they return 400 for anything over 5 rps, then the only proper way to make it stable is to add request throttling %) Did you ask them about these 400s? What did they say?
https://community.cloudflare.com/t/r2-returning-error-signaturedoesnotmatch-instead-of-http-code-429/423199
Nothing yet..
Throttling looks like quite a decent solution
In fact I'm not 100% sure because 400 is not 403... :)
Hello, and happy New Year!
Recently our deployment broke, and it was pretty hard to determine how to fix it. The root cause was a multipart upload failing with status code 403. Here's one of a few completely failed uploads:
> grep se-common-static_23.12.2.59_amd64.deb /tmp/syslog.1
Jan 5 14:05:46 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 3 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan 5 14:06:04 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 8 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan 5 14:07:31 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 29 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan 5 14:08:42 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 47 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan 5 14:09:00 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 54 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan 5 14:09:42 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
Jan 5 14:10:13 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
Jan 5 14:10:45 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
Jan 5 14:11:16 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
# the last line kept repeating until the mount was restarted
The failed uploads are currently stuck in the "ongoing multipart upload" state.
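Those stale uploads can be aborted through the S3 API that R2 exposes, which also frees the storage held by the orphaned parts. A sketch, assuming a boto3-style client; the function name is illustrative:

```python
def abort_stale_multipart_uploads(s3, bucket):
    """List all in-progress multipart uploads in `bucket` and abort them.

    `s3` is any S3-compatible client exposing list_multipart_uploads /
    abort_multipart_upload, e.g.
    boto3.client("s3", endpoint_url="https://<account_id>.r2.cloudflarestorage.com").
    Returns the keys of the aborted uploads.
    """
    aborted = []
    kwargs = {"Bucket": bucket}
    while True:
        resp = s3.list_multipart_uploads(**kwargs)
        for up in resp.get("Uploads", []):
            s3.abort_multipart_upload(
                Bucket=bucket, Key=up["Key"], UploadId=up["UploadId"])
            aborted.append(up["Key"])
        if not resp.get("IsTruncated"):
            return aborted
        # Continue from where the previous page ended.
        kwargs["KeyMarker"] = resp.get("NextKeyMarker")
        kwargs["UploadIdMarker"] = resp.get("NextUploadIdMarker")
```

Note that a real run would abort any upload still in flight, so the mount should be idle (or unmounted) first.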
The mount command is currently:
$ ps wwwwufp 3105345
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ubuntu 3105345 1.2 58.6 3790016 2356188 ? Ssl Jan05 49:14 /usr/bin/geesefs packages /home/ubuntu/r2 -o rw,user_id=1000,group_id=1000,--uid=1000,--gid=1000,--cheap,--file-mode=0666,--dir-mode=0777,--endpoint=$URL,--shared-config=$FILE,--memory-limit=2050,--gc-interval=100,--max-flushers=5,--max-parallel-parts=3,--max-parallel-copy=2,dev,suid
$ grep r2 /etc/fstab
packages /home/ubuntu/r2 fuse.geesefs _netdev,user_id=1000,group_id=1000,--uid=1000,--gid=1000,--cheap,--file-mode=0666,--dir-mode=0777,--endpoint=$URL,--shared-config=$FILE,--memory-limit=2050,--gc-interval=100,--max-flushers=5,--max-parallel-parts=3,--max-parallel-copy=2 0 0
The version was updated some time ago, but it's not the latest one:
$ /usr/bin/geesefs --version
geesefs version 0.38.5
Do you think it's something new? I believe I updated from 0.32.0 on Nov 24, and we hadn't had any issues for almost a year before that.
Reference: https://github.com/ClickHouse/ClickHouse/issues/58556
And BTW, I can't reproduce the issue with a boto3 parallel upload. Using 100 threads against R2 with the standard s3_client.upload_file, it uploaded 1000 files in 4 seconds, ~250 requests/second. It feels like the original rate-limit issue no longer applies.
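The reproduction attempt looked roughly like this (a sketch; the helper name and file mapping are illustrative, and the client is passed in so any boto3-style S3 client works):

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_upload(s3, bucket, files, workers=100):
    """Upload `files` ({local_path: object_key}) concurrently.

    `s3` is any client exposing the standard upload_file(filename, bucket, key)
    method; for R2 that would be
    boto3.client("s3", endpoint_url="https://<account_id>.r2.cloudflarestorage.com").
    Returns the uploaded keys in input order.
    """
    def upload_one(item):
        path, key = item
        s3.upload_file(path, bucket, key)
        return key

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_one, files.items()))
```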
Happy New Year to you too :) It may be an issue that was reported a couple of times but never had a stable reproduction case. A lot of the related code was heavily refactored in 0.39.0, so it would be best to try again with 0.39.0. It's good news if Cloudflare doesn't limit parallelism anymore; you can remove max-parallel-parts and max-flushers in that case.
Thanks, I'll try the latest release https://github.com/yandex-cloud/geesefs/releases/tag/v0.40.0 then.