
A mounted Cloudflare R2 bucket sooner or later gets completely stuck

Open Felixoid opened this issue 1 year ago • 17 comments

During the last week, I've experienced multiple hangs with a mounted R2 bucket. It always happens during writing, near the end.

The mount command has nothing special in it; I've tried it with and without the --cheap option:

geesefs --endpoint=https://id.r2.cloudflarestorage.com --shared-config=/home/ubuntu/.r2_auth --memory-limit=550 bucket r2

The shared config file is the following:

[default]
aws_access_key_id = ***
aws_secret_access_key =***

The rsync process got stuck at around 11:22, but I can't say more precisely:

building file list ... 
1402 files to consider
cannot delete non-empty directory: deb/dists/stable/main/binary-arm64
cannot delete non-empty directory: deb/dists/stable/main/binary-arm64
cannot delete non-empty directory: deb/dists/stable/main/binary-amd64
cannot delete non-empty directory: deb/dists/stable/main/binary-amd64
cannot delete non-empty directory: deb/dists/stable/main
cannot delete non-empty directory: deb/dists/stable/main
cannot delete non-empty directory: deb/dists/stable
cannot delete non-empty directory: deb/dists/stable
cannot delete non-empty directory: deb/dists/lts/main/binary-arm64
cannot delete non-empty directory: deb/dists/lts/main/binary-arm64
cannot delete non-empty directory: deb/dists/lts/main/binary-amd64
cannot delete non-empty directory: deb/dists/lts/main/binary-amd64
cannot delete non-empty directory: deb/dists/lts/main
cannot delete non-empty directory: deb/dists/lts/main
cannot delete non-empty directory: deb/dists/lts
cannot delete non-empty directory: deb/dists/lts
cannot delete non-empty directory: deb/dists
cannot delete non-empty directory: rpm/lts/repodata
deb/pool/main/c/clickhouse/clickhouse-client_22.7.6.74_amd64.deb
         75,152 100%   40.42MB/s    0:00:00 (xfr#1, to-chk=1107/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.7.6.74_arm64.deb
         75,146 100%    6.51MB/s    0:00:00 (xfr#2, to-chk=1106/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.8.6.71_amd64.deb
         75,274 100%    4.79MB/s    0:00:00 (xfr#3, to-chk=1105/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.8.6.71_arm64.deb
         75,274 100%    1.84MB/s    0:00:00 (xfr#4, to-chk=1104/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.9.3.18_amd64.deb
         86,612 100%    1.84MB/s    0:00:00 (xfr#5, to-chk=1099/1402)
deb/pool/main/c/clickhouse/clickhouse-client_22.9.3.18_arm64.deb
         86,622 100%    1.59MB/s    0:00:00 (xfr#6, to-chk=1098/1402)
deb/pool/main/c/clickhouse/clickhouse-common-static-dbg_22.7.6.74_amd64.deb
    872,235,938 100%   18.30MB/s    0:00:45 (xfr#7, to-chk=1079/1402)
deb/pool/main/c/clickhouse/clickhouse-common-static-dbg_22.7.6.74_arm64.deb
    648,380,416  79%    7.38MB/s    0:00:22 # it's stuck here

Here's a log file geesefs.log

Felixoid avatar Sep 30 '22 11:09 Felixoid

Sep 30 11:21:00 ip-172-31-90-73 /usr/bin/geesefs[551]: main.ERROR Failed to flush part 3 of object deb/pool/main/c/clickhouse/.clickhouse-common-static-dbg_22.7.6.74_amd64.deb.TdSQcV: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id: 

Something's wrong with the signature or with the access key...

vitalif avatar Sep 30 '22 12:09 vitalif

Here https://community.cloudflare.com/t/r2-signaturedoesnotmatch-the-request-signature-we-calculated-does-not-match-the-signature-you-provided/383646 people say that it may be caused by doing more than 5 requests per second O_o

vitalif avatar Sep 30 '22 12:09 vitalif

Is there a way to limit the rate? That would be 1000% OK for us, and better than the current, completely unusable state.

upd

Hmm, I'm trying --max-flushers=1,--max-parallel-parts=1,--max-parallel-copy=1 now

upd2

It works, though quite slowly. I'll try setting them to 2.

Felixoid avatar Sep 30 '22 13:09 Felixoid

Ok... but it's not a safe way to limit RPS, because if requests finish in less than 200 ms there will still be more than 5 rps, even with only one request in flight :) Cloudflare is doing it wrong; they should return HTTP 429 (Too Many Requests) in that case. That status code exists specifically for rate limiting...
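To illustrate the difference between limiting concurrency and limiting request rate, here is a minimal sketch of a client-side throttle that spaces request starts at least 1/rate seconds apart, no matter how quickly each request finishes. It is not geesefs code: the class name `MinIntervalThrottle` is hypothetical, and the 5 rps figure is only the number quoted from the Cloudflare forum thread.

```python
import threading
import time


class MinIntervalThrottle:
    """Spaces out calls so that consecutive starts are at least 1/rate seconds apart."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate      # e.g. 0.2 s for 5 requests per second
        self._clock = clock              # injectable so the class can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()        # earliest time the next request may start

    def acquire(self):
        # Reserve the next free start slot under the lock, then wait outside it,
        # so that waiting threads do not block each other from reserving slots.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


# Hypothetical usage: call acquire() before every request, from any thread.
# throttle = MinIntervalThrottle(rate=5)
# throttle.acquire()
# send_request()   # placeholder for the actual S3 call
```

With a throttle like this, a single worker whose requests take 50 ms still starts only about 5 requests per second, which a concurrency limit of 1 alone would not guarantee.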

vitalif avatar Sep 30 '22 13:09 vitalif

Thanks, I've created the topic on their forum! https://community.cloudflare.com/t/r2-returning-error-signaturedoesnotmatch-instead-of-http-code-429/423199

Felixoid avatar Sep 30 '22 13:09 Felixoid

@vitalif Just in case, I want to say respect to you for GeeseFS and for supporting it! It is a really great product, the performance is unmatched!

alexey-milovidov avatar Sep 30 '22 14:09 alexey-milovidov

Thanks :)

vitalif avatar Sep 30 '22 14:09 vitalif

Just a reference for anybody looking for proper settings; this is what works for us:

packages /a/path/to/mounted/r2 fuse.geesefs _netdev,user_id=1000,group_id=1000,--cheap,--file-mode=0666,--dir-mode=0777,--endpoint=https://accountid.r2.cloudflarestorage.com,--shared-config=/a/path/to/.r2_auth,--memory-limit=2050,--gc-interval=100,--max-flushers=2,--max-parallel-parts=3,--max-parallel-copy=2 0 0

I've set up a pretty small instance; it has only 4 GiB of RAM. It works quite reliably after the tuning.

Felixoid avatar Oct 04 '22 11:10 Felixoid

Hmm, unfortunately, even with these settings the client gets stuck while copying files from another R2 directory:

Oct 27 08:54:33 ip-172-31-87-196 /usr/bin/geesefs[1710]: main.ERROR Failed to flush part 6 of object rpm/lts/clickhouse-common-static-22.8.7.34.x86_64.rpm: InvalidPart: There was a problem with the multipart upload.#012#011status code: 400, request id: , host id:

Can you suggest some other tunings?

I use the latest version BTW

Felixoid avatar Oct 27 '22 08:10 Felixoid

Hmm, could it be related to a "solved" incident? https://www.cloudflarestatus.com/incidents/894125897dd7

Felixoid avatar Oct 27 '22 09:10 Felixoid

If they return 400 for anything over 5 rps, then the only proper way to make it stable is adding request throttling %) Did you ask them about these 400s? What did they say?

vitalif avatar Oct 27 '22 23:10 vitalif

https://community.cloudflare.com/t/r2-returning-error-signaturedoesnotmatch-instead-of-http-code-429/423199

Nothing yet..

Throttling looks like quite a decent solution.

Felixoid avatar Oct 28 '22 05:10 Felixoid

In fact I'm not 100% sure because 400 is not 403... :)

vitalif avatar Oct 28 '22 22:10 vitalif

Hello, and happy New Year!

Recently our deployment failed, and it was pretty hard to work out how to fix it. The main cause was multipart upload parts failing with status code 403. Here is one of the few uploads that failed completely:

> grep se-common-static_23.12.2.59_amd64.deb /tmp/syslog.1
Jan  5 14:05:46 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 3 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan  5 14:06:04 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 8 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan  5 14:07:31 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 29 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan  5 14:08:42 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 47 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan  5 14:09:00 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to flush part 54 of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. #012#011status code: 403, request id: , host id:
Jan  5 14:09:42 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
Jan  5 14:10:13 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
Jan  5 14:10:45 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
Jan  5 14:11:16 ip-172-31-87-196 /usr/bin/geesefs[2820219]: main.ERROR Failed to finalize multi-part upload of object deb/pool/main/c/clickhouse/clickhouse-common-static_23.12.2.59_amd64.deb: InvalidPart: One or more of the specified parts could not be found.#012#011status code: 400, request id: , host id:
# the last lines kept repeating until the mount was restarted

The failed uploads are currently in the "Ongoing Multipart Upload" state.
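For anyone who wants to inspect or clean up such leftovers, here is a hedged sketch that lists incomplete multipart uploads through the S3 API and optionally aborts them. It assumes a boto3 S3 client pointed at the R2 endpoint and assumes that R2 serves `ListMultipartUploads` and `AbortMultipartUpload` the way S3 does; the function names are hypothetical, and nothing here is part of geesefs.

```python
def list_incomplete_uploads(client, bucket):
    """Return (key, upload_id) pairs for every multipart upload still in progress."""
    uploads = []
    kwargs = {"Bucket": bucket}
    while True:
        page = client.list_multipart_uploads(**kwargs)
        uploads.extend((u["Key"], u["UploadId"]) for u in page.get("Uploads", []))
        if not page.get("IsTruncated"):
            return uploads
        # Continue from where the previous page stopped.
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["UploadIdMarker"] = page["NextUploadIdMarker"]


def abort_incomplete_uploads(client, bucket, dry_run=True):
    """List pending uploads; abort them only when dry_run is False."""
    pending = list_incomplete_uploads(client, bucket)
    if not dry_run:
        for key, upload_id in pending:
            client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    return pending


# Hypothetical usage (requires boto3 and credentials; the endpoint is a placeholder):
# import boto3
# client = boto3.client("s3", endpoint_url="https://accountid.r2.cloudflarestorage.com")
# print(abort_incomplete_uploads(client, "packages", dry_run=True))
```

Keeping `dry_run=True` as the default means the first run only reports what would be aborted, which is safer while a mount may still be retrying an upload.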

The mount command is currently:

$ ps wwwwufp 3105345
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu   3105345  1.2 58.6 3790016 2356188 ?     Ssl  Jan05  49:14 /usr/bin/geesefs packages /home/ubuntu/r2 -o rw,user_id=1000,group_id=1000,--uid=1000,--gid=1000,--cheap,--file-mode=0666,--dir-mode=0777,--endpoint=$URL,--shared-config=$FILE,--memory-limit=2050,--gc-interval=100,--max-flushers=5,--max-parallel-parts=3,--max-parallel-copy=2,dev,suid
$ grep r2 /etc/fstab
packages /home/ubuntu/r2 fuse.geesefs _netdev,user_id=1000,group_id=1000,--uid=1000,--gid=1000,--cheap,--file-mode=0666,--dir-mode=0777,--endpoint=$URL,--shared-config=$FILE,--memory-limit=2050,--gc-interval=100,--max-flushers=5,--max-parallel-parts=3,--max-parallel-copy=2 0 0

syslog.1.gz

The version was updated some time ago, but it is not the latest one:

$ /usr/bin/geesefs --version
geesefs version 0.38.5

Do you think it's something new? I think I updated the version from 0.32.0 on Nov 24, and we hadn't had any issues with that for almost a year.

Reference: https://github.com/ClickHouse/ClickHouse/issues/58556

Felixoid avatar Jan 08 '24 17:01 Felixoid

And, BTW, I can't reproduce the issue with a parallel upload via boto3. I use 100 threads uploading to R2 with the standard s3_client.upload_file. It completed 1000 file uploads in 4 seconds, ~250 requests/second. It feels like the original rate-limit issue no longer applies.
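A rough sketch of what such a reproduction attempt could look like, for anyone who wants to repeat it. This is not the script that was actually run: the helper names `upload_all` and `make_r2_client` are hypothetical, the object key is simply assumed to equal the local path, and credentials are assumed to come from boto3's default credential chain.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(upload_file, bucket, paths, workers=100):
    """Upload every path concurrently; return (path, error) for each failure."""

    def one(path):
        try:
            # Same positional order as boto3's upload_file(Filename, Bucket, Key).
            upload_file(path, bucket, path)
            return None
        except Exception as exc:
            return (path, exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, paths))
    return [r for r in results if r is not None]


def make_r2_client(endpoint_url):
    """Build an S3 client for an R2 endpoint (boto3 is imported lazily)."""
    import boto3  # third-party dependency, only needed for a real run
    return boto3.client("s3", endpoint_url=endpoint_url)


# Hypothetical usage (endpoint, bucket, and file names are placeholders):
# client = make_r2_client("https://accountid.r2.cloudflarestorage.com")
# failures = upload_all(client.upload_file, "bucket", ["file-0001.bin", "file-0002.bin"])
# print(len(failures), "uploads failed")
```

Collecting failures instead of raising on the first one makes it easier to see whether errors such as SignatureDoesNotMatch appear sporadically under load.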

Felixoid avatar Jan 09 '24 00:01 Felixoid

Happy New Year to you too :) It may be an issue that was reported a couple of times but without a stable reproduction case. A lot of the related code was refu..refactored really hard in 0.39.0, so it would be better to try again with 0.39.0. It's good if CloudFlare doesn't limit parallelism anymore; you can remove max-parallel-parts and max-flushers in that case.

vitalif avatar Jan 09 '24 08:01 vitalif

Thanks, I'll try the latest release https://github.com/yandex-cloud/geesefs/releases/tag/v0.40.0 then.

Felixoid avatar Jan 09 '24 09:01 Felixoid