multiple performance degradations in the last 6 months
hey there,
We've been playing around with bcachefs for over a year as a possible future candidate for our multi-PB storage setup. We regularly compile upstream kernels and test tiered file system configurations, always using the same disk layout and running the same fio performance test. Between August 2023 and today we've seen two significant performance degradations which together effectively halved bcachefs' IOPS and throughput. If this is to be expected since you're not optimizing for performance yet, please ignore and close this issue. If not, here's the data:
The system is an old (2015ish) test file server with a 16T HDD HW RAID6 split into 5 volume sets (sda1 to sda5) and two 380G caching SSDs (sdb and sdc), which we assemble in the following way:
bcachefs format --compression=lz4 --replicas=2 --label=hdd --durability=2 /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4 /dev/sda5 --label=ssd --durability=1 /dev/sdb /dev/sdc --foreground_target=ssd --promote_target=ssd --background_target=hdd --fs_label=data
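The result is then mounted as one multi-device filesystem, roughly like this (mount point is illustrative):
mount -t bcachefs /dev/sda1:/dev/sda2:/dev/sda3:/dev/sda4:/dev/sda5:/dev/sdb:/dev/sdc /mnt/data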
On the resulting file system, we always run the same fio test:
fio --filename=randomrw --size=1GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=iops-test-job --eta-newline=1 > output.txt
The kernel is always compiled on the same Debian Bookworm machine, starting from Bookworm's 6.1 .config and running make oldconfig.
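The build steps are essentially the following (config file name and option spellings are from memory, so treat this as a sketch):
cp /boot/config-6.1.0-*-amd64 .config     # Bookworm's stock 6.1 config
make oldconfig                            # answer prompts for new symbols
scripts/config --enable BCACHEFS_FS       # make sure bcachefs gets built
make -j"$(nproc)" bindeb-pkg              # build installable Debian packages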
Back in August 2023 we pulled the then out-of-tree bcachefs source and got:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=715781: Fri Aug 18 13:36:56 2023
read: IOPS=27.5k, BW=107MiB/s (113MB/s)(12.6GiB/120006msec)
Then with 6.7-pre (as soon as the bcachefs source was upstreamed) it was:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=2702: Mon Nov 13 10:43:35 2023
read: IOPS=22.2k, BW=86.6MiB/s (90.8MB/s)(10.1GiB/120004msec)
And now with 6.8-rc2 it's:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=2050: Mon Jan 29 07:13:29 2024
read: IOPS=14.0k, BW=54.7MiB/s (57.3MB/s)(6563MiB/120004msec)
The values are pretty consistent (±1 MB/s). We also see a performance drop if we create a bcachefs on just one SSD.
How much trouble would it be for you to bisect?
On rc2, the biggest change was that we switched to issuing flush ops correctly; that will have an impact.
We'll need to simplify the setup and establish a baseline; what performance are you seeing just testing on your SSD? And what is the SSD capable of?
How much trouble would it be for you to bisect?
I know it exists, haven't done it yet. I can only work on this on the side, so it would have to be largely automated...
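The rough shape of it would be something like this (tags/commits are placeholders; each step still needs a build, a reboot and one fio run, which is the part I'd have to script):
git bisect start
git bisect bad v6.8-rc2              # current, slow kernel
git bisect good <last-good-commit>   # e.g. the August 2023 snapshot we built
# build, boot and run the fio job on each commit git suggests, then:
git bisect good                      # or: git bisect bad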
We'll need to simplify the setup and establish a baseline; what performance are you seeing just testing on your SSD? And what is the SSD capable of?
As I said, I occasionally also tested on just one SSD and it got slower as well. I would presume everyone else would see similar behavior (IIRC there has been talk about reduced performance on Phoronix when CONFIG_BCACHEFS_DEBUG was introduced).
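For the single-SSD baseline, the comparison would look roughly like this (device path and label are illustrative; the raw fio run destroys data on the device):
# raw ceiling of the SSD itself
fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=raw-ssd-baseline
# single-SSD bcachefs, then the usual randomrw job on the mount point
bcachefs format --compression=lz4 --fs_label=ssdonly /dev/sdb
mount -t bcachefs /dev/sdb /mnt/ssdonly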
I've repro'd it; I'm seeing a 50% perf regression since 6.7 with random_writes if I don't use no_data_io mode. Bisecting now.
The debugging option that Phoronix was testing with is no longer an issue - I fixed the performance overhead of that code, so it's now always on and the option has been removed.
The debugging option that Phoronix was testing with is no longer an issue - I fixed the performance overhead of that code, so it's now always on and the option has been removed.
I see. Good to know.
Can you give the bcachefs-testing branch a try? I just pushed a patch to improve journal pipelining; when testing 4k random writes with high iodepth, this is a drastic performance improvement - ~200k iops to 560k iops.
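To isolate the write path that patch touches, a write-only variant of the same job (identical parameters, just --rw=randwrite) would be the closest comparison - something along these lines:
fio --filename=randomrw --size=1GB --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=iops-test-job --eta-newline=1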
So I compiled bcachefs-testing including 30792a137600d56957c2491a60879d5e95bbf1ef, but I'm afraid that doesn't make much of a difference:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=867: Thu Feb 1 08:26:41 2024
read: IOPS=14.6k, BW=57.1MiB/s (59.8MB/s)(6848MiB/120002msec)
vs
iops-test-job: (groupid=0, jobs=40): err= 0: pid=2050: Mon Jan 29 07:13:29 2024
read: IOPS=14.0k, BW=54.7MiB/s (57.3MB/s)(6563MiB/120004msec)
on Monday
Hang on, I missed that you were testing reads. Is this random or sequential?
random:
fio --filename=randomrw --size=1GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=iops-test-job --eta-newline=1
FYI I just recompiled bcachefs-v6.5 to make sure I can reproduce the older, faster numbers and get:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=856: Mon Feb 5 12:46:59 2024
read: IOPS=23.0k, BW=89.9MiB/s (94.2MB/s)(10.6GiB/120343msec)
Also: it seems the version of bcachefs-tools plays a role. I'm currently on kernel build bcachefs-v6.5 and first created my file system using bcachefs-tools from some time in August 2023 (just to go with the oldschool vibe). This resulted in:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=2076: Mon Feb 5 12:58:52 2024
read: IOPS=24.0k, BW=93.6MiB/s (98.2MB/s)(11.0GiB/120352msec)
like above. When I create the same FS using bcachefs-tools HEAD, I get:
iops-test-job: (groupid=0, jobs=40): err= 0: pid=6455: Mon Feb 5 13:48:23 2024
read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(10.1GiB/120298msec)
Not a huge difference, but noticeable.
Did encoded_extent_max change? Or discard?
Did encoded_extent_max change? Or discard?
Between the two bcachefs-tools versions, you mean? Not unless the default changed; I always use the same parameters.
That's what I was asking - can you check the show-super output on your good and bad runs?
1.4.0:
encoded_extent_max: 64.0 KiB
Discard: 0
v0.1-730-g28e6dea:
encoded_extent_max: 64.0 KiB
Discard: 0
I guess it would be better if you post the whole show-super output.
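For example (device path illustrative):
bcachefs show-super /dev/sda1                                          # full superblock dump
bcachefs show-super /dev/sda1 | grep -iE 'encoded_extent_max|discard'  # just the two fields in question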
Small update: 6.9 is back to 6.7 levels (even a bit higher), but still a good way off from August 2023.