
multiple performance degradations in the last 6 months

Open · daduke opened this issue 1 year ago · 24 comments

hey there,

we've been playing around with bcachefs for over a year as a possible future candidate for our multi-PB storage setup. We regularly compile upstream kernels and test tiered file system configurations. We always use the same disk layout and run the same fio performance test. Between August 2023 and today we've seen two significant performance degradations which, taken together, have roughly halved bcachefs' IOPS and throughput. If this is to be expected since you're not optimizing for performance yet, please ignore and close this issue. If not, here's the data: the system is an old (2015ish) test file server with a 16T HDD HW RAID6 split into 5 volume sets (sda1 to sda5) and two 380G caching SSDs (sdb and sdc) that we assemble in the following way:

bcachefs format --compression=lz4 --replicas=2 --label=hdd --durability=2 /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4 /dev/sda5 --label=ssd --durability=1 /dev/sdb /dev/sdc --foreground_target=ssd --promote_target=ssd --background_target=hdd --fs_label=data
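
(The mount step isn't shown here; a typical invocation for a multi-device bcachefs like this one would be the following, with the mountpoint being an assumption:)

mount -t bcachefs /dev/sda1:/dev/sda2:/dev/sda3:/dev/sda4:/dev/sda5:/dev/sdb:/dev/sdc /mnt/data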

On the resulting file system, we always run the same fio test:

fio --filename=randomrw --size=1GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=iops-test-job --eta-newline=1 > output.txt

The kernel is always compiled on the same Debian Bookworm machine, using Bookworm's 6.1 .config + make oldconfig. Back in August 2023 we pulled the out-of-tree bcachefs source and got

iops-test-job: (groupid=0, jobs=40): err= 0: pid=715781: Fri Aug 18 13:36:56 2023
  read: IOPS=27.5k, BW=107MiB/s (113MB/s)(12.6GiB/120006msec)

then with 6.7pre (as soon as the bcachefs source was upstreamed) it was

iops-test-job: (groupid=0, jobs=40): err= 0: pid=2702: Mon Nov 13 10:43:35 2023
  read: IOPS=22.2k, BW=86.6MiB/s (90.8MB/s)(10.1GiB/120004msec)

and now with 6.8rc2 it's

iops-test-job: (groupid=0, jobs=40): err= 0: pid=2050: Mon Jan 29 07:13:29 2024
  read: IOPS=14.0k, BW=54.7MiB/s (57.3MB/s)(6563MiB/120004msec)

The values are pretty consistent (±1 MB/s). We also see a performance drop if we create a bcachefs file system on just one SSD.

daduke avatar Jan 29 '24 06:01 daduke
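
For reference, the kernel build flow described above is roughly the following sketch; the config file path and the bindeb-pkg packaging target are assumptions on top of what the comment states:

cp /boot/config-6.1.0-*-amd64 .config   # start from Bookworm's stock 6.1 config (path assumed)
make oldconfig                          # carry the config forward, enabling CONFIG_BCACHEFS_FS when prompted
make -j"$(nproc)" bindeb-pkg            # one way to build installable Debian kernel packages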

How much trouble would it be for you to bisect?

koverstreet avatar Jan 29 '24 07:01 koverstreet

On rc2, the biggest change was that we switched to issuing flush ops correctly; that will have an impact.

We'll need to simplify the setup and establish a baseline; what performance are you seeing just testing on your SSD? And what is the SSD capable of?

koverstreet avatar Jan 29 '24 07:01 koverstreet
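
A baseline along those lines could look like the sketch below; the device names are taken from the setup above, the mountpoint is an assumption, and note that the raw-device run overwrites whatever is on /dev/sdb:

# 1) raw capability of the SSD itself (destructive for data on /dev/sdb)
fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=raw-ssd-baseline

# 2) the same workload on a single-SSD bcachefs
bcachefs format /dev/sdb
mount -t bcachefs /dev/sdb /mnt/ssd-test
cd /mnt/ssd-test && fio --filename=randomrw --size=1GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=iops-test-job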

How much trouble would it be for you to bisect?

I know it exists, but I haven't done it yet. I can only work on this on the side, so it would have to be largely automated...

daduke avatar Jan 29 '24 07:01 daduke
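
For what it's worth, a bisect between a known-fast and a known-slow kernel could be sketched like this; the good/bad tags are placeholders, and since each step of a kernel regression needs a rebuild and a reboot, the marking usually has to be done by hand rather than via git bisect run:

git bisect start
git bisect bad v6.8-rc2    # placeholder: a kernel showing the low numbers
git bisect good v6.5       # placeholder: a kernel that was still fast
# at each step: build the checked-out kernel, reboot into it, run the fio job,
# then mark the result and let git pick the next commit
git bisect good            # or: git bisect bad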

We'll need to simplify the setup and establish a baseline; what performance are you seeing just testing on your SSD? And what is the SSD capable of?

As I said, I occasionally also tested on just one SSD and it got slower as well. I would presume everyone else sees similar behavior (IIRC there was talk on Phoronix about reduced performance when CONFIG_BCACHEFS_DEBUG was introduced).

daduke avatar Jan 29 '24 07:01 daduke

I've reproduced it; I'm seeing a 50% perf regression since 6.7 with random_writes if I don't use no_data_io mode. Bisecting now.

koverstreet avatar Jan 29 '24 07:01 koverstreet

The debugging option that Phoronix was testing with is no longer an issue - I fixed the performance overhead of that code, so it's now always on and the option has been removed.

koverstreet avatar Jan 29 '24 07:01 koverstreet

The debugging option that Phoronix was testing with is no longer an issue - I fixed the performance overhead of that code, so it's now always on and the option has been removed.

I see. Good to know.

daduke avatar Jan 29 '24 07:01 daduke

Can you give the bcachefs-testing branch a try? I just pushed a patch to improve journal pipelining; when testing 4k random writes with high iodepth, this is a drastic performance improvement - ~200k IOPS to 560k IOPS.

koverstreet avatar Jan 31 '24 20:01 koverstreet
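
One way to fetch and test that branch; the repository URL below is an assumption, not something stated in the thread:

git clone --branch bcachefs-testing https://evilpiepirate.org/git/bcachefs.git   # URL assumed
cd bcachefs && git log --oneline -1   # confirm the journal-pipelining patch is included
# then build and install the kernel the same way as the earlier builds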

So I compiled bcachefs-testing including commit 30792a137600d56957c2491a60879d5e95bbf1ef, but I'm afraid it doesn't make much of a difference:

iops-test-job: (groupid=0, jobs=40): err= 0: pid=867: Thu Feb  1 08:26:41 2024
  read: IOPS=14.6k, BW=57.1MiB/s (59.8MB/s)(6848MiB/120002msec)

vs

iops-test-job: (groupid=0, jobs=40): err= 0: pid=2050: Mon Jan 29 07:13:29 2024
  read: IOPS=14.0k, BW=54.7MiB/s (57.3MB/s)(6563MiB/120004msec)

on Monday.

daduke avatar Feb 01 '24 07:02 daduke

Hang on, I missed that you were testing reads. Is this random or sequential?

koverstreet avatar Feb 01 '24 08:02 koverstreet

random:

fio --filename=randomrw --size=1GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=40 --time_based --group_reporting --name=iops-test-job --eta-newline=1

daduke avatar Feb 01 '24 08:02 daduke

FYI I just recompiled bcachefs-v6.5 to make sure I can reproduce the older, faster numbers and get

iops-test-job: (groupid=0, jobs=40): err= 0: pid=856: Mon Feb  5 12:46:59 2024
  read: IOPS=23.0k, BW=89.9MiB/s (94.2MB/s)(10.6GiB/120343msec)

daduke avatar Feb 05 '24 11:02 daduke

Also: it seems the version of bcachefs-utils plays a role. I'm currently on kernel build bcachefs-v6.5 and first created my file system using bcachefs-utils from some time in August 2023 (just to go with the old-school vibe). This resulted in

iops-test-job: (groupid=0, jobs=40): err= 0: pid=2076: Mon Feb  5 12:58:52 2024
  read: IOPS=24.0k, BW=93.6MiB/s (98.2MB/s)(11.0GiB/120352msec)

like above. When I create the same FS using bcachefs-utils HEAD, I get

iops-test-job: (groupid=0, jobs=40): err= 0: pid=6455: Mon Feb  5 13:48:23 2024
  read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(10.1GiB/120298msec)

not a huge difference, but noticeable.

daduke avatar Feb 05 '24 12:02 daduke
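
To pin down which userspace created which file system, the tools version can be recorded alongside each format run; bcachefs version prints the bcachefs-tools version, and the exact field names in show-super may differ slightly between tool versions:

bcachefs version                      # userspace tools version used for the format
bcachefs show-super /dev/sda1 | grep -i -E 'version|created'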

Did encoded_extent_max change? Or discard?

koverstreet avatar Feb 05 '24 18:02 koverstreet

Did encoded_extent_max change? Or discard?

Between the two bcachefs-utils versions, you mean? Not unless a default changed; I always use the same parameters.

daduke avatar Feb 06 '24 05:02 daduke

That's what I was asking - can you check the show-super output on your good and bad runs?

koverstreet avatar Feb 06 '24 05:02 koverstreet
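
The two fields in question can be read straight out of the superblock; any member device of the file system should do:

bcachefs show-super /dev/sda1 | grep -i -E 'encoded_extent_max|discard'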

1.4.0:
encoded_extent_max:                       64.0 KiB
Discard:                                                0

v0.1-730-g28e6dea:
encoded_extent_max:                       64.0 KiB
Discard:                                                0

daduke avatar Feb 06 '24 06:02 daduke

I guess it would be better if you posted the whole show-super output.

colttt avatar Feb 06 '24 08:02 colttt

Small update: 6.9 is back to 6.7 levels (even a bit higher), but still a good way off from August 2023.

daduke avatar May 13 '24 10:05 daduke