Slow directory listings over GlusterFS FUSE in Gluster 10.1
Description of problem:
We have a directory with ~4300 files which takes about 15 seconds to run an ll against:
time ll /mnt/gluster/store|wc -l
4352
real 0m14.979s
user 0m0.027s
sys 0m0.059s
This seems excessive. Each of these bricks is an SSD, so drive access should be relatively fast. Listing the same directory directly on a brick (the volume is distributed-dispersed) is much faster, on the order of 0.05s, with roughly the same number of entries:
time ll /data/glusterfs/the-volume/brick-4/store/|wc -l
4342
real 0m0.054s
user 0m0.022s
sys 0m0.033s
I've seen this take between 0.01 and 0.055s on a brick, so even with a sequential read of every brick this shouldn't take more than about 2.25 seconds (and, if I understand correctly, the reads shouldn't be sequential at all; they should happen in parallel). Here's a sequential ll against the offending directory on all of the bricks on one of the three machines:
time for brick in $(sudo gluster volume info the-volume|grep Brick|grep gluster1|awk -F: '{print $3}'); do ll $brick/store > /dev/null; done
real 0m0.774s
user 0m0.316s
sys 0m0.392s
Even cached, the ll on a GlusterFS mount still takes ~7 seconds (when I run the command a couple of times in a row quickly).
I looked at some previous Gluster issues that seemed similar, such as one that mentions having to do lots of getdents calls, so I went ahead and ran strace against an ls on this problem directory to see if there is any similar behavior (even though that bug is for Gluster 3.x and we're on 10.1 on both client and server).
In my case I see a lot of getdents64 calls: 2241 of them in the strace I captured, and most of them return only one entry! This seems crazy (strace available on request, as noted below).
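Something along these lines reproduces the measurement; the mount path is the one from above, and the exact strace invocation may differ slightly from what I originally ran:
# -c prints a summary table with the number of getdents64 calls and the time spent in them
strace -c -e trace=getdents64 ls /mnt/gluster/store > /dev/null
# to keep the full trace and count the calls by hand instead:
strace -e trace=getdents64 -o /tmp/ls-getdents.strace ls /mnt/gluster/store > /dev/null
grep -c getdents64 /tmp/ls-getdents.strace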
The exact command to reproduce the issue:
ls
The full output of the command that failed: it didn't fail, it's just slow
Expected results: I would expect that it wouldn't be this slow.
Mandatory info:
- The output of the gluster volume info command:
Volume Name: the-volume
Type: Distributed-Disperse
Volume ID: 26fb3836-8ee2-4e1c-ae6a-8a90698b8bb7
Status: Started
Snapshot Count: 0
Number of Bricks: 15 x (2 + 1) = 45
Transport-type: tcp
Bricks:
Brick1: gluster4:/data/glusterfs/the-volume/brick-2
Brick2: gluster2:/data/glusterfs/the-volume/brick-2/b
Brick3: gluster1:/data/glusterfs/the-volume/brick-2
Brick4: gluster4:/data/glusterfs/the-volume/brick-3
Brick5: gluster2:/data/glusterfs/the-volume/brick-3/b
Brick6: gluster1:/data/glusterfs/the-volume/brick-3
Brick7: gluster4:/data/glusterfs/the-volume/brick-4
Brick8: gluster2:/data/glusterfs/the-volume/brick-4/b
Brick9: gluster1:/data/glusterfs/the-volume/brick-4
Brick10: gluster4:/data/glusterfs/the-volume/brick-5
Brick11: gluster2:/data/glusterfs/the-volume/brick-5/b
Brick12: gluster1:/data/glusterfs/the-volume/brick-5
Brick13: gluster4:/data/glusterfs/the-volume/brick-1
Brick14: gluster2:/data/glusterfs/the-volume/brick-1/b
Brick15: gluster1:/data/glusterfs/the-volume/brick-1
Brick16: gluster4:/data/glusterfs/the-volume/brick-6
Brick17: gluster2:/data/glusterfs/the-volume/brick-6/b
Brick18: gluster1:/data/glusterfs/the-volume/brick-6
Brick19: gluster4:/data/glusterfs/the-volume/brick-7
Brick20: gluster2:/data/glusterfs/the-volume/brick-7/b
Brick21: gluster1:/data/glusterfs/the-volume/brick-7
Brick22: gluster4:/data/glusterfs/the-volume/brick-8
Brick23: gluster2:/data/glusterfs/the-volume/brick-8/b
Brick24: gluster1:/data/glusterfs/the-volume/brick-8
Brick25: gluster4:/data/glusterfs/the-volume/brick-9
Brick26: gluster2:/data/glusterfs/the-volume/brick-9/b
Brick27: gluster1:/data/glusterfs/the-volume/brick-9
Brick28: gluster1:/data/glusterfs/the-volume/brick-10/brick
Brick29: gluster2:/data/glusterfs/the-volume/brick-10/brick
Brick30: gluster4:/data/glusterfs/the-volume/brick-10/brick
Brick31: gluster1:/data/glusterfs/the-volume/brick-11/brick
Brick32: gluster2:/data/glusterfs/the-volume/brick-11/brick
Brick33: gluster4:/data/glusterfs/the-volume/brick-11/brick
Brick34: gluster1:/data/glusterfs/the-volume/brick-12/brick
Brick35: gluster2:/data/glusterfs/the-volume/brick-12/brick
Brick36: gluster4:/data/glusterfs/the-volume/brick-12/brick
Brick37: gluster1:/data/glusterfs/the-volume/brick-13/brick
Brick38: gluster2:/data/glusterfs/the-volume/brick-13/brick
Brick39: gluster4:/data/glusterfs/the-volume/brick-13/brick
Brick40: gluster1:/data/glusterfs/the-volume/brick-14/brick
Brick41: gluster2:/data/glusterfs/the-volume/brick-14/brick
Brick42: gluster4:/data/glusterfs/the-volume/brick-14/brick
Brick43: gluster1:/data/glusterfs/the-volume/brick-15/brick
Brick44: gluster2:/data/glusterfs/the-volume/brick-15/brick
Brick45: gluster4:/data/glusterfs/the-volume/brick-15/brick
Options Reconfigured:
storage.owner-uid: 3333
storage.owner-gid: 3333
cluster.rebal-throttle: lazy
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.client-io-threads: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.readdir-ahead: on
performance.parallel-readdir: on
- The output of the gluster volume status command:
Status of volume: the-volume
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster4:/data/glusterfs/the-volu
me/brick-2 59323 0 Y 2079
Brick gluster2:/data/glusterfs/the-volu
me/brick-2/b 49945 0 Y 2072
Brick gluster1:/data/glusterfs/the-volu
me/brick-2 55244 0 Y 775693
Brick gluster4:/data/glusterfs/the-volu
me/brick-3 52160 0 Y 2083
Brick gluster2:/data/glusterfs/the-volu
me/brick-3/b 59480 0 Y 2079
Brick gluster1:/data/glusterfs/the-volu
me/brick-3 57257 0 Y 775881
Brick gluster4:/data/glusterfs/the-volu
me/brick-4 59794 0 Y 2092
Brick gluster2:/data/glusterfs/the-volu
me/brick-4/b 49511 0 Y 2095
Brick gluster1:/data/glusterfs/the-volu
me/brick-4 55866 0 Y 776128
Brick gluster4:/data/glusterfs/the-volu
me/brick-5 49806 0 Y 2104
Brick gluster2:/data/glusterfs/the-volu
me/brick-5/b 49886 0 Y 2101
Brick gluster1:/data/glusterfs/the-volu
me/brick-5 60343 0 Y 776446
Brick gluster4:/data/glusterfs/the-volu
me/brick-1 49911 0 Y 2109
Brick gluster2:/data/glusterfs/the-volu
me/brick-1/b 58524 0 Y 2116
Brick gluster1:/data/glusterfs/the-volu
me/brick-1 54262 0 Y 611244
Brick gluster4:/data/glusterfs/the-volu
me/brick-6 49605 0 Y 2121
Brick gluster2:/data/glusterfs/the-volu
me/brick-6/b 60200 0 Y 2122
Brick gluster1:/data/glusterfs/the-volu
me/brick-6 53176 0 Y 811671
Brick gluster4:/data/glusterfs/the-volu
me/brick-7 58323 0 Y 2150
Brick gluster2:/data/glusterfs/the-volu
me/brick-7/b 57072 0 Y 2142
Brick gluster1:/data/glusterfs/the-volu
me/brick-7 56213 0 Y 811960
Brick gluster4:/data/glusterfs/the-volu
me/brick-8 57732 0 Y 2144
Brick gluster2:/data/glusterfs/the-volu
me/brick-8/b 53440 0 Y 2148
Brick gluster1:/data/glusterfs/the-volu
me/brick-8 49418 0 Y 812373
Brick gluster4:/data/glusterfs/the-volu
me/brick-9 54655 0 Y 2146
Brick gluster2:/data/glusterfs/the-volu
me/brick-9/b 57784 0 Y 2160
Brick gluster1:/data/glusterfs/the-volu
me/brick-9 49747 0 Y 812824
Brick gluster1:/data/glusterfs/the-volu
me/brick-10/brick 57437 0 Y 1041056
Brick gluster2:/data/glusterfs/the-volu
me/brick-10/brick 50079 0 Y 11329
Brick gluster4:/data/glusterfs/the-volu
me/brick-10/brick 53452 0 Y 7141
Brick gluster1:/data/glusterfs/the-volu
me/brick-11/brick 52977 0 Y 1041505
Brick gluster2:/data/glusterfs/the-volu
me/brick-11/brick 56015 0 Y 11759
Brick gluster4:/data/glusterfs/the-volu
me/brick-11/brick 60303 0 Y 7212
Brick gluster1:/data/glusterfs/the-volu
me/brick-12/brick 50505 0 Y 1042267
Brick gluster2:/data/glusterfs/the-volu
me/brick-12/brick 53435 0 Y 12372
Brick gluster4:/data/glusterfs/the-volu
me/brick-12/brick 54059 0 Y 7292
Brick gluster1:/data/glusterfs/the-volu
me/brick-13/brick 60925 0 Y 1043081
Brick gluster2:/data/glusterfs/the-volu
me/brick-13/brick 51823 0 Y 13058
Brick gluster4:/data/glusterfs/the-volu
me/brick-13/brick 56723 0 Y 7445
Brick gluster1:/data/glusterfs/the-volu
me/brick-14/brick 57201 0 Y 1043815
Brick gluster2:/data/glusterfs/the-volu
me/brick-14/brick 60582 0 Y 13811
Brick gluster4:/data/glusterfs/the-volu
me/brick-14/brick 60141 0 Y 7662
Brick gluster1:/data/glusterfs/the-volu
me/brick-15/brick 59438 0 Y 1044952
Brick gluster2:/data/glusterfs/the-volu
me/brick-15/brick 54259 0 Y 14678
Brick gluster4:/data/glusterfs/the-volu
me/brick-15/brick 51585 0 Y 7886
Self-heal Daemon on localhost N/A N/A Y 812835
Self-heal Daemon on gluster4. N/A N/A Y 2193
Self-heal Daemon on gluster2. N/A N/A Y 2190
Task Status of Volume the-volume
------------------------------------------------------------------------------
Task : Rebalance
ID : 4313b398-a5d7-46e4-883d-ea3147987e98
Status : completed
- The output of the gluster volume heal command:
Error : Request timed out
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/
There are a lot of these and they're pretty large. Would a sample be OK?
- Is there any crash? Provide the backtrace and coredump: I have a large strace (216 KB), which seems too large to paste directly into a bug report.
Additional info: Some names/paths have been changed to protect the innocent
- The operating system / glusterfs version: Both client and server are running Ubuntu 22.04.1 LTS with Gluster 10.1.
Hey folks, I was able to reproduce this behavior with a fresh set of Ubuntu 22.10 VMs and Gluster 3.10. I've made a reproduction case using Ansible and created a repo for it. I also included some of the stuff I had excluded for brevity in the original bug report (logs, strace, etc). You can find this repo here. Hopefully this helps to debug the issue!
I tested this on Gluster 11.0 (via an upgrade from 10.1 - I did not create a cluster from scratch using 11 and test that yet) from the Gluster team PPA and it doesn't seem to make a big difference, despite some Gluster release notes claiming improvements to maybe-related code. It's still as slow as ever, as far as I can tell.
Hello folks, I just wanted to try to follow up here. We're still having this problem. I'd be happy to provide any further information requested. Thanks!
Confirming the problem. I just started testing different distributed storage systems, and listing takes a lot of time. gluster volume status also returns Error : Request timed out.
I thought GlusterFS was a promising one.
[2023-04-28 08:21:44.094407 +0000] I [MSGID: 106499] [glusterd-handler.c:4458:__glusterd_handle_status_volume] 0-management: Received status volume req for volume www1
[2023-04-28 08:22:02.704835 +0000] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0x9e952) [0x7f5b32fce952] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd48f2) [0x7f5b330048f2] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd4352) [0x7f5b33004352] ) 0-management: Lock for www1 held by 7885d813-0968-4e8d-8f1d-12a768288f62
[2023-04-28 08:22:02.704922 +0000] E [MSGID: 106118] [glusterd-syncop.c:1923:gd_sync_task_begin] 0-management: Unable to acquire lock for www1
[2023-04-28 08:22:18.694418 +0000] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0x9e952) [0x7f5b32fce952] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd48f2) [0x7f5b330048f2] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd4352) [0x7f5b33004352] ) 0-management: Lock for www1 held by 7885d813-0968-4e8d-8f1d-12a768288f62
[2023-04-28 08:22:58.732966 +0000] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0x9e952) [0x7f5b32fce952] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd48f2) [0x7f5b330048f2] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd4352) [0x7f5b33004352] ) 0-management: Lock for www1 held by 7885d813-0968-4e8d-8f1d-12a768288f62
The message "E [MSGID: 106118] [glusterd-syncop.c:1923:gd_sync_task_begin] 0-management: Unable to acquire lock for www1" repeated 2 times between [2023-04-28 08:22:02.704922 +0000] and [2023-04-28 08:22:58.733063 +0000]
Can you please share the below data?
- Enable profile: gluster v profile start
- Clean the profile: gluster v profile info clear
- Run the test case
- Run the profile again: gluster v profile info
Please share the current gluster release on which you are executing a test case.
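For example, using the volume and mount path from this report (substitute your own volume name and mount point):
gluster volume profile the-volume start        # enable profiling on the volume
gluster volume profile the-volume info clear   # reset the counters
time ls -l /mnt/gluster/store > /dev/null      # run the test case (the slow listing)
gluster volume profile the-volume info         # per-brick FOP counts and latencies
gluster volume profile the-volume stop         # optional: stop profiling afterwards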
My problem was resolved. It was caused by MTU 9000 being set on the server interfaces while the interconnecting switches did not have jumbo frames enabled.
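In case it helps anyone else hitting this, a quick way to check for that kind of mismatch (assuming Linux hosts; eth0 and the peer hostname are just examples):
ip link show eth0 | grep -o 'mtu [0-9]*'   # confirm the interface MTU on each node
# 8972 = 9000 minus 20 bytes of IP header and 8 bytes of ICMP header; -M do forbids
# fragmentation, so this only succeeds if every hop in the path accepts jumbo frames
ping -c 3 -M do -s 8972 gluster2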
I can confirm this problem has not been resolved. strace -Tc ls -hal reveals an excessive number of getdents calls, among other directory- and stat-related syscalls, on any FUSE mount associated with GlusterFS.
For additional context, I also have MooseFS running in parallel with Gluster on the exact same hardware configuration, and the slow directory issue does not exist there. Both file systems use FUSE, so this is most likely not a FUSE issue but something internal to GlusterFS. The problem isn't a small-file problem but a many-file one.
Sadly, my servers reside in an air-gapped environment and I cannot provide logs, configuration data, or console output. However, this issue continues to exist in every version of GlusterFS (7.x - 11.x) regardless of the volume type or configuration options set. Based on the longevity of this issue, I find it highly unlikely it will ever get resolved before Gluster's EOL.
Can you please share the below data?
- Enable profile: gluster v profile start
- Clean the profile: gluster v profile info clear
- Run the test case
- Run the profile again: gluster v profile info, and share the gluster v info output also
Please share the current gluster release on which you are executing a test case.
Sure thing, I will start profiling one of our test volumes today. Unfortunately, I work in an air-gapped environment, so providing any dumps is going to be tricky. I will attempt to transcribe the data listed in your template, so please bear with me.
My environment is running RHEL 8.9 and GlusterFS 10.3 with a combination of dispersed and distributed-dispersed volumes in an 8 x (2 + 1) configuration.
I think this issue traces back to the limited buffer size associated with the readdir/readdirp syscalls creating a bottleneck. Actual volume performance is relatively good for the hardware, but FUSE seems to only pull stat/metadata in groups of about 1K files at a time.
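A rough way to check that per-call batching directly (assuming strace is available; the mount path below is just a placeholder):
# -T shows time spent in each call; the third getdents64 argument is the userspace
# buffer size and the return value is how many bytes of dirents actually came back
strace -T -e trace=getdents64 -o /tmp/getdents.trace ls /path/to/fuse-mount/dir > /dev/null
head -n 20 /tmp/getdents.trace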
After enabling readdir-ahead there should not be an issue with the buffer size for the readdir(p) syscall, because readdir-ahead increases the buffer size to 128k and saves the dentries in its cache.
Just FYI: readdir-ahead works only for readdirp, not for readdir.
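A rough way to check how the readdir-ahead side is configured (volume name from this report used as an example; the rda-* keys are how I believe the buffer tunables are exposed, so confirm the exact names against your release):
# list the readdir-related options the volume actually exposes, including
# rda-request-size / rda-cache-limit if this release has them
gluster volume get the-volume all | grep -i 'readdir\|rda'
# note: since readdir-ahead only serves readdirp, a FUSE client mounted with
# use-readdirp=no would not see the benefit of that 128k buffer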
Gotcha, thanks for the quick reply. I will verify read-ahead is enabled and working correctly.
It's readdir-ahead, not read-ahead, just FYI.
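The two options can be checked independently to avoid mixing them up (volume name from this thread used as the example):
gluster volume get the-volume performance.read-ahead      # file data prefetch, not relevant here
gluster volume get the-volume performance.readdir-ahead   # directory entry prefetch, the one that matters for listings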