Certain files fail to read when accessed directly
Description of problem: I've run into an odd issue where certain files are not found when accessed directly by path.
When accessing a file directly like this:
# ls /mnt/path/to/some/file/file.json
ls: cannot access '/mnt/path/to/some/file/file.json': No such file or directory
The lookup fails. It doesn't matter which tool I use, or whether I access the file programmatically: the OS reports that the file doesn't exist (ENOENT).
Waiting and retrying gives the same result, even when I put the command in a loop with a delay between read attempts.
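The retry loop looked roughly like this. The path below is a local stand-in I created for illustration; on the affected Gluster mount the real path was the file.json shown above, and every attempt failed the same way:

```shell
# Placeholder path standing in for the affected file; on the real
# mount this was /mnt/path/to/some/file/file.json.
f=/tmp/retry-demo/file.json
mkdir -p "$(dirname "$f")" && : > "$f"

# Retry with a delay between attempts. On the affected Gluster
# mount every attempt returned ENOENT; against a plain local file
# (as here) the first attempt succeeds.
for i in 1 2 3 4 5; do
    if stat "$f" >/dev/null 2>&1; then
        echo "attempt $i: found"
        break
    fi
    echo "attempt $i: No such file or directory"
    sleep 1
done
```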
However, if I list the parent directory first, everything works. That is:
# ls /mnt/path/to/some/file/file.json
ls: cannot access '/mnt/path/to/some/file/file.json': No such file or directory
# ls /mnt/path/to/some/file/
dir1 dir2 file1.txt file2.txt file3.txt file.json
At this point I can successfully "ls" the file:
# ls /mnt/path/to/some/file/file.json
/mnt/path/to/some/file/file.json
It is curious, but seems to be isolated to certain directories/mount points.
Furthermore, the behavior is consistent across the directories in question: I can go to another client (a different machine with the same volume mounted) and reproduce the same failure, and the same workaround, on the affected directories.
Clients are using native Gluster FUSE mounts.
I did recently add the fourth replica pair of bricks, and the volume is currently undergoing a rebalance (the fix-layout phase has already completed).
I am throttling the full rebalance as 'lazy'.
Server load is slightly elevated due to the rebalance, but still well below (about half) the number of physical CPU cores, so I don't believe load is causing this.
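For reference, the throttle presumably maps onto the standard volume-set option (cluster.rebal-throttle appears as "lazy" in the volume info below); a sketch of the relevant CLI commands, using this volume's name:

```shell
# Throttle the rebalance; valid values are lazy, normal, and
# aggressive. "lazy" uses the fewest migration threads.
gluster volume set gv0 cluster.rebal-throttle lazy

# Start the data rebalance after adding bricks, and check on it:
gluster volume rebalance gv0 start
gluster volume rebalance gv0 status
```

These are configuration/operations commands that require a live cluster, so they are shown here only as a sketch.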
The exact command to reproduce the issue:
ls -lh /mnt/path/to/some/file/file.json
The full output of the command that failed:
ls: cannot access '/mnt/path/to/some/file/file.json': No such file or directory
Expected results:
-rw-rw-r-- 1 user group 1.1K Dec 22 2020 /mnt/path/to/some/file/file.json
Mandatory info:
- The output of the gluster volume info command:
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 6db232a9-e7a1-46ab-bca4-bc69cf5cb68e
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gluster00:/export/brick0/srv
Brick2: gluster01:/export/brick0/srv
Brick3: gluster00:/export/brick1/srv
Brick4: gluster01:/export/brick1/srv
Brick5: gluster00:/export/brick2/srv
Brick6: gluster01:/export/brick2/srv
Brick7: gluster00:/export/brick3/srv
Brick8: gluster01:/export/brick3/srv
Options Reconfigured:
cluster.rebal-throttle: lazy
server.outstanding-rpc-limit: 512
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.cache-size: 1GB
client.event-threads: 3
server.event-threads: 16
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
- The output of the gluster volume status command:
Status of volume: gv0
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster00:/export/brick0/srv 49152 0 Y 2708
Brick gluster01:/export/brick0/srv 49152 0 Y 2677
Brick gluster00:/export/brick1/srv 49153 0 Y 2725
Brick gluster01:/export/brick1/srv 49153 0 Y 2687
Brick gluster00:/export/brick2/srv 49154 0 Y 2738
Brick gluster01:/export/brick2/srv 49154 0 Y 2701
Brick gluster00:/export/brick3/srv 49155 0 Y 14733
Brick gluster01:/export/brick3/srv 49155 0 Y 19711
Self-heal Daemon on localhost N/A N/A Y 2825
Self-heal Daemon on gluster01 N/A N/A Y 2770
Task Status of Volume gv0
------------------------------------------------------------------------------
Task : Rebalance
ID : 20961d7c-c559-48d8-b9a9-f9393ecd7b4e
Status : in progress
- The output of the gluster volume heal command:
gluster volume heal gv0 info
Brick gluster00:/export/brick0/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick0/srv
Status: Connected
Number of entries: 0
Brick gluster00:/export/brick1/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick1/srv
Status: Connected
Number of entries: 0
Brick gluster00:/export/brick2/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick2/srv
Status: Connected
Number of entries: 0
Brick gluster00:/export/brick3/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick3/srv
Status: Connected
Number of entries: 0
gluster volume heal gv0 info split-brain
Brick gluster00:/export/brick0/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick0/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster00:/export/brick1/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick1/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster00:/export/brick2/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick2/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster00:/export/brick3/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick3/srv
Status: Connected
Number of entries in split-brain: 0
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/: zero log output related to either the failure or the success.
- Is there any crash? Provide the backtrace and coredump: no crash.
Additional info:
- Cluster was set up a couple of years ago on Gluster 6.x.
- Did an expansion (two replica pairs --> three replica pairs) and rebalance while on 6.x, with no issues.
- Upgraded Gluster 7.x -> 8.x -> 9.x; the op-version has been raised to 90000.
- Just completed another expansion (three replica pairs --> four replica pairs); the rebalance is underway now.
- The operating system / glusterfs version:
Server: CentOS Linux release 7.9.2009 (Core), kernel 3.10.0-1160.25.1.el7.x86_64
gluster packages:
# rpm -qa | grep gluster
centos-release-gluster9-1.0-1.el7.noarch
glusterfs-9.2-1.el7.x86_64
glusterfs-cli-9.2-1.el7.x86_64
glusterfs-client-xlators-9.2-1.el7.x86_64
glusterfs-fuse-9.2-1.el7.x86_64
glusterfs-server-9.2-1.el7.x86_64
libglusterd0-9.2-1.el7.x86_64
libglusterfs0-9.2-1.el7.x86_64
nfs-ganesha-gluster-3.5-1.el7.x86_64
Client: Issue can be replicated on multiple clients:
- Debian 10
- Ubuntu 18.04
- Ubuntu 16.04
Client Gluster packages are all version 9.2
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Following up here: the issue resolved itself after the rebalance completed. Since we have an extremely large data set, the rebalance took some time to finish.
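For anyone hitting the same symptom, rebalance progress can be watched with the standard status command (volume name gv0 from this report); once every node reports completed, direct lookups worked again without listing the parent directory first:

```shell
# Check migration progress per node; requires a live cluster.
gluster volume rebalance gv0 status
```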
Closing this issue as there has been no update since my last comment. If the issue is still valid, feel free to reopen it.