Certain files fail to read when accessed directly
Description of problem: I've run into an odd issue where certain files are not found when accessed directly by path.
When accessing a file directly like this:
# ls /mnt/path/to/some/file/file.json
ls: cannot access '/mnt/path/to/some/file/file.json': No such file or directory
The lookup fails. It doesn't matter which tool I use, or whether I access the file programmatically: the OS reports that the file doesn't exist (ENOENT).
Waiting and retrying gives the same result, even when I put the command in a loop with a delay between read attempts.
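The retry loop looked roughly like this. The path below is a local stand-in I created for illustration; on the affected Gluster mount the real path was the file.json shown above, and every attempt failed the same way:

```shell
# Placeholder path standing in for the affected file; on the real
# mount this was /mnt/path/to/some/file/file.json.
f=/tmp/retry-demo/file.json
mkdir -p "$(dirname "$f")" && : > "$f"

# Retry with a delay between attempts. On the affected Gluster
# mount every attempt returned ENOENT; against a plain local file
# (as here) the first attempt succeeds.
for i in 1 2 3 4 5; do
    if stat "$f" >/dev/null 2>&1; then
        echo "attempt $i: found"
        break
    fi
    echo "attempt $i: No such file or directory"
    sleep 1
done
```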
However, if I list the parent directory first, everything works. That is:
# ls /mnt/path/to/some/file/file.json
ls: cannot access '/mnt/path/to/some/file/file.json': No such file or directory
# ls /mnt/path/to/some/file/
dir1 dir2 file1.txt file2.txt file3.txt file.json
At this point I can successfully "ls" the file:
# ls /mnt/path/to/some/file/file.json
/mnt/path/to/some/file/file.json
It is curious, but seems to be isolated to certain directories/mount points.
Furthermore, the behavior is consistent across the directories in question: I can go to another client (a different machine with the same volume mounted) and reproduce the same failure, and the same workaround, on the affected directories.
Clients are using native Gluster FUSE mounts.
I did recently add the fourth replica pair of bricks, and the volume is currently undergoing a rebalance (the fix-layout phase has already completed).
I am throttling the full rebalance as 'lazy'.
Server load is slightly elevated due to the rebalance, but still well below (about half) the number of physical CPU cores, so I don't believe load is causing this.
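For reference, the throttle presumably maps onto the standard volume-set option (cluster.rebal-throttle appears as "lazy" in the volume info below); a sketch of the relevant CLI commands, using this volume's name:

```shell
# Throttle the rebalance; valid values are lazy, normal, and
# aggressive. "lazy" uses the fewest migration threads.
gluster volume set gv0 cluster.rebal-throttle lazy

# Start the data rebalance after adding bricks, and check on it:
gluster volume rebalance gv0 start
gluster volume rebalance gv0 status
```

These are configuration/operations commands that require a live cluster, so they are shown here only as a sketch.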
The exact command to reproduce the issue:
ls -lh /mnt/path/to/some/file/file.json
The full output of the command that failed:
ls: cannot access '/mnt/path/to/some/file/file.json': No such file or directory
Expected results:
-rw-rw-r-- 1 user group 1.1K Dec 22 2020 /mnt/path/to/some/file/file.json
Mandatory info:
- The output of the gluster volume info command:
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 6db232a9-e7a1-46ab-bca4-bc69cf5cb68e
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gluster00:/export/brick0/srv
Brick2: gluster01:/export/brick0/srv
Brick3: gluster00:/export/brick1/srv
Brick4: gluster01:/export/brick1/srv
Brick5: gluster00:/export/brick2/srv
Brick6: gluster01:/export/brick2/srv
Brick7: gluster00:/export/brick3/srv
Brick8: gluster01:/export/brick3/srv
Options Reconfigured:
cluster.rebal-throttle: lazy
server.outstanding-rpc-limit: 512
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.cache-size: 1GB
client.event-threads: 3
server.event-threads: 16
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
- The output of the gluster volume status command:
Status of volume: gv0
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster00:/export/brick0/srv 49152 0 Y 2708
Brick gluster01:/export/brick0/srv 49152 0 Y 2677
Brick gluster00:/export/brick1/srv 49153 0 Y 2725
Brick gluster01:/export/brick1/srv 49153 0 Y 2687
Brick gluster00:/export/brick2/srv 49154 0 Y 2738
Brick gluster01:/export/brick2/srv 49154 0 Y 2701
Brick gluster00:/export/brick3/srv 49155 0 Y 14733
Brick gluster01:/export/brick3/srv 49155 0 Y 19711
Self-heal Daemon on localhost N/A N/A Y 2825
Self-heal Daemon on gluster01 N/A N/A Y 2770
Task Status of Volume gv0
------------------------------------------------------------------------------
Task : Rebalance
ID : 20961d7c-c559-48d8-b9a9-f9393ecd7b4e
Status : in progress
- The output of the gluster volume heal command:
gluster volume heal gv0 info
Brick gluster00:/export/brick0/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick0/srv
Status: Connected
Number of entries: 0
Brick gluster00:/export/brick1/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick1/srv
Status: Connected
Number of entries: 0
Brick gluster00:/export/brick2/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick2/srv
Status: Connected
Number of entries: 0
Brick gluster00:/export/brick3/srv
Status: Connected
Number of entries: 0
Brick gluster01:/export/brick3/srv
Status: Connected
Number of entries: 0
gluster volume heal gv0 info split-brain
Brick gluster00:/export/brick0/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick0/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster00:/export/brick1/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick1/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster00:/export/brick2/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick2/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster00:/export/brick3/srv
Status: Connected
Number of entries in split-brain: 0
Brick gluster01:/export/brick3/srv
Status: Connected
Number of entries in split-brain: 0
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/: zero log output related to either the failure or the success.
- Is there any crash? Provide the backtrace and coredump: no crash.
Additional info:
- Cluster was set up a couple of years ago on Gluster 6.x.
- Did an expansion (two replica pairs --> three replica pairs) and rebalance while on 6.x, with no issues.
- Upgraded Gluster 7.x -> 8.x -> 9.x; the op-version has been raised to 90000.
- Just completed another expansion (three replica pairs --> four replica pairs); the rebalance is underway now.
- The operating system / glusterfs version:
Server: CentOS Linux release 7.9.2009 (Core), kernel 3.10.0-1160.25.1.el7.x86_64
gluster packages:
# rpm -qa | grep gluster
centos-release-gluster9-1.0-1.el7.noarch
glusterfs-9.2-1.el7.x86_64
glusterfs-cli-9.2-1.el7.x86_64
glusterfs-client-xlators-9.2-1.el7.x86_64
glusterfs-fuse-9.2-1.el7.x86_64
glusterfs-server-9.2-1.el7.x86_64
libglusterd0-9.2-1.el7.x86_64
libglusterfs0-9.2-1.el7.x86_64
nfs-ganesha-gluster-3.5-1.el7.x86_64
Client: Issue can be replicated on multiple clients:
- Debian 10
- Ubuntu 18.04
- Ubuntu 16.04
Client Gluster packages are all version 9.2
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Following up here: the issue resolved itself after the rebalance completed. Since we have an extremely large data set, the rebalance took some time to finish.
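For anyone hitting the same symptom, rebalance progress can be watched with the standard status command (volume name gv0 from this report); once every node reports completed, direct lookups worked again without listing the parent directory first:

```shell
# Check migration progress per node; requires a live cluster.
gluster volume rebalance gv0 status
```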
Closing this issue as there has been no update since my last comment. If the issue is still valid, feel free to reopen it.