cannot heal file with correct md5sum, size, ... - no split-brain?
I have a replica 3 cluster. While some files are present in the raw brick directories, I cannot access them through the mount point (transport endpoint not connected). The example file has the same md5sum and size on all bricks:
$ for i in loc1 loc2 loc3 ; do ssh $i md5sum /var/glusterfs/testfile1.csv ; done
0989e3e21519239ceaff890363626d79 /var/glusterfs/testfile1.csv
0989e3e21519239ceaff890363626d79 /var/glusterfs/testfile1.csv
0989e3e21519239ceaff890363626d79 /var/glusterfs/testfile1.csv
$ for i in loc1 loc2 loc3 ; do ssh $i ls -ahln /var/glusterfs/testfile1.csv ; done
-rw-rw-r-- 2 500 1000 21K Mar 27 02:16 /var/glusterfs/testfile1.csv
-rw-rw-r-- 2 500 1000 21K Mar 27 02:16 /var/glusterfs/testfile1.csv
-rw-rw-r-- 2 500 1000 21K Mar 27 02:16 /var/glusterfs/testfile1.csv
But the extended attributes differ, and accessing the file through the mount point fails:
# getfattr on mounted fs
$ for i in loc1 loc2 loc3 ; do ssh $i sudo LC_ALL=POSIX getfattr -m ^ -d -R -- /data/glusterfs/testfile1.csv ; done
getfattr: /data/glusterfs/testfile1.csv: Transport endpoint is not connected
getfattr: /data/glusterfs/testfile1.csv: Transport endpoint is not connected
getfattr: /data/glusterfs/testfile1.csv: Transport endpoint is not connected
# getfattr on raw glusterfs dir on each brick
$ for i in loc1 loc2 loc3 ; do ssh $i sudo getfattr -m ^ -d -R -- /var/glusterfs/testfile1.csv ; done
# file: var/glusterfs/testfile1.csv
trusted.afr.dirty=0sAAAAAAAAAAAAAAAA
trusted.afr.my_replica-client-0=0sAAAAAgAAAAEAAAAA
trusted.gfid=0s/r0/qkhMRG+Bk/S5oAVmRw==
trusted.gfid2path.a5d90a0fc8fcd6d1="686e017c-e69e-459e-ba14-12222e934fc4/testfile1.csv"
trusted.glusterfs.mdata=0sAQAAAAAAAAAAAAAAAGfkptIAAAAAMJfMHQAAAABn5KbSAAAAADCXzB0AAAAAZ+Sl+QAAAAABSqfb
getfattr: Removing leading '/' from absolute path names
# file: var/glusterfs/testfile1.csv
trusted.gfid=0s/r0/qkhMRG+Bk/S5oAVmRw==
getfattr: Removing leading '/' from absolute path names
# file: var/glusterfs/testfile1.csv
trusted.afr.dirty=0sAAAAAAAAAAAAAAAA
trusted.afr.my_replica-client-0=0sAAAAAQAAAAAAAAAA
trusted.gfid=0sckqXO8BCR3mfkeVcTXUk4g==
trusted.gfid2path.a5d90a0fc8fcd6d1="686e017c-e69e-459e-ba14-12222e934fc4/testfile1.csv"
trusted.glusterfs.mdata=0sAQAAAAAAAAAAAAAAAGfkptIAAAAAMJfMHQAAAABn5KbSAAAAADCXzB0AAAAAZ+Sm0gAAAAAmTjLv
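A side note on reading those values: the 0s prefix in getfattr output marks a base64-encoded value. Each trusted.afr.* xattr packs three big-endian 32-bit pending counters (data, metadata, entry), and trusted.gfid is the file's 16-byte UUID, so they can be decoded locally with coreutils (a sketch; the base64 strings are copied from the output above):

```shell
# Decode the AFR changelog xattrs: three big-endian uint32 counters
# (data, metadata, entry pending operations), copied from above.
for v in AAAAAgAAAAEAAAAA AAAAAQAAAAAAAAAA; do
  printf '%s -> ' "$v"
  printf '%s' "$v" | base64 -d | od -An -t u4 --endian=big
done
# first value decodes to: 2 1 0, second to: 1 0 0

# Decode the two trusted.gfid values into raw hex to compare them byte-wise.
for g in '/r0/qkhMRG+Bk/S5oAVmRw==' 'ckqXO8BCR3mfkeVcTXUk4g=='; do
  printf '%s' "$g" | base64 -d | od -An -t x1
done
```

Note that od needs --endian=big (GNU coreutils) here; without it, x86 hosts would print the counters byte-swapped. The decode also makes it easy to see whether the trusted.gfid bytes actually match across bricks.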
The file cannot be healed using gluster volume heal my_replica full. Healing via the split-brain commands (bigger-file, latest-mtime) fails, telling me there is no split-brain.
How can I get out of this? It seems clear that a valid (and identical) file exists on all bricks. But what stops the SHD from healing? The troubleshooting hints in the documentation didn't get me out of this (maybe I did not understand them well enough).
Regards Marco
Just thinking: if I do have a good copy of the file (e.g. in each brick's filesystem), is it appropriate to fix the problem manually like this:
- back up the relevant file(s) from one brick's glusterfs folder (/var/glusterfs/...)
- remove the file(s) from the glusterfs folders of all bricks
- copy/restore the file(s) from the backup to the mounted gluster fs (/data/glusterfs/...)
Is this a recommended fix when auto healing does not work?
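One detail worth knowing if you try the manual route: each brick also keeps a hidden hardlink to the file under <brick>/.glusterfs/<xx>/<yy>/<gfid>, so deleting only the visible file leaves a stale gfid link behind. A local sketch of the cleanup mechanics, using a simulated brick in a temp dir (the aa/bb gfid path is made up; a real brick derives it from the trusted.gfid xattr, and you would run this per host before restoring through the mount):

```shell
# Simulate a brick layout to show the .glusterfs hardlink cleanup step.
brick=$(mktemp -d)
mkdir -p "$brick/.glusterfs/aa/bb"
printf 'data\n' > "$brick/testfile1.csv"
ln "$brick/testfile1.csv" \
   "$brick/.glusterfs/aa/bb/aabbcccc-0000-0000-0000-000000000000"

# Remove the hidden hardlink first (found by shared inode), then the file:
find "$brick/.glusterfs" -samefile "$brick/testfile1.csv" -delete
rm "$brick/testfile1.csv"

ls -A "$brick/.glusterfs/aa/bb"   # empty: no stale gfid link left behind
rm -r "$brick"
```

This is only a sketch of the hardlink bookkeeping, not an official recovery procedure; on a live volume you would do the removal on every brick and then copy the backed-up file in through the mount point so gluster recreates gfid and xattrs itself.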
@mlechner Could you attach the mount logs from when the "transport endpoint is not connected" issue is happening, so that we can see what the problem could be? Also, could you give the output of getfattr with the -e hex flag?