Gluster volume brick status: "Transport endpoint is not connected"
Description of problem: When checking multiple volumes and bricks with the gluster volume heal info command, we are presented with the following error on some bricks:
Brick virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Transport endpoint is not connected
Number of entries: -
What is noteworthy is that we mounted the gluster volume on a client machine, created a dummy file, and saw it replicate to all of the bricks, including the one reporting the error. We then deleted the file and watched the removal propagate through the remaining bricks as well. Replication appears to be working fine, yet the heal info status reports the brick as disconnected, which looks like a false negative.
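For reference, the replication check was roughly as follows (the client mount point is illustrative; the hostnames and brick paths are the ones from this report):

```
# mount the volume on a test client (native glusterfs mount)
mount -t glusterfs virtual-machine-71.vdc.com:/roi-prod-files /mnt/roi-prod-files

# create a dummy file and confirm it appears on every brick,
# including the "disconnected" one on virtual-machine-73
touch /mnt/roi-prod-files/replication-test
ls /var/lib/gluster/.bricks/roi-prod-files/replication-test   # run on each brick host

# delete it and confirm the removal propagates as well
rm /mnt/roi-prod-files/replication-test
```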
Please see log files and requested outputs below.
The exact command to reproduce the issue:
gluster volume heal roi-prod-files info
The full output of the command that failed:
Brick virtual-machine-71.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Transport endpoint is not connected
Number of entries: -
Brick virtual-machine-72.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-78.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-80.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-79.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Expected results: Status: Connected
Mandatory info:
- The output of the gluster volume info
command:
root@virtual-machine-71 # gluster volume info roi-prod-files
Volume Name: roi-prod-files
Type: Replicate
Volume ID: 219ac7a4-3aea-4aa4-aa62-b67abfd2c6fb
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 6 = 6
Transport-type: tcp
Bricks:
Brick1: virtual-machine-71.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick2: virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick3: virtual-machine-72.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick4: virtual-machine-78.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick5: virtual-machine-80.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick6: virtual-machine-79.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Options Reconfigured:
cluster.use-anonymous-inode: no
nfs.disable: on
transport.address-family: inet
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
auth.allow: 10.72.62.*,10.72.63.*,10.88.62.*,10.88.63.*
storage.owner-gid: 4659
storage.owner-uid: 4659
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
- The output of the gluster volume status
command:
root@virtual-machine-71 # gluster volume status roi-prod-files
Status of volume: roi-prod-files
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick virtual-machine-71.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files 49174 0 Y 59546
Brick virtual-machine-73.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files 49171 0 Y 3116
Brick virtual-machine-72.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files 49166 0 Y 25366
Brick virtual-machine-78.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files 49174 0 Y 46469
Brick virtual-machine-80.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files 49174 0 Y 63678
Brick virtual-machine-79.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files 49174 0 Y 41239
Self-heal Daemon on localhost N/A N/A Y 45375
Quota Daemon on localhost N/A N/A Y 45349
Self-heal Daemon on virtual-machine-73.vdc.com N/A N/A Y 3202
Quota Daemon on virtual-machine-73.vdc.com N/A N/A Y 3182
Self-heal Daemon on virtual-machine-72.vdc.com N/A N/A Y 41973
Quota Daemon on virtual-machine-72.vdc.com N/A N/A Y 41942
Self-heal Daemon on virtual-machine-79.vdc.com N/A N/A Y 55614
Quota Daemon on virtual-machine-79.vdc.com N/A N/A Y 55596
Self-heal Daemon on virtual-machine-78.vdc.com N/A N/A Y 32658
Quota Daemon on virtual-machine-78.vdc.com N/A N/A Y 32642
Self-heal Daemon on virtual-machine-80.vdc.com N/A N/A Y 2796
Quota Daemon on virtual-machine-80.vdc.com N/A N/A Y 2774
Task Status of Volume roi-prod-files
------------------------------------------------------------------------------
There are no active volume tasks
- The output of the gluster volume heal
command:
root@virtual-machine-71 # gluster volume heal roi-prod-files
Launching heal operation to perform index self heal on volume roi-prod-files has been successful
Use heal info commands to check status.
root@virtual-machine-71 # gluster volume heal roi-prod-files info
Brick virtual-machine-71.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Transport endpoint is not connected
Number of entries: -
Brick virtual-machine-72.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-78.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-80.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
Brick virtual-machine-79.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0
- Provide logs present on following locations of client and server nodes: /var/log/glusterfs/
logfile: glfsheal-roi-prod-files.log
[2021-08-11 12:07:29.549565 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-1: changing port to 49174 (from 0)
[2021-08-11 12:07:29.549586 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-1: intentional socket shutdown(10)
[2021-08-11 12:07:29.549828 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-2: changing port to 49166 (from 0)
[2021-08-11 12:07:29.549844 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-2: intentional socket shutdown(11)
[2021-08-11 12:07:29.550033 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-0: Connected, attached to remote volume [{conn-name=roi-prod-files-client-0}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.550059 +0000] I [MSGID: 108005] [afr-common.c:6065:__afr_handle_child_up_event] 0-roi-prod-files-replicate-0: Subvolume 'roi-prod-files-client-0' came back up; going online.
[2021-08-11 12:07:29.550840 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-1: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.550939 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-2: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.551391 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 0-roi-prod-files-client-1: failed to set the volume [{errno=2}, {error=No such file or directory}]
[2021-08-11 12:07:29.551414 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 0-roi-prod-files-client-1: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
[2021-08-11 12:07:29.551428 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-roi-prod-files-client-1: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
[2021-08-11 12:07:29.551442 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 0-roi-prod-files-client-1: sending CHILD_CONNECTING event []
[2021-08-11 12:07:29.551483 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 0-roi-prod-files-client-1: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=roi-prod-files-client-1}]
[2021-08-11 12:07:29.551698 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-3: changing port to 49174 (from 0)
[2021-08-11 12:07:29.551723 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-3: intentional socket shutdown(12)
[2021-08-11 12:07:29.551741 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-4: changing port to 49174 (from 0)
[2021-08-11 12:07:29.551754 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-4: intentional socket shutdown(13)
[2021-08-11 12:07:29.551899 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-5: changing port to 49174 (from 0)
[2021-08-11 12:07:29.551914 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-5: intentional socket shutdown(15)
[2021-08-11 12:07:29.552360 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-2: Connected, attached to remote volume [{conn-name=roi-prod-files-client-2}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.553656 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-3: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.553845 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-4: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.553967 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-5: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.555366 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-5: Connected, attached to remote volume [{conn-name=roi-prod-files-client-5}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.555412 +0000] I [MSGID: 108002] [afr-common.c:6435:afr_notify] 0-roi-prod-files-replicate-0: Client-quorum is met
[2021-08-11 12:07:29.555660 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-3: Connected, attached to remote volume [{conn-name=roi-prod-files-client-3}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.555854 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-4: Connected, attached to remote volume [{conn-name=roi-prod-files-client-4}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.559779 +0000] I [MSGID: 108031] [afr-common.c:3203:afr_local_discovery_cbk] 0-roi-prod-files-replicate-0: selecting local read_child roi-prod-files-client-0
[2021-08-11 12:07:29.560999 +0000] I [MSGID: 104041] [glfs-resolve.c:974:__glfs_active_subvol] 0-roi-prod-files: switched to graph [{subvol=766d3031-3932-3731-2d36-313634342d32}, {id=0}]
[2021-08-11 12:07:29.562450 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:911:client4_0_getxattr_cbk] 0-roi-prod-files-client-1: remote operation failed. [{path=/}, {gfid=00000000-0000-0000-0000-000000000001}, {key=glusterfs.xattrop_index_gfid}, {errno=107}, {error=Transport endpoint is not connected}]
[2021-08-11 12:07:29.562474 +0000] W [MSGID: 114029] [client-rpc-fops_v2.c:4442:client4_0_getxattr] 0-roi-prod-files-client-1: failed to send the fop []
- Is there any crash? Provide the backtrace and coredump
Additional info:
N/A
- The operating system / glusterfs version:
root@virtual-machine-71 # glusterd --version
glusterfs 9.3
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
root@virtual-machine-71 # glusterfs --version
glusterfs 9.3
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
root@virtual-machine-71 # cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)
[2021-08-11 12:07:29.549565 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-1: changing port to 49174 (from 0)
The log is supposed to say 49171 for roi-prod-files-client-1 (the brick on virtual-machine-73), as per the volume status output. @nik-redhat Do you know if anything in glusterd could affect this?
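A quick way to cross-check the mismatch (a sketch; the log file name is the one from the output above, and the grep patterns may need adjusting):

```
# port the heal client is told to use for each brick subvolume
grep 'changing port' /var/log/glusterfs/glfsheal-roi-prod-files.log | tail -n 6

# port glusterd advertises for the virtual-machine-73 brick (49171 above)
gluster volume status roi-prod-files | grep -A1 virtual-machine-73

# on virtual-machine-73 itself: what the brick process is actually listening on
ss -tlnp | grep glusterfsd
```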
We recently fixed a variant of this bug in https://github.com/gluster/glusterfs/issues/2480. @mmn01-sky, how did you get into this state?
Hi @pranithk, #2480 could be it.
We had some issues this morning with our gluster cluster which forced us to restart the glusterd service on all our nodes. We then spotted this issue on a different volume, rebooted that node, and saw the issue simply move to another volume.
@Adam2Marsh I think it is better to check netstat -anlp on each brick machine and see if the port information of the brick matches the volume status output. If it doesn't, bring that brick down and bring it back up using volume start force until the port numbers are shown correctly (see the sketch below). Please make sure the files are healed before bringing the next brick down.
PS: I have never seen anyone use 6 bricks for replication either.
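Something along these lines per brick host (a sketch of the suggestion above; <brick-pid> stands for the Pid column shown by gluster volume status):

```
# does the brick process listen on the port that volume status reports?
gluster volume status roi-prod-files
netstat -anlp | grep glusterfsd | grep LISTEN

# if the ports disagree: take that one brick down (Pid from volume status),
# then force-start the volume so glusterd respawns the brick
kill <brick-pid>
gluster volume start roi-prod-files force

# wait until heal info shows 0 entries before moving to the next brick
gluster volume heal roi-prod-files info
```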
@Adam2Marsh Now that I think about it, bring both glusterfsd and glusterd down, in that order. Bring glusterd back up and then do gluster volume start force. Wait for the heal to complete. This way the bug won't repeat. Maybe you should do it at off hours so that the heal queue is not too big.
This is per machine by the way.
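Roughly, per node and one node at a time (a sketch of the above; the service names assume the stock systemd units shipped with the RHEL 7 gluster packages):

```
# stop the brick processes first, then the management daemon
pkill glusterfsd
systemctl stop glusterd

# bring glusterd back up and force-start the volume so the bricks
# come back with correctly registered ports
systemctl start glusterd
gluster volume start roi-prod-files force

# wait for heal to complete before moving on to the next node
watch gluster volume heal roi-prod-files info
```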
Hi @pranithk
Sorry for the delay on this. Just to say we followed the above process but still occasionally see this issue on different volumes. It normally happens after a brick in a volume goes down; after we start it back up we may see this error, but give it 24 hours and it all looks good again.
Thank you for your contributions. We noticed that this issue has had no activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there has been no update since my last comment. If the issue is still valid, feel free to reopen it.