
Gluster volume brick status: "Transport endpoint is not connected"

Open mmn01-sky opened this issue 3 years ago • 7 comments

Description of problem: When running the gluster volume heal info command against multiple volumes, each with multiple bricks, we are presented with the following error for some bricks:

Brick virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Transport endpoint is not connected
Number of entries: -

What is noteworthy is that we mounted the gluster volume onto a machine, created a dummy file, and saw it replicate across all of the other bricks, including the one reporting the error above. We then deleted the file and watched the change propagate through the remaining bricks as well. It's as though everything is working fine and the status is returning a false negative.
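
For reference, a rough sketch of the check we performed (the mount point and file name below are just placeholders, and the mount assumes the native FUSE client):

mount -t glusterfs virtual-machine-71.vdc.com:/roi-prod-files /mnt/roi-test
touch /mnt/roi-test/replication-test.txt
# on each brick node, the file should appear under the brick path:
ls -l /var/lib/gluster/.bricks/roi-prod-files/replication-test.txt
# deleting it from the client mount should likewise remove it from every brick
rm /mnt/roi-test/replication-test.txt
umount /mnt/roi-test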

Please see log files and requested outputs below.

The exact command to reproduce the issue: gluster volume heal roi-prod-files info

The full output of the command that failed:

Brick virtual-machine-71.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Transport endpoint is not connected
Number of entries: -

Brick virtual-machine-72.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-78.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-80.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-79.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Expected results: Status: Connected

Mandatory info: - The output of the gluster volume info command:

root@virtual-machine-71 # gluster volume info roi-prod-files
Volume Name: roi-prod-files
Type: Replicate
Volume ID: 219ac7a4-3aea-4aa4-aa62-b67abfd2c6fb
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 6 = 6
Transport-type: tcp
Bricks:
Brick1: virtual-machine-71.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick2: virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick3: virtual-machine-72.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick4: virtual-machine-78.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick5: virtual-machine-80.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Brick6: virtual-machine-79.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Options Reconfigured:
cluster.use-anonymous-inode: no
nfs.disable: on
transport.address-family: inet
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
auth.allow: 10.72.62.*,10.72.63.*,10.88.62.*,10.88.63.*
storage.owner-gid: 4659
storage.owner-uid: 4659
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on

- The output of the gluster volume status command:

root@virtual-machine-71 # gluster volume status roi-prod-files
Status of volume: roi-prod-files
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick virtual-machine-71.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files        49174     0          Y       59546
Brick virtual-machine-73.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files        49171     0          Y       3116
Brick virtual-machine-72.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files        49166     0          Y       25366
Brick virtual-machine-78.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files        49174     0          Y       46469
Brick virtual-machine-80.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files        49174     0          Y       63678
Brick virtual-machine-79.vdc.com:/var/lib/gluster/.
bricks/roi-prod-files        49174     0          Y       41239
Self-heal Daemon on localhost               N/A       N/A        Y       45375
Quota Daemon on localhost                   N/A       N/A        Y       45349
Self-heal Daemon on virtual-machine-73.vdc.com      N/A       N/A        Y       3202
Quota Daemon on virtual-machine-73.vdc.com          N/A       N/A        Y       3182
Self-heal Daemon on virtual-machine-72.vdc.com      N/A       N/A        Y       41973
Quota Daemon on virtual-machine-72.vdc.com          N/A       N/A        Y       41942
Self-heal Daemon on virtual-machine-79.vdc.com      N/A       N/A        Y       55614
Quota Daemon on virtual-machine-79.vdc.com          N/A       N/A        Y       55596
Self-heal Daemon on virtual-machine-78.vdc.com      N/A       N/A        Y       32658
Quota Daemon on virtual-machine-78.vdc.com          N/A       N/A        Y       32642
Self-heal Daemon on virtual-machine-80.vdc.com      N/A       N/A        Y       2796
Quota Daemon on virtual-machine-80.vdc.com          N/A       N/A        Y       2774

Task Status of Volume roi-prod-files
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

root@virtual-machine-71 # gluster volume heal roi-prod-files
Launching heal operation to perform index self heal on volume roi-prod-files has been successful
Use heal info commands to check status.


root@virtual-machine-71 # gluster volume heal roi-prod-files info
Brick virtual-machine-71.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-73.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Transport endpoint is not connected
Number of entries: -

Brick virtual-machine-72.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-78.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-80.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

Brick virtual-machine-79.vdc.com:/var/lib/gluster/.bricks/roi-prod-files
Status: Connected
Number of entries: 0

- Provide logs present on following locations of client and server nodes: /var/log/glusterfs/

logfile: glfsheal-roi-prod-files.log

[2021-08-11 12:07:29.549565 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-1: changing port to 49174 (from 0)
[2021-08-11 12:07:29.549586 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-1: intentional socket shutdown(10)
[2021-08-11 12:07:29.549828 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-2: changing port to 49166 (from 0)
[2021-08-11 12:07:29.549844 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-2: intentional socket shutdown(11)
[2021-08-11 12:07:29.550033 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-0: Connected, attached to remote volume [{conn-name=roi-prod-files-client-0}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.550059 +0000] I [MSGID: 108005] [afr-common.c:6065:__afr_handle_child_up_event] 0-roi-prod-files-replicate-0: Subvolume 'roi-prod-files-client-0' came back up; going online.
[2021-08-11 12:07:29.550840 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-1: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.550939 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-2: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.551391 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 0-roi-prod-files-client-1: failed to set the volume [{errno=2}, {error=No such file or directory}]
[2021-08-11 12:07:29.551414 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 0-roi-prod-files-client-1: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
[2021-08-11 12:07:29.551428 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-roi-prod-files-client-1: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
[2021-08-11 12:07:29.551442 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 0-roi-prod-files-client-1: sending CHILD_CONNECTING event []
[2021-08-11 12:07:29.551483 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 0-roi-prod-files-client-1: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=roi-prod-files-client-1}]
[2021-08-11 12:07:29.551698 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-3: changing port to 49174 (from 0)
[2021-08-11 12:07:29.551723 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-3: intentional socket shutdown(12)
[2021-08-11 12:07:29.551741 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-4: changing port to 49174 (from 0)
[2021-08-11 12:07:29.551754 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-4: intentional socket shutdown(13)
[2021-08-11 12:07:29.551899 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-5: changing port to 49174 (from 0)
[2021-08-11 12:07:29.551914 +0000] I [socket.c:849:__socket_shutdown] 0-roi-prod-files-client-5: intentional socket shutdown(15)
[2021-08-11 12:07:29.552360 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-2: Connected, attached to remote volume [{conn-name=roi-prod-files-client-2}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.553656 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-3: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.553845 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-4: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.553967 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-roi-prod-files-client-5: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-08-11 12:07:29.555366 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-5: Connected, attached to remote volume [{conn-name=roi-prod-files-client-5}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.555412 +0000] I [MSGID: 108002] [afr-common.c:6435:afr_notify] 0-roi-prod-files-replicate-0: Client-quorum is met
[2021-08-11 12:07:29.555660 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-3: Connected, attached to remote volume [{conn-name=roi-prod-files-client-3}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.555854 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-roi-prod-files-client-4: Connected, attached to remote volume [{conn-name=roi-prod-files-client-4}, {remote_subvol=/var/lib/gluster/.bricks/roi-prod-files}]
[2021-08-11 12:07:29.559779 +0000] I [MSGID: 108031] [afr-common.c:3203:afr_local_discovery_cbk] 0-roi-prod-files-replicate-0: selecting local read_child roi-prod-files-client-0
[2021-08-11 12:07:29.560999 +0000] I [MSGID: 104041] [glfs-resolve.c:974:__glfs_active_subvol] 0-roi-prod-files: switched to graph [{subvol=766d3031-3932-3731-2d36-313634342d32}, {id=0}]
[2021-08-11 12:07:29.562450 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:911:client4_0_getxattr_cbk] 0-roi-prod-files-client-1: remote operation failed. [{path=/}, {gfid=00000000-0000-0000-0000-000000000001}, {key=glusterfs.xattrop_index_gfid}, {errno=107}, {error=Transport endpoint is not connected}]
[2021-08-11 12:07:29.562474 +0000] W [MSGID: 114029] [client-rpc-fops_v2.c:4442:client4_0_getxattr] 0-roi-prod-files-client-1: failed to send the fop []

- Is there any crash? Provide the backtrace and coredump

Additional info:

N/A

- The operating system / glusterfs version:

root@virtual-machine-71 # glusterd --version
glusterfs 9.3
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.


root@virtual-machine-71 # glusterfs --version
glusterfs 9.3
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

root@virtual-machine-71 # cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

mmn01-sky avatar Aug 11 '21 12:08 mmn01-sky

[2021-08-11 12:07:29.549565 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-roi-prod-files-client-1: changing port to 49174 (from 0)

The log is supposed to say 49171 as per the volume status output. @nik-redhat Do you know if anything in glusterd could affect this?
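
One way to confirm which port glusterd has recorded for that brick (a sketch only; the brick-info file name under /var/lib/glusterd is derived from the hostname and brick path, so adjust it for your layout):

# port glusterd has stored for the brick on virtual-machine-73:
grep listen-port /var/lib/glusterd/vols/roi-prod-files/bricks/virtual-machine-73.vdc.com:-var-lib-gluster-.bricks-roi-prod-files
# port advertised by volume status (49171 in the output above):
gluster volume status roi-prod-files | grep -A1 virtual-machine-73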

We recently fixed a variant of this bug in https://github.com/gluster/glusterfs/issues/2480. @mmn01-sky, how did you get into this state?

pranithk avatar Aug 11 '21 13:08 pranithk

Hi @pranithk, #2480 could be it.

We had some issues this morning with our gluster cluster that caused us to restart the glusterd service on all our nodes. We then spotted this issue on a different volume, rebooted the node, and saw the issue simply move to another volume.

Adam2Marsh avatar Aug 11 '21 13:08 Adam2Marsh

@Adam2Marsh I think it is better to check netstat -anlp on each brick machine and see if the port information of the brick matches the volume status output. If it doesn't, bring that brick down and bring it back up using volume start force until the port numbers are shown correctly. Please make sure to heal the files before bringing the next brick down.
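
A minimal sketch of that check, run on each brick node (the grep patterns are only illustrative):

# port glusterd advertises for this node's brick:
gluster volume status roi-prod-files | grep -A1 "$(hostname -f)"
# ports the brick processes are actually listening on:
netstat -anlp | grep glusterfsd | grep LISTEN
# if they do not match, stop that brick process (PID from volume status) and bring it back:
kill <brick-pid-from-volume-status>
gluster volume start roi-prod-files force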

PS: I have never seen anyone use 6 bricks for replication either.

pranithk avatar Aug 11 '21 13:08 pranithk

@Adam2Marsh Now that I think about it, bring both glusterfsd and glusterd down, in that order. Bring glusterd back up and then run gluster volume start force. Wait for the heal to complete. This way the bug won't repeat. Maybe you should do it during off hours so that the heal queue is not that big.

pranithk avatar Aug 11 '21 13:08 pranithk

The above procedure is per machine, by the way.
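
A rough per-node sketch of that procedure (service names assume systemd, as on RHEL 7; note that pkill takes down every brick process on the node, not just this volume's):

pkill glusterfsd                              # bring the brick processes down first
systemctl stop glusterd
systemctl start glusterd
gluster volume start roi-prod-files force     # re-spawn the bricks with the correct ports
# wait until heal info shows 0 entries on every brick before moving to the next node:
gluster volume heal roi-prod-files info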

pranithk avatar Aug 11 '21 13:08 pranithk

Hi @pranithk

Sorry for the delay on this. Just to say we followed the above process, but we do occasionally still see this issue on different volumes. Normally it happens after a brick in a volume goes down; after we start it back up we may see this error, but give it 24 hours and it all looks good again.

Adam2Marsh avatar Nov 25 '21 15:11 Adam2Marsh

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months. We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Jul 10 '22 06:07 stale[bot]

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.

stale[bot] avatar Nov 01 '22 21:11 stale[bot]