Gluster 11.0 brick crash
Description of problem:
In a 3-replica cluster under heavy write load, one of the bricks goes offline.
Mandatory info:
- The output of the gluster volume info command:
Volume Name: share
Type: Distributed-Replicate
Volume ID: 08d4902f-5f00-43eb-b068-4e350b67706b
Status: Started
Snapshot Count: 3
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick
Brick2: cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick
Brick3: cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.readdir-ahead: on
performance.parallel-readdir: on
performance.write-behind: off
performance.cache-size: 1GB
performance.cache-max-file-size: 1MB
features.barrier: disable
- The output of the gluster volume status command:
gluster v status share
Status of volume: share
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick 58984 0 Y 1183
Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick 60553 0 Y 1176
Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick 59748 0 N 9643
Self-heal Daemon on localhost N/A N/A Y 1219
Self-heal Daemon on cu-glstr-03-cl1 N/A N/A Y 9679
Self-heal Daemon on cu-glstr-02-cl1 N/A N/A Y 1216
- The output of the gluster volume heal command:
gluster v heal share info
Brick cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick
/a7f2g/MessagePreviews/132227_268x321.jpg
/a7f2g/MessagePreviews
/a7f2g/MessagePreviews/132227_90x110.jpg
/a7f2g/MessagePreviews/132228_268x321.jpg
/a7f2g/MessagePreviews/132228_90x110.jpg
/a7f2g/MessagePreviews/132229_268x321.jpg
/a7f2g/MessagePreviews/132229_90x110.jpg
Status: Connected
Number of entries: 7
Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick
/a7f2g/MessagePreviews/132227_268x321.jpg
/a7f2g/MessagePreviews
/a7f2g/MessagePreviews/132227_90x110.jpg
/a7f2g/MessagePreviews/132228_268x321.jpg
/a7f2g/MessagePreviews/132228_90x110.jpg
/a7f2g/MessagePreviews/132229_268x321.jpg
/a7f2g/MessagePreviews/132229_90x110.jpg
Status: Connected
Number of entries: 7
Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick
Status: Connected
Number of entries: 0
- Provide logs present on the following locations of client and server nodes: /var/log/glusterfs/
On the client side I have a lot of:
[2023-03-29 10:46:41.473191 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:712:client4_0_writev_cbk] 0-share-client-2: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473349 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(GETXATTR(18)) called at 2023-03-29 10:45:41 +0000 (xid=0xc98c59)
[2023-03-29 10:46:41.473366 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:925:client4_0_getxattr_cbk] 0-share-client-2: remote operation failed. [{path=/i1b0/images/5}, {gfid=591c4688-0df7-484f-8395-3494fe62a5aa}, {key=glusterfs.get_real_filename:01_foto carbonara_ridotta.jpg}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473420 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-share-client-2: remote operation failed. [{path=/i1b0/images/5}, {gfid=591c4688-0df7-484f-8395-3494fe62a5aa}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473429 +0000] W [MSGID: 114029] [client-rpc-fops_v2.c:2991:client4_0_lookup] 0-share-client-2: failed to send the fop []
[2023-03-29 10:46:41.473505 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:42 +0000 (xid=0xc98c5a)
[2023-03-29 10:46:41.473516 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-share-client-2: remote operation failed. [{path=/}, {gfid=00000000-0000-0000-0000-000000000001}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473649 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:42 +0000 (xid=0xc98c5b)
[2023-03-29 10:46:41.473835 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:44 +0000 (xid=0xc98c5c)
glusterd.log on the node with the failed brick:
[2023-03-29 10:40:40.373980 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:40:45.204118 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:45:40.850214 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:45:46.776439 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:47:51.671709 +0000] I [MSGID: 106143] [glusterd-pmap.c:353:pmap_port_remove] 0-pmap: removing brick (null) on port 53885
[2023-03-29 10:47:51.691544 +0000] I [MSGID: 106005] [glusterd-handler.c:6419:__glusterd_brick_rpc_notify] 0-management: Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick has disconnected from glusterd.
[2023-03-29 10:47:51.692042 +0000] I [MSGID: 106143] [glusterd-pmap.c:353:pmap_port_remove] 0-pmap: removing brick /data/glusterfs/share/brick1/brick on port 53885
[2023-03-29 10:50:41.903172 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:50:47.412646 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:55:43.525353 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
- Is there any crash? Provide the backtrace and coredump
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: pending frames:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: frame : type(1) op(WRITE)
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: patchset: git://git.gluster.org/glusterfs.git
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: signal received: 11
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: time of crash:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: 2023-03-29 10:45:40 +0000
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: configuration details:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: argp 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: backtrace 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: dlfcn 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: libpthread 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: llistxattr 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: setfsid 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: epoll.h 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: xattr.h 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: st_atim.tv_nsec 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: package-string: glusterfs 11.0
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: ---------
The failed brick's log:
[2023-03-29 10:45:38.868149 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-share-posix: <gfid:c71d188b-ce42-4854-a65e-96a060885c29>/1730/TERRA SANTA PARTENZA CONFERMATA_page-0001(0).jpg: inode path not completely resolved. Asking for full path
[2023-03-29 10:45:40.128697 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-share-posix: <gfid:c71d188b-ce42-4854-a65e-96a060885c29>/294/zanzibar300x300.jpg: inode path not completely resolved. Asking for full path
[2023-03-29 10:45:40.857960 +0000] I [addr.c:52:compare_addr_and_update] 0-/data/glusterfs/share/brick1/brick: allowed = "*", received addr = "192.168.56.112"
[2023-03-29 10:45:40.857981 +0000] I [login.c:109:gf_auth] 0-auth/login: allowed user names: ad7fcb45-86cc-451d-96e9-a9a718f2eeea
[2023-03-29 10:45:40.857988 +0000] I [MSGID: 115029] [server-handshake.c:645:server_setvolume] 0-share-server: accepted client from CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0 (version: 11.0) with subvol /data/glusterfs/share/brick1/brick
[2023-03-29 10:45:40.889198 +0000] W [socket.c:751:__socket_rwv] 0-tcp.share-server: readv on 192.168.56.112:49146 failed (No data available)
[2023-03-29 10:45:40.889240 +0000] I [MSGID: 115036] [server.c:494:server_rpc_notify] 0-share-server: disconnecting connection [{client-uid=CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0}]
[2023-03-29 10:45:40.889391 +0000] I [MSGID: 101054] [client_t.c:374:gf_client_unref] 0-share-server: Shutting down connection CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0
[2023-03-29 10:45:40.889396 +0000] I [socket.c:3653:socket_submit_outgoing_msg] 0-tcp.share-server: not connected (priv->connected = -1)
[2023-03-29 10:45:40.889433 +0000] W [rpcsvc.c:1322:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2023-03-29 10:45:40 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 11.0
/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x25954)[0x7fe2c416b954]
/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x698)[0x7fe2c41752f8]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe2c3f17520]
/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x69)[0x7fe2c418be59]
/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9e)[0x7fe2c411642e]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/protocol/server.so(+0xb0a6)[0x7fe2c01390a6]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/protocol/server.so(+0xb9a4)[0x7fe2c01399a4]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/debug/io-stats.so(+0x1a158)[0x7fe2c01ca158]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/quota.so(+0x12d42)[0x7fe2c01f6d42]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/index.so(+0xab05)[0x7fe2c0218b05]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/barrier.so(+0x7a58)[0x7fe2c022aa58]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/performance/io-threads.so(+0x7801)[0x7fe2c0267801]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0xd9bf)[0x7fe2c027f9bf]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0xdd2b)[0x7fe2c027fd2b]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x12c0c)[0x7fe2c0284c0c]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x2bba)[0x7fe2c0274bba]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/leases.so(+0x2c96)[0x7fe2c0293c96]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/locks.so(+0x12d9f)[0x7fe2c02e5d9f]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev_cbk+0x126)[0x7fe2c41db076]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/changelog.so(+0x845e)[0x7fe2c034e45e]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/storage/posix.so(+0x2f3ab)[0x7fe2c03ef3ab]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/changelog.so(+0x10d7d)[0x7fe2c0356d7d]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/bitrot-stub.so(+0xcd02)[0x7fe2c0334d02]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/locks.so(+0x148d0)[0x7fe2c02e78d0]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/worm.so(+0x5ca7)[0x7fe2c02b9ca7]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/read-only.so(+0x4db6)[0x7fe2c02aedb6]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/leases.so(+0x8f5a)[0x7fe2c0299f5a]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x7533)[0x7fe2c0279533]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev_resume+0x1ee)[0x7fe2c41e31ce]
/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x32b08)[0x7fe2c4178b08]
/lib/x86_64-linux-gnu/libglusterfs.so.0(call_resume+0x6d)[0x7fe2c418579d]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/performance/io-threads.so(+0x6700)[0x7fe2c0266700]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7fe2c3f69b43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7fe2c3ffba00]
---------
[2023-03-29 10:59:32.127900 +0000] I [MSGID: 100030] [glusterfsd.c:2872:main] 0-/usr/sbin/glusterfsd: Started running version [{arg=/usr/sbin/glusterfsd}, {version=11.0}, {cmdlinestr=/usr/sbin/glusterfsd -s cu-glstr-03-cl1 --volfile-id share.cu-glstr-03-cl1.data-glusterfs-share-brick1-brick -p /var/run/gluster/vols/share/cu-glstr-03-cl1-data-glusterfs-share-brick1-brick.pid -S /var/run/gluster/dbbbf2b10a2790dd.socket --brick-name /data/glusterfs/share/brick1/brick -l /var/log/glusterfs/bricks/data-glusterfs-share-brick1-brick.log --xlator-option *-posix.glusterd-uuid=37914111-9b77-4c72-b86d-a158803aa75f --process-name brick --brick-port 59748 --xlator-option share-server.listen-port=59748}]
[2023-03-29 10:59:32.128730 +0000] I [glusterfsd.c:2562:daemonize] 0-glusterfs: Pid of current running process is 9643
[2023-03-29 10:59:32.137424 +0000] I [socket.c:916:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 10
[2023-03-29 10:59:32.138888 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2023-03-29 10:59:32.138967 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2023-03-29 10:59:32.157103 +0000] I [glusterfsd-mgmt.c:2336:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: cu-glstr-01-cl1:24007 cu-glstr-02-cl1:24007
[2023-03-29 10:59:32.164857 +0000] I [rpcsvc.c:2708:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2023-03-29 10:59:32.165277 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-/data/glusterfs/share/brick1/brick: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2023-03-29 10:59:32.166825 +0000] I [trash.c:2443:init] 0-share-trash: no option specified for 'eliminate', using NULL
[2023-03-29 10:59:32.223505 +0000] I [posix-common.c:371:posix_statfs_path] 0-share-posix: Set disk_size_after reserve is 1874321604608
Final graph:
+------------------------------------------------------------------------------+
1: volume share-posix
2: type storage/posix
3: option glusterd-uuid 37914111-9b77-4c72-b86d-a158803aa75f
4: option directory /data/glusterfs/share/brick1/brick
5: option volume-id 08d4902f-5f00-43eb-b068-4e350b67706b
6: option fips-mode-rchecksum on
7: option shared-brick-count 1
8: end-volume
9:
10: volume share-trash
11: type features/trash
12: option trash-dir .trashcan
13: option brick-path /data/glusterfs/share/brick1/brick
14: option trash-internal-op off
15: subvolumes share-posix
16: end-volume
17:
18: volume share-changelog
19: type features/changelog
20: option changelog-brick /data/glusterfs/share/brick1/brick
21: option changelog-dir /data/glusterfs/share/brick1/brick/.glusterfs/changelogs
22: option changelog-notification off
23: option changelog-barrier-timeout 120
24: subvolumes share-trash
25: end-volume
Additional info:
Restarting the glusterd process on the node with the offline brick recovered the situation. It is now healing the missing files.
- The operating system / glusterfs version:
Ubuntu 22.04 LTS updated
Just happened again with the node cu-glstr-02-cl1. Same load (copying via Samba about 30 GB of data in 31k files).
@icolombi Do you have any core dump we can look at?
How can I provide a core dump? Thanks
Maybe you can refer to this article based on Ubuntu 22.04.
If you can find the core files, you can install the debug packages, attach the core files with gdb, and get the backtrace using t a a bt (thread apply all bt). Or, best of all, share the core files and I will take a look at them.
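A minimal sketch of pulling that backtrace non-interactively, assuming the glusterfs debug symbols are installed; the binary and core paths are examples, not taken from this thread:

```shell
# Write a gdb batch file that dumps every thread's full backtrace.
cat > /tmp/gdb-cmds.txt <<'EOF'
set pagination off
thread apply all bt full
quit
EOF
cat /tmp/gdb-cmds.txt

# Against a real core you would then run something like (commented out
# because the binary and core paths depend on your system):
#   gdb -batch -x /tmp/gdb-cmds.txt /usr/sbin/glusterfsd /path/to/core \
#       > brick-backtrace.txt
```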
Thanks @rafikc30, I have the two dump files. Gzipped they are about 85 and 115 MB; how can I share them with you?
@icolombi I think the upload limit for attachments to a GitHub issue is currently 25 MB per file; you may want to consider using a file hosting service, such as Dropbox or Google Drive, and providing a link to the file in the GitHub issue.
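For what it's worth, a sketch of chunking a large archive with coreutils split so each piece fits under such a limit (the 25 MB figure comes from the comment above; file names here are examples, and a small stand-in file is created so the commands are demonstrable):

```shell
# Stand-in for the real core.gz (3 MB of zeros); with a real dump you
# would skip this line.
dd if=/dev/zero of=core.gz bs=1M count=3 2>/dev/null

# Split into numbered chunks (1 MB here for the demo; use e.g. -b 25M
# against the real attachment limit).
split -b 1M -d core.gz core.gz.part-
ls core.gz.part-*

# The receiver reassembles and verifies the pieces:
cat core.gz.part-* > core-rejoined.gz
cmp -s core.gz core-rejoined.gz && echo "reassembly OK"
```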
Thanks. Does the dump include sensitive data?
A core file contains the in-memory data of a process at the moment it received the signal that caused it to crash, such as SIGSEGV. Mostly we are interested in variable values, the state of the transport, reference counts, etc. In general it may contain file names and some metadata, and I think it is also possible for it to include file content if a write or read was happening while the core was generated.
@icolombi I will take a look
I'm experiencing the same issue. I had a cluster that was rock solid on 10.x; since the update to 11.0, one node periodically crashes in a way similar to the one reported here.
Just as info: on my side the last "stable" version is 10.1. Every version after it just crashes after a period of time; something was essentially changed with 10.2.
@rafikc30 Did you find the root cause of this issue? We are facing it too, especially with larger directory trees containing millions of files; there I can reproduce the issue every time if you need additional coredump information. The OS is Debian 11 with GlusterFS version 11. We didn't face these issues in GlusterFS version 10.x or lower. I tried to compile GlusterFS with the "--enable-debug" parameter for more detailed coredump information too; however, with that option enabled the brick unfortunately never crashed. Thanks in advance.
@icolombi Can you please share "thread apply all bt full" output after attach a core with gdb.
Just wanted to add that I'm having the same issue ever since upgrading from Debian Bullseye to Bookworm (glusterfs-server 9.2 -> 10.3). 1 of 4 gluster server processes seems to crash daily, at random, with output similar to:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: pending frames:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: patchset: git://git.gluster.org/glusterfs.git
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: signal received: 11
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: time of crash:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: 2023-10-03 14:10:32 +0000
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: configuration details:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: argp 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: backtrace 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: dlfcn 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: libpthread 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: llistxattr 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: setfsid 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: epoll.h 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: xattr.h 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: st_atim.tv_nsec 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: package-string: glusterfs 10.3
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: ---------
Yeah, we are experiencing this problem too, on a CentOS 9 environment with GlusterFS version 11.1. The brick log on my side is full of:
[2023-12-21 14:14:59.494844 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-avans-posix: gfid:dd768023-41d8-4b18-a26c-782b418818a1/149c5ebf-688b-4543-bcac-6dfb1ab7ebbf: inode path not completely resolved. Asking for full path
Maybe this can be helpful?
On 11.1 and my brick logs are full of the same "inode path not completely resolved. Asking for full path" errors.
Just to add to this: digging a bit more, unless I'm misreading the logs, the brick port information the clients pull appears to be "one version" old. By that I mean: say I have 3 dispersed gluster nodes, each with a brick, and a 2+1 volume mounted on clients:
Node 1 - brick 1 - port 1000
Node 2 - brick 2 - port 1001
Node 3 - brick 3 - port 1002
All clients mount by pointing to node 1.
At some point Node 2 crashes or I restart it. When it comes back up, the brick port changes to 2001. The clients still try to connect to 1001. If I kill the brick manually, restart glusterd, and it comes back online with 3001, the clients now try to connect to 2001. It's like it's advertising the port from the previously killed process for some reason.
Not sure if this matters but I do not have brick multiplexing enabled, nor shared storage, but I do have io_uring enabled.
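A small sketch of checking for that mismatch, assuming a volume named share as in this thread. The awk line is fed saved status output (sample lines adapted from above); in practice you would pipe gluster volume status share into it and compare against what the brick process actually holds:

```shell
# Print host:path and the port glusterd advertises for each brick.
awk '/^Brick / { print $2, "advertised:", $3 }' <<'EOF' > ports.txt
Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick 60553 0 Y 1176
Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick 59748 0 N 9643
EOF
cat ports.txt

# On the brick node, compare against the ports glusterfsd really
# listens on:
#   ss -tlnp | grep glusterfsd
```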
@rafikc30 I'm running into this issue after upgrading from Ubuntu 22.04 with gluster 10.1 to Ubuntu 24.04 with gluster 11.1. I have multiple volumes, but the issue has only been triggered by a volume which backs a minio instance (lots of small file i/o):
Volume Name: minio
Type: Distribute
Volume ID: 1698d653-3c53-4955-b031-656951419885
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: nas-0:/pool-2/vol-0/minio/brick
Options Reconfigured:
diagnostics.brick-log-level: TRACE
performance.io-cache: on
performance.io-cache-size: 1GB
performance.quick-read-cache-timeout: 600
performance.parallel-readdir: on
performance.readdir-ahead: on
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.force-migration: off
performance.client-io-threads: on
cluster.readdir-optimize: on
diagnostics.client-log-level: ERROR
storage.fips-mode-rchecksum: on
transport.address-family: inet
My core dump is 1G due to cache settings and probably contains sensitive data, so I've only attached the brick backtrace and the last 10K lines of a trace-level brick log. Please let me know if there's anything else that would be helpful from the core dump.
I started looking through the 11 commits to inode.c since v10.1. I haven't found anything obvious yet that would cause inode to be null when passed to __inode_unref. Are there any relevant tests for this code?
Could be https://github.com/gluster/glusterfs/commit/da2391dacd3483555e91a33ecdf89948be62b691
@mykaul yep, my core dump is exactly what's described in #4295, with ~5K recursive calls to inode_unref. I'll escalate this to the Ubuntu package maintainers and see if they'll patch it in.
This also affects RHEL 9.4 and is a consistent issue for me; I've gotten to the point where I need a watchdog script in cron watching the process and restarting it when it dies. Let me know if you need any additional troubleshooting info to fix this.
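For anyone stuck on the same workaround, a sketch of such a cron watchdog, using the brick path from this thread as an example (the restart line is intentionally commented out; on a box without gluster this simply prints "brick process missing"):

```shell
#!/bin/sh
# Restart glusterd if no glusterfsd process is serving the brick.
BRICK=/data/glusterfs/share/brick1/brick

# The [g] bracket trick keeps pgrep -f from matching this script's own
# command line when the pattern appears in it.
if pgrep -f "[g]lusterfsd.*${BRICK}" > /dev/null; then
    echo "brick process alive"
else
    echo "brick process missing"
    # systemctl restart glusterd   # enable in a real deployment
fi
```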
@Trae32566 that's unfortunate that this bug also affects RHEL. I don't think there's any troubleshooting left to do, though, as #4302 fixed the issue. No release has the fix yet per https://github.com/gluster/glusterfs/issues/4295#issuecomment-2094665030. It's probably worthwhile adding a comment to that thread mentioning the impact on RHEL. Hopefully someone will cut a new release.
I don't think it's OS specific. It just needs a Glusterfs release. @gluster/gluster-maintainers can do it.
Yep, this is not OS-specific. The stack overflow is a stack overflow on any OS. However since there's no tagged release which contains the fix, every distro is gradually picking up bugged versions.
I am using Gluster version 11.0 on Rocky Linux 8 and frequently experiencing sudden crashes of the same glusterfsd daemon. Currently, I don't have a solution, so I periodically check the daemon with cron and restart it.
When can I expect a fixed update version to be released?
@showinfo yeah I'm not sure what's going on. It seems like Gluster as a project is in a semi-maintained state after being dropped by Red Hat.
If you're feeling enterprising, you can generate a patch and rebuild the rpm with something like this, but for rpm (vs deb). The process for Rocky appears to be documented here.
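A rough sketch of that flow for EL/Rocky. The dnf/rpmbuild lines are the usual source-rebuild commands and are left commented since they need the real SRPM; the patch step itself is shown on a throwaway file, with file and patch names made up for illustration:

```shell
# Usual EL source-rebuild flow (commented; requires the real SRPM):
#   dnf download --source glusterfs
#   rpm -i glusterfs-*.src.rpm
#   # add a PatchN: line for the fix to ~/rpmbuild/SPECS/glusterfs.spec
#   rpmbuild -ba ~/rpmbuild/SPECS/glusterfs.spec

# What applying a patch file itself looks like, on a demo file:
printf 'line one\nline two\n' > demo.c
cat > fix.patch <<'EOF'
--- demo.c
+++ demo.c
@@ -1,2 +1,2 @@
 line one
-line two
+line two fixed
EOF
patch demo.c < fix.patch
grep "line two fixed" demo.c && echo "patch applied"
```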
For reference, 11.1 has been out for a while and did address a brick crash bug. Not sure if this is the same one you are experiencing, though.
This is going to get interesting. Which PR was it? I'm referring to #4302 which is not part of 11.1.