Gluster 11.0 brick crash
Description of problem:
In a 3-replica cluster under heavy write load, one of the bricks goes offline.
Mandatory info:
- The output of the gluster volume info command:
Volume Name: share
Type: Distributed-Replicate
Volume ID: 08d4902f-5f00-43eb-b068-4e350b67706b
Status: Started
Snapshot Count: 3
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick
Brick2: cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick
Brick3: cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.readdir-ahead: on
performance.parallel-readdir: on
performance.write-behind: off
performance.cache-size: 1GB
performance.cache-max-file-size: 1MB
features.barrier: disable
- The output of the gluster volume status command:
gluster v status share
Status of volume: share
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick 58984 0 Y 1183
Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick 60553 0 Y 1176
Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick 59748 0 N 9643
Self-heal Daemon on localhost N/A N/A Y 1219
Self-heal Daemon on cu-glstr-03-cl1 N/A N/A Y 9679
Self-heal Daemon on cu-glstr-02-cl1 N/A N/A Y 1216
- The output of the gluster volume heal command:
gluster v heal share info
Brick cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick
/a7f2g/MessagePreviews/132227_268x321.jpg
/a7f2g/MessagePreviews
/a7f2g/MessagePreviews/132227_90x110.jpg
/a7f2g/MessagePreviews/132228_268x321.jpg
/a7f2g/MessagePreviews/132228_90x110.jpg
/a7f2g/MessagePreviews/132229_268x321.jpg
/a7f2g/MessagePreviews/132229_90x110.jpg
Status: Connected
Number of entries: 7
Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick
/a7f2g/MessagePreviews/132227_268x321.jpg
/a7f2g/MessagePreviews
/a7f2g/MessagePreviews/132227_90x110.jpg
/a7f2g/MessagePreviews/132228_268x321.jpg
/a7f2g/MessagePreviews/132228_90x110.jpg
/a7f2g/MessagePreviews/132229_268x321.jpg
/a7f2g/MessagePreviews/132229_90x110.jpg
Status: Connected
Number of entries: 7
Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick
Status: Connected
Number of entries: 0
- Provide logs present on the following locations of client and server nodes: /var/log/glusterfs/
On the client side I have a lot of:
[2023-03-29 10:46:41.473191 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:712:client4_0_writev_cbk] 0-share-client-2: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473349 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(GETXATTR(18)) called at 2023-03-29 10:45:41 +0000 (xid=0xc98c59)
[2023-03-29 10:46:41.473366 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:925:client4_0_getxattr_cbk] 0-share-client-2: remote operation failed. [{path=/i1b0/images/5}, {gfid=591c4688-0df7-484f-8395-3494fe62a5aa}, {key=glusterfs.get_real_filename:01_foto carbonara_ridotta.jpg}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473420 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-share-client-2: remote operation failed. [{path=/i1b0/images/5}, {gfid=591c4688-0df7-484f-8395-3494fe62a5aa}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473429 +0000] W [MSGID: 114029] [client-rpc-fops_v2.c:2991:client4_0_lookup] 0-share-client-2: failed to send the fop []
[2023-03-29 10:46:41.473505 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:42 +0000 (xid=0xc98c5a)
[2023-03-29 10:46:41.473516 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-share-client-2: remote operation failed. [{path=/}, {gfid=00000000-0000-0000-0000-000000000001}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473649 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:42 +0000 (xid=0xc98c5b)
[2023-03-29 10:46:41.473835 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:44 +0000 (xid=0xc98c5c)
glusterd.log on the node with the failed brick:
[2023-03-29 10:40:40.373980 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:40:45.204118 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:45:40.850214 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:45:46.776439 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:47:51.671709 +0000] I [MSGID: 106143] [glusterd-pmap.c:353:pmap_port_remove] 0-pmap: removing brick (null) on port 53885
[2023-03-29 10:47:51.691544 +0000] I [MSGID: 106005] [glusterd-handler.c:6419:__glusterd_brick_rpc_notify] 0-management: Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick has disconnected from glusterd.
[2023-03-29 10:47:51.692042 +0000] I [MSGID: 106143] [glusterd-pmap.c:353:pmap_port_remove] 0-pmap: removing brick /data/glusterfs/share/brick1/brick on port 53885
[2023-03-29 10:50:41.903172 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:50:47.412646 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:55:43.525353 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
- Is there any crash? Provide the backtrace and coredump
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: pending frames:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: frame : type(1) op(WRITE)
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: patchset: git://git.gluster.org/glusterfs.git
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: signal received: 11
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: time of crash:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: 2023-03-29 10:45:40 +0000
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: configuration details:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: argp 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: backtrace 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: dlfcn 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: libpthread 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: llistxattr 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: setfsid 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: epoll.h 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: xattr.h 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: st_atim.tv_nsec 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: package-string: glusterfs 11.0
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: ---------
The failed brick's log:
[2023-03-29 10:45:38.868149 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-share-posix: <gfid:c71d188b-ce42-4854-a65e-96a060885c29>/1730/TERRA SANTA PARTENZA CONFERMATA_page-0001(0).jpg: inode path not completely resolved. Asking for full path
[2023-03-29 10:45:40.128697 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-share-posix: <gfid:c71d188b-ce42-4854-a65e-96a060885c29>/294/zanzibar300x300.jpg: inode path not completely resolved. Asking for full path
[2023-03-29 10:45:40.857960 +0000] I [addr.c:52:compare_addr_and_update] 0-/data/glusterfs/share/brick1/brick: allowed = "*", received addr = "192.168.56.112"
[2023-03-29 10:45:40.857981 +0000] I [login.c:109:gf_auth] 0-auth/login: allowed user names: ad7fcb45-86cc-451d-96e9-a9a718f2eeea
[2023-03-29 10:45:40.857988 +0000] I [MSGID: 115029] [server-handshake.c:645:server_setvolume] 0-share-server: accepted client from CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0 (version: 11.0) with subvol /data/glusterfs/share/brick1/brick
[2023-03-29 10:45:40.889198 +0000] W [socket.c:751:__socket_rwv] 0-tcp.share-server: readv on 192.168.56.112:49146 failed (No data available)
[2023-03-29 10:45:40.889240 +0000] I [MSGID: 115036] [server.c:494:server_rpc_notify] 0-share-server: disconnecting connection [{client-uid=CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0}]
[2023-03-29 10:45:40.889391 +0000] I [MSGID: 101054] [client_t.c:374:gf_client_unref] 0-share-server: Shutting down connection CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0
[2023-03-29 10:45:40.889396 +0000] I [socket.c:3653:socket_submit_outgoing_msg] 0-tcp.share-server: not connected (priv->connected = -1)
[2023-03-29 10:45:40.889433 +0000] W [rpcsvc.c:1322:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2023-03-29 10:45:40 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 11.0
/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x25954)[0x7fe2c416b954]
/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x698)[0x7fe2c41752f8]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe2c3f17520]
/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x69)[0x7fe2c418be59]
/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9e)[0x7fe2c411642e]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/protocol/server.so(+0xb0a6)[0x7fe2c01390a6]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/protocol/server.so(+0xb9a4)[0x7fe2c01399a4]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/debug/io-stats.so(+0x1a158)[0x7fe2c01ca158]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/quota.so(+0x12d42)[0x7fe2c01f6d42]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/index.so(+0xab05)[0x7fe2c0218b05]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/barrier.so(+0x7a58)[0x7fe2c022aa58]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/performance/io-threads.so(+0x7801)[0x7fe2c0267801]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0xd9bf)[0x7fe2c027f9bf]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0xdd2b)[0x7fe2c027fd2b]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x12c0c)[0x7fe2c0284c0c]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x2bba)[0x7fe2c0274bba]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/leases.so(+0x2c96)[0x7fe2c0293c96]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/locks.so(+0x12d9f)[0x7fe2c02e5d9f]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev_cbk+0x126)[0x7fe2c41db076]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/changelog.so(+0x845e)[0x7fe2c034e45e]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/storage/posix.so(+0x2f3ab)[0x7fe2c03ef3ab]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/changelog.so(+0x10d7d)[0x7fe2c0356d7d]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/bitrot-stub.so(+0xcd02)[0x7fe2c0334d02]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/locks.so(+0x148d0)[0x7fe2c02e78d0]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/worm.so(+0x5ca7)[0x7fe2c02b9ca7]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/read-only.so(+0x4db6)[0x7fe2c02aedb6]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/leases.so(+0x8f5a)[0x7fe2c0299f5a]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x7533)[0x7fe2c0279533]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev_resume+0x1ee)[0x7fe2c41e31ce]
/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x32b08)[0x7fe2c4178b08]
/lib/x86_64-linux-gnu/libglusterfs.so.0(call_resume+0x6d)[0x7fe2c418579d]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/performance/io-threads.so(+0x6700)[0x7fe2c0266700]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7fe2c3f69b43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7fe2c3ffba00]
---------
[2023-03-29 10:59:32.127900 +0000] I [MSGID: 100030] [glusterfsd.c:2872:main] 0-/usr/sbin/glusterfsd: Started running version [{arg=/usr/sbin/glusterfsd}, {version=11.0}, {cmdlinestr=/usr/sbin/glusterfsd -s cu-glstr-03-cl1 --volfile-id share.cu-glstr-03-cl1.data-glusterfs-share-brick1-brick -p /var/run/gluster/vols/share/cu-glstr-03-cl1-data-glusterfs-share-brick1-brick.pid -S /var/run/gluster/dbbbf2b10a2790dd.socket --brick-name /data/glusterfs/share/brick1/brick -l /var/log/glusterfs/bricks/data-glusterfs-share-brick1-brick.log --xlator-option *-posix.glusterd-uuid=37914111-9b77-4c72-b86d-a158803aa75f --process-name brick --brick-port 59748 --xlator-option share-server.listen-port=59748}]
[2023-03-29 10:59:32.128730 +0000] I [glusterfsd.c:2562:daemonize] 0-glusterfs: Pid of current running process is 9643
[2023-03-29 10:59:32.137424 +0000] I [socket.c:916:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 10
[2023-03-29 10:59:32.138888 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2023-03-29 10:59:32.138967 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2023-03-29 10:59:32.157103 +0000] I [glusterfsd-mgmt.c:2336:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: cu-glstr-01-cl1:24007 cu-glstr-02-cl1:24007
[2023-03-29 10:59:32.164857 +0000] I [rpcsvc.c:2708:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2023-03-29 10:59:32.165277 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-/data/glusterfs/share/brick1/brick: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2023-03-29 10:59:32.166825 +0000] I [trash.c:2443:init] 0-share-trash: no option specified for 'eliminate', using NULL
[2023-03-29 10:59:32.223505 +0000] I [posix-common.c:371:posix_statfs_path] 0-share-posix: Set disk_size_after reserve is 1874321604608
Final graph:
+------------------------------------------------------------------------------+
1: volume share-posix
2: type storage/posix
3: option glusterd-uuid 37914111-9b77-4c72-b86d-a158803aa75f
4: option directory /data/glusterfs/share/brick1/brick
5: option volume-id 08d4902f-5f00-43eb-b068-4e350b67706b
6: option fips-mode-rchecksum on
7: option shared-brick-count 1
8: end-volume
9:
10: volume share-trash
11: type features/trash
12: option trash-dir .trashcan
13: option brick-path /data/glusterfs/share/brick1/brick
14: option trash-internal-op off
15: subvolumes share-posix
16: end-volume
17:
18: volume share-changelog
19: type features/changelog
20: option changelog-brick /data/glusterfs/share/brick1/brick
21: option changelog-dir /data/glusterfs/share/brick1/brick/.glusterfs/changelogs
22: option changelog-notification off
23: option changelog-barrier-timeout 120
24: subvolumes share-trash
25: end-volume
Additional info:
Restarting the glusterd process on the node with the offline brick recovered the situation. It is now healing the missing files.
- The operating system / glusterfs version:
Ubuntu 22.04 LTS updated
Just happened again with the node cu-glstr-02-cl1. Same load (copying via Samba about 30 GB of data in 31k files).
@icolombi Do you have any core dump we can look at?
How can I provide a core dump? Thanks
Maybe you can refer to this article based on Ubuntu 22.04.
If you can find the core files, you can install the debug packages, attach the core files with gdb, and get the backtrace using t a a bt (thread apply all bt). Or, best of all, share the core files and I will take a look at them.
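A minimal sketch of pulling that backtrace non-interactively, assuming the glusterfs debug symbols are installed; the binary and core paths are examples, not taken from this thread:

```shell
# Write a gdb batch file that dumps every thread's full backtrace.
cat > /tmp/gdb-cmds.txt <<'EOF'
set pagination off
thread apply all bt full
quit
EOF
cat /tmp/gdb-cmds.txt

# Against a real core you would then run something like (commented out
# because the binary and core paths depend on your system):
#   gdb -batch -x /tmp/gdb-cmds.txt /usr/sbin/glusterfsd /path/to/core \
#       > brick-backtrace.txt
```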
Thanks @rafikc30, I have the two dump files. Gzipped they are about 85 and 115 MB; how can I share them with you?
@icolombi I think the upload limit for attachments to a GitHub issue is currently 25 MB per file; you may want to consider using a file hosting service, such as Dropbox or Google Drive, and providing a link to the file in the GitHub issue.
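For what it's worth, a sketch of chunking a large archive with coreutils split so each piece fits under such a limit (the 25 MB figure comes from the comment above; file names here are examples, and a small stand-in file is created so the commands are demonstrable):

```shell
# Stand-in for the real core.gz (3 MB of zeros); with a real dump you
# would skip this line.
dd if=/dev/zero of=core.gz bs=1M count=3 2>/dev/null

# Split into numbered chunks (1 MB here for the demo; use e.g. -b 25M
# against the real attachment limit).
split -b 1M -d core.gz core.gz.part-
ls core.gz.part-*

# The receiver reassembles and verifies the pieces:
cat core.gz.part-* > core-rejoined.gz
cmp -s core.gz core-rejoined.gz && echo "reassembly OK"
```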
Thanks. Does the dump include sensitive data?
A core file contains the in-memory data of a process at the moment it received the signal that caused it to crash, such as SIGSEGV. Mostly we are interested in variable values, the state of the transport, reference counts, etc. In general it may contain file names and some metadata, and I think it is also possible for it to include file content if a write or read was happening while the core was generated.
@icolombi I will take a look
I'm experiencing the same issue. I had a cluster that was rock solid on 10.x; since the update to 11.0, one node periodically crashes in a way similar to the one reported here.
Just as info: on my side the last "stable" version is 10.1. Every version after it just crashes after a period of time; something was essentially changed with 10.2.
@rafikc30 Did you find the root cause of this issue? We are facing it too, especially with larger directory trees containing millions of files; there I can reproduce the issue every time if you need additional coredump information. The OS is Debian 11 with GlusterFS version 11. We didn't face these issues in GlusterFS version 10.x or lower. I tried to compile GlusterFS with the "--enable-debug" parameter for more detailed coredump information too; however, with that option enabled the brick unfortunately never crashed. Thanks in advance.
@icolombi Can you please share "thread apply all bt full" output after attach a core with gdb.
Just wanted to add that I'm having the same issue ever since upgrading from Debian Bullseye to Bookworm (glusterfs-server 9.2 -> 10.3). 1 of 4 gluster server processes seems to crash daily, at random, with output similar to:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: pending frames:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: patchset: git://git.gluster.org/glusterfs.git
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: signal received: 11
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: time of crash:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: 2023-10-03 14:10:32 +0000
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: configuration details:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: argp 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: backtrace 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: dlfcn 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: libpthread 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: llistxattr 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: setfsid 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: epoll.h 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: xattr.h 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: st_atim.tv_nsec 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: package-string: glusterfs 10.3
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: ---------
Yeah, we are experiencing this problem too, on a CentOS 9 environment with GlusterFS version 11.1. The brick log on my side is full of:
[2023-12-21 14:14:59.494844 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-avans-posix: gfid:dd768023-41d8-4b18-a26c-782b418818a1/149c5ebf-688b-4543-bcac-6dfb1ab7ebbf: inode path not completely resolved. Asking for full path
Maybe this can be helpful?
On 11.1 and my brick logs are full of the same "inode path not completely resolved. Asking for full path" errors.
Just to add to this: digging a bit more, unless I'm misreading the logs, the brick port information the clients pull appears to be "one version" old. By that I mean: say I have 3 dispersed gluster nodes, each with a brick, and a 2+1 volume mounted on clients:
Node 1 - brick 1 - port 1000
Node 2 - brick 2 - port 1001
Node 3 - brick 3 - port 1002
All clients mount by pointing to node 1.
At some point Node 2 crashes or I restart it. When it comes back up, the brick port changes to 2001. The clients still try to connect to 1001. If I kill the brick manually, restart glusterd, and it comes back online with 3001, the clients now try to connect to 2001. It's like it's advertising the port from the previously killed process for some reason.
Not sure if this matters but I do not have brick multiplexing enabled, nor shared storage, but I do have io_uring enabled.
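A small sketch of checking for that mismatch, assuming a volume named share as in this thread. The awk line is fed saved status output (sample lines adapted from above); in practice you would pipe gluster volume status share into it and compare against what the brick process actually holds:

```shell
# Print host:path and the port glusterd advertises for each brick.
awk '/^Brick / { print $2, "advertised:", $3 }' <<'EOF' > ports.txt
Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick 60553 0 Y 1176
Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick 59748 0 N 9643
EOF
cat ports.txt

# On the brick node, compare against the ports glusterfsd really
# listens on:
#   ss -tlnp | grep glusterfsd
```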
@rafikc30 I'm running into this issue after upgrading from Ubuntu 22.04 with gluster 10.1 to Ubuntu 24.04 with gluster 11.1. I have multiple volumes, but the issue has only been triggered by a volume which backs a minio instance (lots of small file i/o):
Volume Name: minio
Type: Distribute
Volume ID: 1698d653-3c53-4955-b031-656951419885
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: nas-0:/pool-2/vol-0/minio/brick
Options Reconfigured:
diagnostics.brick-log-level: TRACE
performance.io-cache: on
performance.io-cache-size: 1GB
performance.quick-read-cache-timeout: 600
performance.parallel-readdir: on
performance.readdir-ahead: on
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.force-migration: off
performance.client-io-threads: on
cluster.readdir-optimize: on
diagnostics.client-log-level: ERROR
storage.fips-mode-rchecksum: on
transport.address-family: inet
My core dump is 1G due to cache settings and probably contains sensitive data, so I've only attached the brick backtrace and the last 10K lines of a trace-level brick log. Please let me know if there's anything else that would be helpful from the core dump.
I started looking through the 11 commits to inode.c since v10.1. I haven't found anything obvious yet that would cause inode to be null when passed to __inode_unref. Are there any relevant tests for this code?
Could be https://github.com/gluster/glusterfs/commit/da2391dacd3483555e91a33ecdf89948be62b691
@mykaul yep, my core dump is exactly what's described in #4295, with ~5K recursive calls to inode_unref. I'll escalate this to the Ubuntu package maintainers and see if they'll patch it in.
This also affects RHEL 9.4 and is a consistent issue for me; I've gotten to the point where I need a watchdog script in cron watching the process and restarting it when it dies. Let me know if you need any additional troubleshooting info to fix this.
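For anyone stuck on the same workaround, a sketch of such a cron watchdog, using the brick path from this thread as an example (the restart line is intentionally commented out; on a box without gluster this simply prints "brick process missing"):

```shell
#!/bin/sh
# Restart glusterd if no glusterfsd process is serving the brick.
BRICK=/data/glusterfs/share/brick1/brick

# The [g] bracket trick keeps pgrep -f from matching this script's own
# command line when the pattern appears in it.
if pgrep -f "[g]lusterfsd.*${BRICK}" > /dev/null; then
    echo "brick process alive"
else
    echo "brick process missing"
    # systemctl restart glusterd   # enable in a real deployment
fi
```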
@Trae32566 that's unfortunate that this bug also affects RHEL. I don't think there's any troubleshooting left to do, though, as #4302 fixed the issue. No release has the fix yet per https://github.com/gluster/glusterfs/issues/4295#issuecomment-2094665030. It's probably worthwhile adding a comment to that thread mentioning the impact on RHEL. Hopefully someone will cut a new release.
I don't think it's OS specific. It just needs a Glusterfs release. @gluster/gluster-maintainers can do it.
Yep, this is not OS-specific. The stack overflow is a stack overflow on any OS. However since there's no tagged release which contains the fix, every distro is gradually picking up bugged versions.
I am using Gluster version 11.0 on Rocky Linux 8 and frequently experiencing sudden crashes of the same glusterfsd daemon. Currently, I don't have a solution, so I periodically check the daemon with cron and restart it.
When can I expect a fixed update version to be released?
@showinfo yeah I'm not sure what's going on. It seems like Gluster as a project is in a semi-maintained state after being dropped by Red Hat.
If you're feeling enterprising, you can generate a patch and rebuild the rpm with something like this, but for rpm (vs deb). The process for Rocky appears to be documented here.
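A rough sketch of that flow for EL/Rocky. The dnf/rpmbuild lines are the usual source-rebuild commands and are left commented since they need the real SRPM; the patch step itself is shown on a throwaway file, with file and patch names made up for illustration:

```shell
# Usual EL source-rebuild flow (commented; requires the real SRPM):
#   dnf download --source glusterfs
#   rpm -i glusterfs-*.src.rpm
#   # add a PatchN: line for the fix to ~/rpmbuild/SPECS/glusterfs.spec
#   rpmbuild -ba ~/rpmbuild/SPECS/glusterfs.spec

# What applying a patch file itself looks like, on a demo file:
printf 'line one\nline two\n' > demo.c
cat > fix.patch <<'EOF'
--- demo.c
+++ demo.c
@@ -1,2 +1,2 @@
 line one
-line two
+line two fixed
EOF
patch demo.c < fix.patch
grep "line two fixed" demo.c && echo "patch applied"
```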
For reference, 11.1 has been out for a while and did address a brick crash bug. Not sure if this is the same one you are experiencing, though.
This is going to get interesting. Which PR was it? I'm referring to #4302 which is not part of 11.1.