glusterd service fails to start after normal reboot
Description of problem: glusterd service fails to start after a normal reboot
The exact command to reproduce the issue: I tried to debug the start sequence using:
/usr/sbin/glusterd --debug
The full output of the command that failed:
[2023-11-20 17:14:21.135785] I [MSGID: 100030] [glusterfsd.c:2865:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 7.2 (args: /usr/sbin/glusterd --debug)
[2023-11-20 17:14:21.135909] I [glusterfsd.c:2593:daemonize] 0-glusterfs: Pid of current running process is 2256
[2023-11-20 17:14:21.135931] D [logging.c:1717:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2023-11-20 17:14:21.136832] D [MSGID: 0] [glusterfsd.c:810:get_volfp] 0-glusterfsd: loading volume file /etc/glusterfs/glusterd.vol
input in flex scanner failed
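The "input in flex scanner failed" message comes from the volfile parser, so one thing worth ruling out (an assumption on my part, not confirmed by these logs) is a corrupted /etc/glusterfs/glusterd.vol, e.g. NUL bytes left behind by an unclean shutdown. A minimal check, using the path from the log above:

# Should report ASCII text; "data" suggests binary corruption
file /etc/glusterfs/glusterd.vol
# Count non-printable bytes; a healthy volfile should report 0
tr -d '[:print:]\n\t' < /etc/glusterfs/glusterd.vol | wc -c

If the file turns out to be damaged, it can be restored from the glusterfs-server package or copied from a healthy peer.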
Expected results: glusterd starts normally after the reboot.
Mandatory info:
- The output of the gluster volume info command:
can't run the command without the glusterd service running
- The output of the gluster volume status command:
can't run the command without the glusterd service running
- The output of the gluster volume heal command:
can't run the command without the glusterd service running
- Provide logs present on the following locations of client and server nodes: /var/log/glusterfs/
[2023-11-20 17:09:26.357672] I [MSGID: 106513] [glusterd-store.c:2256:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 70200
[2023-11-20 17:09:26.382993] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: tier-enabled
[2023-11-20 17:09:26.383415] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: brick-0
[2023-11-20 17:09:26.383499] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: brick-1
[2023-11-20 17:09:26.383571] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: brick-2
[2023-11-20 17:09:26.435784] I [MSGID: 106544] [glusterd.c:151:glusterd_uuid_init] 0-management: retrieved UUID: 22581674-8d3a-4d23-8d79-cfdf7ad8e663
[2023-11-20 17:09:26.785273] I [MSGID: 106498] [glusterd-handler.c:3518:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2023-11-20 17:09:26.785602] W [MSGID: 106061] [glusterd-handler.c:3314:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2023-11-20 17:09:26.785687] I [rpc-clnt.c:1013:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2023-11-20 17:09:26.806654] E [MSGID: 106187] [glusterd-store.c:4934:glusterd_resolve_all_bricks] 0-glusterd: Failed to resolve brick /mnt/raid-1/gluster-brick with host pi4-m4 of volume gluster-vol in restore
[2023-11-20 17:09:26.806771] E [MSGID: 101019] [xlator.c:626:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2023-11-20 17:09:26.806808] E [MSGID: 101066] [graph.c:424:glusterfs_graph_init] 0-management: initializing translator failed
[2023-11-20 17:09:26.806837] E [MSGID: 101176] [graph.c:778:glusterfs_graph_activate] 0-graph: init failed
[2023-11-20 17:09:26.810110] W [glusterfsd.c:1595:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xd4) [0xaaaad9be6444] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x218) [0xaaaad9be6358] -->>
[2023-11-20 17:09:26.810272] W [mgmt-pmap.c:132:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload
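Note: the fatal line here is glusterd_resolve_all_bricks failing for host pi4-m4. As far as I understand it, brick resolution during restore matches each brick's host against the local node and the peers recorded under /var/lib/glusterd/peers, so a successful ping does not rule it out. A few quick checks (a sketch; hostname and brick path taken from the log above):

# Does the brick host resolve with the system resolver? (glusterd not required)
getent ahosts pi4-m4
# Which hostnames did glusterd restore for its peers?
grep -H hostname /var/lib/glusterd/peers/*
# Is the brick path present and mounted?
ls -ld /mnt/raid-1/gluster-brick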
- Is there any crash? Provide the backtrace and coredump
n/a
Additional info:
It appears from the log that the peer pi4-m4 is not accessible. I pinged all the peers and they respond without delay, so I am not sure why glusterd could not resolve the peer's brick...
I also ran a disk check on all volumes to make sure any corrupted config files were recovered.
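Since pinging works once the system is fully up, one possibility (an assumption, not confirmed by these logs) is a boot-time race: glusterd starting before name resolution for pi4-m4 is available. A common mitigation is to order glusterd after network-online.target via a systemd drop-in:

sudo systemctl edit glusterd
# then add in the override file:
# [Unit]
# Wants=network-online.target
# After=network-online.target

Pinning pi4-m4 in /etc/hosts on each node would also take DNS out of the picture.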
- The operating system / glusterfs version:
Linux pi4-m4 5.4.0-1097-raspi #109-Ubuntu SMP PREEMPT Wed Oct 11 16:15:36 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
glusterfs 7.2
Repository revision: git://git.gluster.org/glusterfs.git
Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration
We are seeing this issue in our case as well. We have one volume in a 3-brick, 3-replica setup across 3 nodes. On one node we restarted all gluster services, and after that glusterd was continuously crashing with: 0-management: Initialization of volume 'management' failed, review your volfile again.
Logs from GlusterD:
[2023-10-23 04:35:15.477485] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f50547c96db] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xfd) [0x55de29541b4d] -->/usr/sbin/glusterd(cleanup_and_exit+0x54) [0x55de29541994] ) 0-: received signum (15), shutting down
[2023-10-23 04:38:56.022520] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.10 (args: /usr/sbin/glusterd -N -p /var/run/glusterd.pid)
[2023-10-23 04:38:56.022658] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 88
[2023-10-23 04:38:56.071590] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536
[2023-10-23 04:38:56.071663] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd/ as working directory
[2023-10-23 04:38:56.071704] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory
[2023-10-23 04:38:56.073867] I [socket.c:1022:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2023-10-23 04:38:56.097616] I [socket.c:965:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 11
[2023-10-23 04:38:56.097971] I [MSGID: 106059] [glusterd.c:1860:init] 0-management: base-port override: 49152
[2023-10-23 04:38:56.097982] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 49152
[2023-10-23 04:39:04.956934] I [MSGID: 106513] [glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 60000
[2023-10-23 04:39:34.308779] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: ddc1d9f0-a6e5-4751-8e35-0f91d326fa41
[2023-10-23 04:39:34.912425] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2023-10-23 04:39:34.912618] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2023-10-23 04:39:34.912669] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2023-10-23 04:39:34.912720] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2023-10-23 04:39:34.916260] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option max-port 49152
  9:     option base-port 49152
 10:     option transport.address-family inet
 11:     option transport.socket.listen-port 24007
 12:     option event-threads 1
 13:     option ping-timeout 0
 14:     option transport.socket.read-fail-log off
 15:     option transport.socket.keepalive-interval 2
 16:     option transport.socket.keepalive-time 10
 17:     option glusterd-sockfile /etc/glusterd_socket/gluster.sock
 18:     option transport-type socket
 19:     option working-directory /var/lib/glusterd/
 20: end-volume
 21:
+------------------------------------------------------------------------------+
[2023-10-23 04:39:34.916252] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2023-10-23 04:39:35.015736] I [MSGID: 101190] [event-epoll.c:688:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2023-10-23 04:39:35.017093] I [MSGID: 106487] [glusterd-handler.c:1516:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-10-23 04:39:35.017342] I [MSGID: 106487] [glusterd-handler.c:1516:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-10-23 04:39:35.021484] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 60000
[2023-10-23 04:39:35.022776] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 0c333528-b176-4229-8df5-230844b7ee6f, host: <<Brick-2 hostname>>, port: 0
[2023-10-23 04:39:35.562534] I [glusterd-utils.c:6314:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/ndp_brick
[2023-10-23 04:39:36.599410] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2023-10-23 04:40:04.394492] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2023-10-23 04:40:04.394624] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2023-10-23 04:40:04.394652] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped
[2023-10-23 04:40:04.394972] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2023-10-23 04:40:04.395758] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: glustershd already stopped
[2023-10-23 04:40:04.395783] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped
[2023-10-23 04:40:04.395806] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service
[2023-10-23 04:40:05.400050] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2023-10-23 04:40:05.400307] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped
[2023-10-23 04:40:05.400340] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: quotad service is stopped
[2023-10-23 04:40:05.400376] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2023-10-23 04:40:05.400515] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2023-10-23 04:40:05.400542] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped
[2023-10-23 04:40:05.400571] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2023-10-23 04:40:05.400693] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped
[2023-10-23 04:40:05.400711] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped
[2023-10-23 04:40:05.400754] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2023-10-23 04:40:05.400888] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2023-10-23 04:40:05.401121] I [MSGID: 106492] [glusterd-handler.c:2796:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 0c333528-b176-4229-8df5-230844b7ee6f
[2023-10-23 04:40:23.695495] I [MSGID: 106502] [glusterd-handler.c:2837:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2023-10-23 04:40:23.695661] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 0c333528-b176-4229-8df5-230844b7ee6f
[2023-10-23 04:40:23.714829] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: c03c5b65-fe44-4342-bacd-76947c016646, host: <<Brick-0 hostname>>, port: 0
[2023-10-23 04:40:25.356022] I [MSGID: 106490] [glusterd-handler.c:2611:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 0c333528-b176-4229-8df5-230844b7ee6f
[2023-10-23 04:40:25.356665] I [MSGID: 106009] [glusterd-utils.c:3466:glusterd_compare_friend_volume] 0-management: Version of volume ndp_vol differ. local version = 7370, remote version = 7382 on peer <<Brick-2 hostname>>
[2023-10-23 04:40:25.357955] I [MSGID: 106493] [glusterd-handler.c:3883:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to <<Brick-2 hostname>> (0), ret: 0, op_ret: 0
[2023-10-23 04:40:26.595064] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f1e9eb816db] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xfd) [0x55d33e735b4d] -->/usr/sbin/glusterd(cleanup_and_exit+0x54) [0x55d33e735994] ) 0-: received signum (15), shutting down
[2023-10-23 04:40:30.011947] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.10 (args: /usr/sbin/glusterd -N -p /var/run/glusterd.pid)
[2023-10-23 04:40:30.012251] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 77
[2023-10-23 04:40:30.096321] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536
[2023-10-23 04:40:30.096375] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd/ as working directory
[2023-10-23 04:40:30.096392] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory
[2023-10-23 04:40:30.099942] I [socket.c:1022:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2023-10-23 04:40:30.101375] I [socket.c:965:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 11
[2023-10-23 04:40:30.101751] I [MSGID: 106059] [glusterd.c:1860:init] 0-management: base-port override: 49152
[2023-10-23 04:40:30.101766] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 49152
[2023-10-23 04:40:33.389696] I [MSGID: 106513] [glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 60000
[2023-10-23 04:40:33.390046] E [MSGID: 106206] [glusterd-store.c:3139:glusterd_store_update_volinfo] 0-management: Failed to get next store iter
[2023-10-23 04:40:33.390062] E [MSGID: 106207] [glusterd-store.c:3404:glusterd_store_retrieve_volume] 0-management: Failed to update volinfo for ndp_vol volume
[2023-10-23 04:40:33.390092] E [MSGID: 106201] [glusterd-store.c:3641:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: ndp_vol
[2023-10-23 04:40:33.390119] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2023-10-23 04:40:33.390128] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-management: initializing translator failed
[2023-10-23 04:40:33.390134] E [MSGID: 101176] [graph.c:725:glusterfs_graph_activate] 0-graph: init failed
[2023-10-23 04:40:33.390230] W [glusterfsd.c:1570:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xf7) [0x561488e5cc27] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x22f) [0x561488e5cacf] -->/usr/sbin/glusterd(cleanup_and_exit+0x54) [0x561488e58994] ) 0-: received signum (-1), shutting down
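In this run the failure point differs from the original report: "Failed to get next store iter" in glusterd_store_update_volinfo points at glusterd's on-disk store rather than the volfile parser, typically an unreadable or truncated file under /var/lib/glusterd/vols/ndp_vol/. A quick way to check (a sketch; volume name taken from the log above):

# Zero-length files in the store are a common artifact of an unclean restart
find /var/lib/glusterd -type f -size 0
# Inspect the volume's restored metadata directly
ls -l /var/lib/glusterd/vols/ndp_vol/
cat /var/lib/glusterd/vols/ndp_vol/info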
FYI: We are running GlusterFS-6.10
Some more info: when we looked directly at the GlusterFS configuration on disk, we saw that some files are missing. /var/lib/glusterd/vols/ndp_vol/ndp_vol.<<brick-hostnames*>>.mnt-bricks-ndp_brick.vol went missing on the node where glusterd is crashing.
@git2212 - Just a suggestion: could you also check whether these config files are missing for you?
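If files under /var/lib/glusterd/vols/<volname>/ are missing or truncated, one recovery path (a sketch, assuming at least one healthy peer; <healthy-peer> is a placeholder hostname) is to copy that volume's directory from a good node while glusterd is stopped:

systemctl stop glusterd
rsync -av <healthy-peer>:/var/lib/glusterd/vols/ndp_vol/ /var/lib/glusterd/vols/ndp_vol/
systemctl start glusterd

If only the generated brick volfiles are missing and the info file is intact, they can also, to my knowledge, be regenerated with glusterd --xlator-option *.upgrade=on -N, the same mechanism package upgrades use.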