Debian systemd services don't stop volumes and daemons properly
Description of problem:
The service files provided with the Debian packages don't stop all Gluster processes when the services are stopped. This causes issues with stopping volumes and leaves orphaned processes that have to be killed manually.
I have noticed this for the past several versions, but I don't think it was the case for 5.x.
The exact command to reproduce the issue:
root@server1:/home/user# systemctl | grep gluster
glusterd.service loaded active running GlusterFS, a clustered file-system server
glustereventsd.service loaded active running Gluster Events Notifier
root@server1:/home/user# systemctl status glusterd.service glustereventsd.service
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-11-06 09:06:44 UTC; 1 day 17h ago
Docs: man:glusterd(8)
Process: 1032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1046 (glusterd)
Tasks: 158 (limit: 4661)
Memory: 3.1G
CGroup: /system.slice/glusterd.service
├─1046 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
├─1178 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume2.server1.localdomain.data-brick1-b1 -p /var/run/gluster/vols/volume2/server1.localdomain-data-brick1-b1.pid -S /var/run/gluster/cef39469c59c165a.socket --brick-name /data/brick1/b1 -l /var/log/glusterfs/bricks/data-brick1-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick >
├─1214 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume1.server1.localdomain.data-brick2-b1 -p /var/run/gluster/vols/volume1/server1.localdomain-data-brick2-b1.pid -S /var/run/gluster/64afd89aabbe69d4.socket --brick-name /data/brick2/b1 -l /var/log/glusterfs/bricks/data-brick2-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name >
├─1261 /usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/run/gluster/bitd/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/9bbe88f3027a5730.socket --global-timer-wheel
├─1493 /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/run/gluster/scrub/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/775ff10403118051.socket --global-timer-wheel
└─1609 /usr/sbin/glusterfs -s localhost --volfile-id shd/volume2 -p /var/run/gluster/shd/volume2/volume2-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/143682d2ae48b0c0.socket --xlator-option *replicate*.node-uuid=GUID1 --process-name glustershd --client-pid=-6
Nov 06 09:06:34 server1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 06 09:06:44 server1 systemd[1]: Started GlusterFS, a clustered file-system server.
● glustereventsd.service - Gluster Events Notifier
Loaded: loaded (/lib/systemd/system/glustereventsd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-11-06 09:06:34 UTC; 1 day 17h ago
Docs: man:glustereventsd(8)
Main PID: 1034 (glustereventsd)
Tasks: 4 (limit: 4661)
Memory: 11.8M
CGroup: /system.slice/glustereventsd.service
├─1034 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
└─1692 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
Nov 06 09:06:34 server1 systemd[1]: Started Gluster Events Notifier.
root@server1:/home/user# systemctl stop glusterd.service glustereventsd.service
root@server1:/home/user# systemctl status glusterd.service glustereventsd.service
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sun 2020-11-08 02:47:09 UTC; 5s ago
Docs: man:glusterd(8)
Process: 1032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1046 (code=exited, status=15)
Nov 06 09:06:34 server1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 06 09:06:44 server1 systemd[1]: Started GlusterFS, a clustered file-system server.
Nov 08 02:47:09 server1 systemd[1]: Stopping GlusterFS, a clustered file-system server...
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Succeeded.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1178 (glusterfsd) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1214 (glusterfsd) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1261 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1493 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1609 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: Stopped GlusterFS, a clustered file-system server.
● glustereventsd.service - Gluster Events Notifier
Loaded: loaded (/lib/systemd/system/glustereventsd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sun 2020-11-08 02:47:09 UTC; 6s ago
Docs: man:glustereventsd(8)
Process: 1034 ExecStart=/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid (code=killed, signal=TERM)
Main PID: 1034 (code=killed, signal=TERM)
Nov 06 09:06:34 server1 systemd[1]: Started Gluster Events Notifier.
Nov 08 02:47:09 server1 systemd[1]: Stopping Gluster Events Notifier...
Nov 08 02:47:09 server1 systemd[1]: glustereventsd.service: Succeeded.
Nov 08 02:47:09 server1 systemd[1]: Stopped Gluster Events Notifier.
root@server1:/home/user# ps -Af | grep gluster
root 1178 1 5 Nov06 ? 02:15:15 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume2.server1.localdomain.data-brick1-b1 -p /var/run/gluster/vols/volume2/server1.localdomain-data-brick1-b1.pid -S /var/run/gluster/cef39469c59c165a.socket --brick-name /data/brick1/b1 -l /var/log/glusterfs/bricks/data-brick1-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick --brick-port 49152 --global-threading --xlator-option volume2-server.listen-port=49152
root 1214 1 6 Nov06 ? 02:51:54 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume1.server1.localdomain.data-brick2-b1 -p /var/run/gluster/vols/volume1/server1.localdomain-data-brick2-b1.pid -S /var/run/gluster/64afd89aabbe69d4.socket --brick-name /data/brick2/b1 -l /var/log/glusterfs/bricks/data-brick2-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick --brick-port 49153 --xlator-option volume1-server.listen-port=49153
root 1261 1 0 Nov06 ? 00:17:40 /usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/run/gluster/bitd/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/9bbe88f3027a5730.socket --global-timer-wheel
root 1493 1 0 Nov06 ? 00:00:15 /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/run/gluster/scrub/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/775ff10403118051.socket --global-timer-wheel
root 1609 1 0 Nov06 ? 00:08:57 /usr/sbin/glusterfs -s localhost --volfile-id shd/volume2 -p /var/run/gluster/shd/volume2/volume2-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/143682d2ae48b0c0.socket --xlator-option *replicate*.node-uuid=GUID1 --process-name glustershd --client-pid=-6
root 96093 95946 0 02:47 pts/1 00:00:00 grep gluster
Expected results:
systemctl stop glusterd.service stops all volumes and processes, including the bitrot and self-heal daemons. It could also make sense to run the self-heal and bitrot daemons as separate services, but regardless there should be a way to reliably stop any systemd-started Gluster process via systemctl.
- The operating system / glusterfs version:
Debian 10 buster / Debian 11 bullseye
glusterfs 8.2-1. Also true for 8.0 and 8.1, and I think 7.x. It was not the case for 5.x IIRC.
I've seen something like that.
It looks like the processes are sent SIGKILL first rather than SIGTERM by systemd. Maybe glusterd.service needs some updates.
There is a script, /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh, that can be used to shut down Gluster.
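For example, from a root shell (a sketch; it assumes the script is installed at that path by the Debian package and is meant to terminate the remaining brick, self-heal and bitrot processes):
systemctl stop glusterd.service
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
ps -Af | grep gluster   # verify nothing is left running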
Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.
Still an issue
It is intentional not to stop all the processes when glusterd is stopped. If the bricks are up, already-connected clients/mounts continue to work even if glusterd goes down. Think about restarting glusterd to fix an issue or a memory leak; that doesn't mean all the other services should be stopped.
For now you can use the script that @jronnblom suggested: https://github.com/gluster/glusterfs/issues/1767#issuecomment-731050742
@aravindavk The general contract should be that whatever is brought up by systemctl start is also brought down by systemctl stop.
As nothing is currently implemented for reload in glusterd.service, perhaps the scenario you describe could be addressed by reload rather than restart?
The alternative would be splitting the services, either into a general glusterd-bricks/glusterd-volumes unit or per-brick glusterd-brick@foobar instances (see the sketch below).
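To make the template-unit idea concrete, here is a rough sketch of what such a unit could look like. This is purely hypothetical: neither the unit name glusterd-brick@.service nor this layout exists in the packages, and the ExecStart line reuses only the glusterfsd options visible in the output above, with %i standing in for the brick's volfile-id:
/etc/systemd/system/glusterd-brick@.service:
[Unit]
Description=GlusterFS brick %i (hypothetical example)
After=glusterd.service
BindsTo=glusterd.service

[Service]
Type=forking
PIDFile=/var/run/gluster/%i.pid
ExecStart=/usr/sbin/glusterfsd -s %H --volfile-id %i -p /var/run/gluster/%i.pid
KillMode=control-group

[Install]
WantedBy=multi-user.target
With something like this, systemctl stop glusterd-brick@volume1.server1.localdomain.data-brick2-b1 would take down exactly one brick, and KillMode=control-group would ensure nothing escapes the unit's cgroup.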
Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.
Still an issue
Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there has been no update since my last comment. If this is still a valid issue, feel free to reopen it.
This is still an issue.
When a node is shut down or rebooted, the Gluster volumes with bricks on the affected server all hang for the default 42-second timeout. This is entirely avoidable by killing all Gluster processes properly during shutdown.
I understand the logic behind the glusterd.service restart behaviour, but it gets really annoying when all my VMs go unresponsive because overheating triggered a graceful server shutdown.
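For context, the 42 seconds mentioned above matches the default client-side ping timeout. Assuming a volume named volume1 as in the output earlier, it can be inspected or lowered per volume (this only shortens the hang; it does not fix the shutdown behaviour itself):
gluster volume get volume1 network.ping-timeout
gluster volume set volume1 network.ping-timeout 10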
Still an issue in 9.4.
I have the same issue. On Debian there is no separate glusterfsd service, only glusterd, which spawns multiple processes that are not killed when glusterd is stopped. This is because of KillMode=process in /lib/systemd/system/glusterd.service.
To resolve this issue, I am using the following:
/etc/systemd/system/glusterd.service.d/override.conf:
[Service]
KillMode=control-group
Use systemctl daemon-reload to apply the changes.
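For reference, one way to put that drop-in in place from a root shell and verify the resulting kill mode afterwards (a sketch; systemctl edit glusterd.service achieves the same thing interactively):
mkdir -p /etc/systemd/system/glusterd.service.d
printf '[Service]\nKillMode=control-group\n' > /etc/systemd/system/glusterd.service.d/override.conf
systemctl daemon-reload
systemctl show -p KillMode glusterd.service   # should now report KillMode=control-group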
Remember what man systemd.kill says about this:
Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.
Either KillMode should be set to control-group, or there should be separate glusterfsd services. I remember reading an issue where glusterfs refused to mount because the address (port) was already in use, and a reboot resolved it. I had the same issue, and then I found out it was caused by these glusterfsd processes lingering after a restart of the glusterd service.
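A quick way to confirm that scenario on a node (port 49152 is taken from the brick output earlier in this issue; substitute whichever brick port is in use):
ss -ltnp | grep 49152     # shows which process still holds the brick port
pgrep -af glusterfsd      # lists brick processes lingering after the restart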