
Old brick process is still running after volume reset -> stop -> start

Open PrasadDesala opened this issue 5 years ago • 4 comments

Observed behavior

On a brick-mux enabled setup, old brick process is still running after volume reset -> stop -> start.

Expected/desired behavior

Old brick process should not be running.

Details on how to reproduce (minimal and precise)

  1. Create a 3-node GCS system using vagrant.
  2. With brick-mux enabled, create 100 PVCs.
  3. Pick a volume and change one volume option so that brick-mux spawns a new process for that volume: `glustercli volume set pvc-520682df-0e6e-11e9-af0b-525400f94cb8 cluster/replicate.self-heal-daemon off --advanced`
  4. Stop and start that volume. A new brick process is spawned on that node.
  5. Now, reset the volume option and stop/start the volume: `glustercli volume reset pvc-520682df-0e6e-11e9-af0b-525400f94cb8 cluster/replicate.self-heal-daemon`
  6. Run `pgrep glusterfsd`; the old brick process is still running. (The full command sequence is sketched below.)
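
For reference, steps 3–6 condense into the shell sketch below (volume name copied from this report; `glustercli volume stop`/`start` are the usual glusterd2 CLI verbs, their output omitted):

```
VOL=pvc-520682df-0e6e-11e9-af0b-525400f94cb8

# Step 3: change an option so that brick-mux spawns a dedicated process for this volume
glustercli volume set "$VOL" cluster/replicate.self-heal-daemon off --advanced

# Step 4: restart the volume; a new brick process (p2) now serves it
glustercli volume stop "$VOL"
glustercli volume start "$VOL"

# Step 5: reset the option and restart the volume again
glustercli volume reset "$VOL" cluster/replicate.self-heal-daemon
glustercli volume stop "$VOL"
glustercli volume start "$VOL"

# Step 6: the dedicated process should be gone, but it is still listed
pgrep glusterfsd
```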

Information about the environment:

- Glusterd2 version used (e.g. v4.1.0 or master): v6.0-dev.94.git601ba61
- Operating system used: CentOS 7.6
- Glusterd2 compiled from sources, as a package (rpm/deb), or container:
- Using External ETCD: (yes/no, if yes ETCD version): yes; version 3.3.8
- If container, which container image:
- Using kubernetes, openshift, or direct install: Kubernetes
- If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside:

PrasadDesala avatar Jan 02 '19 10:01 PrasadDesala

```
[root@gluster-kube1-0 /]# ps -ef | grep -i glusterfsd
root      9425  11692  0 11:16 pts/4    00:00:00 grep --color=auto -i glusterfsd
root      9469      1  0 10:36 ?        00:00:00 /usr/sbin/glusterfsd --volfile-server gluster-kube1-0.glusterd2.gcs --volfile-server-port 24007 --volfile-id pvc-520682df-0e6e-11e9-af0b-525400f94cb8.4e752b45-aa0a-4784-83f7-6b487e886b4d.var-run-glusterd2-bricks-pvc-520682df-0e6e-11e9-af0b-525400f94cb8-subvol1-brick1-brick -p /var/run/glusterd2/4e752b45-aa0a-4784-83f7-6b487e886b4d-var-run-glusterd2-bricks-pvc-520682df-0e6e-11e9-af0b-525400f94cb8-subvol1-brick1-brick.pid -S /var/run/glusterd2/e7e1ef348943f9ac.socket --brick-name /var/run/glusterd2/bricks/pvc-520682df-0e6e-11e9-af0b-525400f94cb8/subvol1/brick1/brick -l /var/log/glusterd2/glusterfs/bricks/var-run-glusterd2-bricks-pvc-520682df-0e6e-11e9-af0b-525400f94cb8-subvol1-brick1-brick.log --xlator-option *-posix.glusterd-uuid=4e752b45-aa0a-4784-83f7-6b487e886b4d
root     11720      1 10 09:29 ?        00:11:08 /usr/sbin/glusterfsd --volfile-server gluster-kube1-0.glusterd2.gcs --volfile-server-port 24007 --volfile-id pvc-381a5faf-0e6e-11e9-af0b-525400f94cb8.4e752b45-aa0a-4784-83f7-6b487e886b4d.var-run-glusterd2-bricks-pvc-381a5faf-0e6e-11e9-af0b-525400f94cb8-subvol1-brick1-brick -p /var/run/glusterd2/4e752b45-aa0a-4784-83f7-6b487e886b4d-var-run-glusterd2-bricks-pvc-381a5faf-0e6e-11e9-af0b-525400f94cb8-subvol1-brick1-brick.pid -S /var/run/glusterd2/3c5c17b3422e2a07.socket --brick-name /var/run/glusterd2/bricks/pvc-381a5faf-0e6e-11e9-af0b-525400f94cb8/subvol1/brick1/brick -l /var/log/glusterd2/glusterfs/bricks/var-run-glusterd2-bricks-pvc-381a5faf-0e6e-11e9-af0b-525400f94cb8-subvol1-brick1-brick.log --xlator-option *-posix.glusterd-uuid=4e752b45-aa0a-4784-83f7-6b487e886b4d
```
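
Each glusterfsd entry above carries a `--volfile-id` naming the brick it was originally spawned for; with brick multiplexing this id does not reflect bricks attached to the process later. A small sketch (assuming a Linux /proc filesystem) to map the running PIDs to their volfile-ids:

```
# For every running glusterfsd, print its PID and the --volfile-id it was started with
for pid in $(pgrep glusterfsd); do
    printf '%s: %s\n' "$pid" \
        "$(tr '\0' ' ' < /proc/"$pid"/cmdline | grep -o -- '--volfile-id [^ ]*')"
done
```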

statedump_kube-1.txt

kube3-glusterd2.log.gz kube2-glusterd2.log.gz kube1-glusterd2.log.gz

PrasadDesala avatar Jan 02 '19 11:01 PrasadDesala

@PrasadDesala old brick process is serving the other 99 PVCs, isn't it? I fail to understand why this is a bug.

atinmu avatar Jan 02 '19 11:01 atinmu

> @PrasadDesala old brick process is serving the other 99 PVCs, isn't it? I fail to understand why this is a bug.

Initially, brick process p1 is serving all the volumes. Once I changed a volume option on one volume (let's say PVC100) and stopped/started that volume, a new brick process p2 started serving it. At that point p1 is serving 99 PVCs and p2 is serving PVC100, which is working as expected.

Now I have reset the option on PVC100 and stopped/started the volume. I see the p2 process is still running; there is no need for this process to run, since all PVCs now have the same default volume options and p1 is serving all of them.
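
One way to confirm the stale process (a sketch; it assumes `glustercli volume status` in this glusterd2 build reports the PID serving each brick, which the report does not show) is to compare what the daemon reports with the glusterfsd processes that are actually alive:

```
VOL=pvc-520682df-0e6e-11e9-af0b-525400f94cb8

# Which PID does glusterd2 believe serves the brick after the final start?
glustercli volume status "$VOL"

# Which glusterfsd processes are actually running? A second, now-unused
# process lingering here is the behaviour reported in this issue.
pgrep -a glusterfsd
```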

PrasadDesala avatar Jan 02 '19 13:01 PrasadDesala

> Now I have reset the option on PVC100 and stopped/started the volume. I see the p2 process is still running; there is no need for this process to run, since all PVCs now have the same default volume options and p1 is serving all of them.

Hmm, I think this process was registered in the daemon, which is why it still comes up as a separate process. @vpandey-RH is there an easy way to handle this scenario?

In any case, please note that in a GCS environment volume reset isn't an operation we'd recommend users to perform, so the priority of this issue should remain low.

atinmu avatar Jan 02 '19 13:01 atinmu