
glusterd: avoid starting the same brick twice

Open · xhernandez opened this pull request 1 year ago · 6 comments

There was a race in the glusterd code that could cause two threads to start the same brick at the same time. One of the brick processes will fail because it detects the other one already running. Depending on which one fails, glusterd may report a start failure and mark the brick as stopped even though it is actually running.

The problem is caused by an attempt to connect to a brick that is still being started by another thread. Until the brick is fully initialized it refuses all connection attempts, so glusterd receives a disconnection notification that forcibly marks the brick as stopped.

If another attempt to start the same brick happens at this point, glusterd believes the brick is stopped and starts it again. When this happens very soon after the first attempt, the checks that should detect an already running brick still fail, so a second brick process is spawned. One of the two processes fails to initialize and reports an error; if that failure is the one glusterd processes last, the brick ends up marked as stopped even though a working brick process is running.
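To make the intended serialization concrete, below is a minimal standalone C sketch (hypothetical names such as brick_t, brick_begin_start and brick_on_disconnect; this is not the glusterd code or its API). It illustrates the kind of guard that avoids the double start: the brick's state is inspected and switched to "starting" under the same lock, so a racing thread backs off instead of spawning a second process, and a disconnect seen while the brick is still initializing does not flip it back to "stopped".

/* Minimal standalone sketch (hypothetical names, not the glusterd API):
 * serialize brick start attempts so two threads can never spawn the
 * same brick process twice, and ignore disconnects that arrive while
 * the brick is still starting. */
#include <pthread.h>
#include <stdio.h>

typedef enum {
    BRICK_STOPPED,
    BRICK_STARTING,
    BRICK_STARTED,
} brick_status_t;

typedef struct {
    pthread_mutex_t lock;
    brick_status_t status;
    const char *path;
} brick_t;

/* Returns 0 if this thread is the one that must spawn the brick process,
 * -1 if another thread has already started (or is starting) it. */
static int
brick_begin_start(brick_t *brick)
{
    int ret = -1;

    pthread_mutex_lock(&brick->lock);
    if (brick->status == BRICK_STOPPED) {
        /* Claim the start while still holding the lock, so a racing
         * thread sees BRICK_STARTING and does not start it again. */
        brick->status = BRICK_STARTING;
        ret = 0;
    }
    pthread_mutex_unlock(&brick->lock);

    return ret;
}

static void
brick_start_done(brick_t *brick, int success)
{
    pthread_mutex_lock(&brick->lock);
    brick->status = success ? BRICK_STARTED : BRICK_STOPPED;
    pthread_mutex_unlock(&brick->lock);
}

/* Disconnect notification: only a brick that completed its start is
 * marked as stopped; a refused connection during startup is not taken
 * as proof that the brick is down. */
static void
brick_on_disconnect(brick_t *brick)
{
    pthread_mutex_lock(&brick->lock);
    if (brick->status == BRICK_STARTED)
        brick->status = BRICK_STOPPED;
    pthread_mutex_unlock(&brick->lock);
}

int
main(void)
{
    brick_t brick = {PTHREAD_MUTEX_INITIALIZER, BRICK_STOPPED, "/bricks/b1"};

    if (brick_begin_start(&brick) == 0) {
        /* ... fork/exec the brick process here ... */
        brick_start_done(&brick, 1);
        printf("started brick %s\n", brick.path);
    } else {
        printf("brick %s already starting/started\n", brick.path);
    }

    /* A stray disconnect while the brick is still starting is ignored. */
    brick_on_disconnect(&brick);
    return 0;
}

The real fix has to apply the same idea to glusterd's own brick state handling; the sketch only illustrates the ordering, not the actual data structures involved.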

Fixes: #4080

xhernandez · Mar 29 '23 18:03

/run regression

xhernandez · Mar 29 '23 18:03

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index 91606dd40..a864a6cd2 100644
--- a/xlators/mgmt/glusterd/src/glusterd-handler.c
+++ b/xlators/mgmt/glusterd/src/glusterd-handler.c
@@ -6375,8 +6375,8 @@ __glusterd_brick_rpc_notify(struct rpc_clnt *rpc, void *mydata,
             }
 
             gf_msg(this->name, GF_LOG_INFO, 0, GD_MSG_BRICK_DISCONNECTED,
-                    "Brick %s:%s has disconnected from glusterd.",
-                    brickinfo->hostname, brickinfo->path);
+                   "Brick %s:%s has disconnected from glusterd.",
+                   brickinfo->hostname, brickinfo->path);
 
             ret = get_volinfo_from_brickid(brickid, &volinfo);
             if (ret) {
@@ -6385,8 +6385,7 @@ __glusterd_brick_rpc_notify(struct rpc_clnt *rpc, void *mydata,
                 goto out;
             }
             gf_event(EVENT_BRICK_DISCONNECTED, "peer=%s;volume=%s;brick=%s",
-                     brickinfo->hostname, volinfo->volname,
-                     brickinfo->path);
+                     brickinfo->hostname, volinfo->volname, brickinfo->path);
             /* In case of an abrupt shutdown of a brick PMAP_SIGNOUT
              * event is not received by glusterd which can lead to a
              * stale port entry in glusterd, so forcibly clean up
@@ -6405,7 +6404,8 @@ __glusterd_brick_rpc_notify(struct rpc_clnt *rpc, void *mydata,
                     gf_msg(this->name, GF_LOG_WARNING,
                            GD_MSG_PMAP_REGISTRY_REMOVE_FAIL, 0,
                            "Failed to remove pmap registry for port %d for "
-                           "brick %s", brickinfo->port, brickinfo->path);
+                           "brick %s",
+                           brickinfo->port, brickinfo->path);
                     ret = 0;
                 }
             }

gluster-ant · Mar 29 '23 18:03

/run regression

xhernandez · Mar 29 '23 21:03

1 test(s) failed
./tests/bugs/rpc/bug-884452.t

0 test(s) generated core

5 test(s) needed retry
./tests/000-flaky/basic_afr_split-brain-favorite-child-policy.t
./tests/000-flaky/features_copy-file-range.t
./tests/000-flaky/glusterd-restart-shd-mux.t
./tests/bugs/replicate/bug-1655050-dir-sbrain-size-policy.t
./tests/bugs/rpc/bug-884452.t

1 flaky test(s) marked as success even though they failed
./tests/000-flaky/features_copy-file-range.t

https://build.gluster.org/job/gh_centos7-regression/3248/

gluster-ant · Mar 30 '23 01:03

/run s390-regression

xhernandez · Apr 13 '23 11:04

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] · Dec 15 '23 08:12