
Safely shut down each part

Open bpanneton opened this issue 9 years ago • 12 comments

As the services are brought up, I think we need a way to shut them down individually or a way to shut down all the ones that were successful coming up.

I have run into the case (while testing something else) where Kafka failed to come up properly. It would be helpful to be able to shut down everything that came up before it safely, rather than going to each node individually or running a killall kind of thing.

Perhaps the way to do this would be as they successfully come up, we add the shutdown sequence to a script on the master node. Then we can just call that script. (perhaps just magpie-cleanup and magpie-post-run but run as a new job?)
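The append-as-you-go idea could look something like the sketch below. All names here (the script path and helper functions) are hypothetical, not part of magpie; the `true` commands stand in for real service stop commands.

```shell
#!/bin/bash
# Hypothetical sketch: as each service comes up successfully, append its
# teardown command to a cleanup script on the master node. If startup
# stalls or dies, running this script tears down only what actually started.
CLEANUP_SCRIPT="${TMPDIR:-/tmp}/magpie-emergency-shutdown.sh"

init_cleanup_script() {
    echo "#!/bin/bash" > "$CLEANUP_SCRIPT"
    chmod +x "$CLEANUP_SCRIPT"
}

register_teardown() {
    # $1 = service name, $2 = shutdown command for that service
    echo "echo 'Stopping $1'; $2" >> "$CLEANUP_SCRIPT"
}

init_cleanup_script
register_teardown "zookeeper" "true"   # stand-in for e.g. zkServer.sh stop
register_teardown "hdfs" "true"        # stand-in for e.g. stop-dfs.sh
```

Calling `$CLEANUP_SCRIPT` (by hand, or from a follow-up job) would then replay the teardown steps in the order they were registered.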

bpanneton avatar Jan 21 '16 13:01 bpanneton

Am I not handling this in magpie-run correctly? Variables with "should_be_torndown" are set along the way for this purpose? Perhaps I'm missing something.

chu11 avatar Jan 21 '16 18:01 chu11

I think it is handled fine within magpie-run. However, in the event that something happens during startup that stalls or breaks it, how can we tear it down? Let's say the magpie-run script dies in the middle, before the teardown. Now we don't have an easy way to bring it down safely. I'm mainly worried about corrupting or breaking hdfs/hbase.

bpanneton avatar Jan 21 '16 20:01 bpanneton

Ahhh, I see, you're speaking more of the case that something fails in magpie-run.

At one point in the past I actually tried to split magpie-run into (something like) magpie-run-startup, magpie-run-main, and magpie-run-shutdown. I stopped when I found it cumbersome to try and move variable information between the scripts. That was a long time ago, when I supported far fewer projects than I do now. In addition, if magpie-run-startup fails, we know we can skip whatever is in the magpie-run-main script.

Perhaps it's time to revisit that??

chu11 avatar Jan 21 '16 22:01 chu11

This is probably not the best way, but during setup maybe we can generate a per-node variable list. Then each script can source it when it starts. It would contain only the variables of the services that were started. This would potentially allow us to eliminate the interrupt. For instance, if you run in interactive mode the script does nothing at the end. To shut down, you can call the shutdown script with the per-node variable list. However, this might cause an issue when the resource manager decides to kill the job.
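A minimal sketch of the per-node variable list idea, assuming hypothetical file paths and helper names (nothing here is existing magpie code): startup records which services came up in a per-node file, and any later script sources that file to decide what to tear down.

```shell
#!/bin/bash
# Hypothetical sketch: during startup, each node records which services
# came up in a per-node variable file; the shutdown path sources it and
# tears down only the services that actually started on this node.
NODE_VARS="${TMPDIR:-/tmp}/magpie-node-vars-$(hostname)"

record_service_started() {
    # e.g. record_service_started HADOOP -> HADOOP_STARTED_ON_NODE=1
    echo "${1}_STARTED_ON_NODE=1" >> "$NODE_VARS"
}

: > "$NODE_VARS"                 # start with an empty list
record_service_started HADOOP
record_service_started ZOOKEEPER

# Later, a shutdown script sources the list and acts on what it finds:
. "$NODE_VARS"
if [ "${HADOOP_STARTED_ON_NODE:-0}" = "1" ]; then
    echo "would run the hadoop teardown here"
fi
```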

bpanneton avatar Jan 22 '16 14:01 bpanneton

It might be better to leave it the way it is, but also generate a separate shutdown script with the variables already added.

bpanneton avatar Jan 22 '16 14:01 bpanneton

What do you mean by "each script can source it when it starts"? Which scripts? Perhaps I need some pseudo-code to help understand what you're suggesting.

My initial thought was something like after Hadoop starts up in magpie-run, create a variable that indicates "hadoop started on this node". After magpie-cleanup, a new script could check "was hadoop started on this node"? If yes and hadoop is still running, run the normal hadoop-shutdown code.
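That "was it started, is it still running" check could be sketched roughly as follows. The marker directory, helper names, and the `NameNode` process pattern are illustrative assumptions only.

```shell
#!/bin/bash
# Hypothetical sketch of the marker idea: magpie-run drops a marker file
# when Hadoop starts on this node; a post-cleanup script checks whether
# the marker exists and a matching process is still alive before running
# the normal shutdown code.
MARKER_DIR="${TMPDIR:-/tmp}/magpie-markers"
mkdir -p "$MARKER_DIR"

mark_started() {
    # $1 = service name
    touch "$MARKER_DIR/$1.started"
}

needs_teardown() {
    # $1 = service name, $2 = process pattern to look for
    # true only if the service started here AND a process still matches
    [ -f "$MARKER_DIR/$1.started" ] && pgrep -f "$2" > /dev/null
}

mark_started hadoop
if needs_teardown hadoop "NameNode"; then
    echo "running the normal hadoop shutdown code"
fi
```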

chu11 avatar Jan 22 '16 18:01 chu11

I figured each service would create its own shutdown script. Then magpie would have a shutdown script which would call each of the others. This way each service can set the proper variables within the shutdown script. Your way works as well and might be easier.
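The per-service variant might look like this sketch, where each service writes its own self-contained shutdown script (with its variables baked in at write time) and a master script runs them in reverse startup order. Directory layout, numeric prefixes, and the `true` stand-in commands are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: each service writes its own shutdown script into a
# drop-in directory; a master shutdown script then runs them in reverse
# startup order (highest numeric prefix first).
SHUTDOWN_DIR="${TMPDIR:-/tmp}/magpie-shutdown.d"
mkdir -p "$SHUTDOWN_DIR"

write_service_shutdown() {
    # $1 = order prefix, $2 = service name, $3 = shutdown command
    # Variables expand now, so the generated script is self-contained.
    cat > "$SHUTDOWN_DIR/$1-$2.sh" <<EOF
#!/bin/bash
echo "shutting down $2"
$3
EOF
    chmod +x "$SHUTDOWN_DIR/$1-$2.sh"
}

write_service_shutdown 10 zookeeper true   # stand-in for real stop command
write_service_shutdown 20 hdfs true        # stand-in for real stop command

# Master shutdown script: highest prefix (last started) goes down first.
for s in $(ls "$SHUTDOWN_DIR" | sort -r); do
    "$SHUTDOWN_DIR/$s"
done
```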

bpanneton avatar Jan 26 '16 17:01 bpanneton

An additional thought. I can see the need for this for several services, most notably HDFS, because we want to avoid corruption scenarios. But we probably don't need it for other services. Those processes can be killed by the scheduler/resource manager whenever the job dies.

chu11 avatar Jan 26 '16 18:01 chu11

I'd prefer one for each service since for some reason my LSF doesn't terminate the processes running on the nodes. It just allows the nodes to be reused.

bpanneton avatar Jan 26 '16 19:01 bpanneton

Oh really ... that's interesting. Suddenly I understand your need for this then. In our environment lingering processes of the user will be killed off before the node is re-used.

chu11 avatar Jan 26 '16 19:01 chu11

Brian, would you like to program up a simple prototype first, just for one project? Let's say ... Zookeeper? That's probably the simplest.

chu11 avatar Jan 26 '16 19:01 chu11

I can try. It might be a while until I get some free time to do it.

bpanneton avatar Jan 26 '16 19:01 bpanneton