Restart snapshotter gracefully even if there are running containers

Open luodw opened this issue 4 years ago • 8 comments

Recently I tried containerd's lazy-loading feature. stargz-snapshotter uses a FUSE filesystem to fetch file data on demand: in the FUSE userspace handler, if the file is cached locally it is returned directly; otherwise it is fetched from the Docker registry.
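The cache-then-fetch behavior described above can be sketched as a small read-through cache. This is only an illustration, not stargz-snapshotter's actual code: `readThroughCache`, `fetchFunc`, and the key format are hypothetical names, and the remote fetch is a stub rather than a real registry request.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc stands in for a remote read against the registry (hypothetical).
type fetchFunc func(key string) ([]byte, error)

// readThroughCache returns locally cached data directly and fetches
// anything that is missing, storing it for later reads.
type readThroughCache struct {
	mu    sync.Mutex
	data  map[string][]byte
	fetch fetchFunc
}

func newReadThroughCache(fetch fetchFunc) *readThroughCache {
	return &readThroughCache{data: make(map[string][]byte), fetch: fetch}
}

// Read serves from the local cache when possible, otherwise it calls fetch.
// For simplicity the lock is held during the fetch, which serializes readers.
func (c *readThroughCache) Read(key string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.data[key]; ok {
		return b, nil // cache hit: no remote access
	}
	b, err := c.fetch(key)
	if err != nil {
		return nil, fmt.Errorf("fetch %q: %w", key, err)
	}
	c.data[key] = b
	return b, nil
}

func main() {
	calls := 0
	c := newReadThroughCache(func(key string) ([]byte, error) {
		calls++ // count how often the "registry" is contacted
		return []byte("contents of " + key), nil
	})
	for i := 0; i < 2; i++ {
		b, err := c.Read("usr/bin/go")
		if err != nil {
			fmt.Println("read failed:", err)
			return
		}
		fmt.Printf("%s (remote fetches so far: %d)\n", b, calls)
	}
}
```

In this sketch the second read of the same key is served locally, so the stub fetcher is called only once.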

My question: if stargz-snapshotter restarts abnormally, or is restarted for an upgrade, the FUSE connections break, so reads by container processes fail. Is there a good practice for this?

Sep 22 '20 15:09 luodw

@luodw Thanks for the question! We have graceful shutdown on SIGINT (#26), but recovery from abnormal shutdown and support for service restart are still in progress (#134). Contributions are very welcome.

Sep 23 '20 02:09 ktock

> @luodw Thanks for the question! We have graceful shutdown on SIGINT (#26), but recovery from abnormal shutdown and support for service restart are still in progress (#134). Contributions are very welcome.

Thanks for your reply, I got it.

Sep 23 '20 06:09 luodw

@luodw Can you check whether the master version (which contains patch #134) fixes this issue?

Sep 24 '20 01:09 ktock

> @luodw Can you check whether the master version (which contains patch #134) fixes this issue?

I have tried the latest master branch (which contains patch #134), but when I 'kill -9 ' and restart right away, the container still gets an error.

The following steps reproduce the issue:

  1. ctr-remote images rpull docker.io/stargz/golang:1.12.9-esgz
  2. ctr-remote run --rm -t --snapshotter=stargz docker.io/stargz/golang:1.12.9-esgz test /bin/bash
  3. kill -9 and restart immediately
  4. run some commands in container

Sep 24 '20 08:09 luodw

Currently, you need to re-run the containers too. I agree that the snapshotter needs to be able to restart gracefully even while containers are running.

Sep 25 '20 02:09 ktock

> Currently, you need to re-run the containers too. I agree that the snapshotter needs to be able to restart gracefully even while containers are running.

OK, I also think the ideal behavior is that running containers keep working normally when the snapshotter restarts.

Sep 25 '20 02:09 luodw

@ktock Can you describe what is required to update or restart the snapshotter in a running cluster, for instance? How do you do that today?

Jan 14 '21 20:01 amrmahdi

Currently, we need to kill all containers running on the node before restarting the snapshotter, and re-deploy those containers after the snapshotter restarts.

One idea for solving this issue is to spawn the FUSE server as a separate process instead of running it as a goroutine, as is done today.
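A rough sketch of what "a separate process" could look like in Go on Linux, assuming the program re-executes itself in a server role and starts that child in its own session so it is not tied to the parent's lifetime. This is not the project's implementation: `detachedCommand`, the `SKETCH_ROLE` variable, and the `fuse-server` role name are made up for illustration, and the child only prints a line where a real server would mount and serve the filesystem.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// roleEnv is a hypothetical variable that tells the child which role to run.
const roleEnv = "SKETCH_ROLE"

// detachedCommand builds a command that starts in its own session, so it is
// not part of the parent's session or process group.
func detachedCommand(exe string, args ...string) *exec.Cmd {
	cmd := exec.Command(exe, args...)
	cmd.Env = append(os.Environ(), roleEnv+"=fuse-server")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	return cmd
}

func main() {
	if os.Getenv(roleEnv) == "fuse-server" {
		// A real server would mount and serve the FUSE filesystem here.
		fmt.Println("fuse-server role: pid", os.Getpid())
		return
	}
	exe, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot locate executable:", err)
		os.Exit(1)
	}
	cmd := detachedCommand(exe)
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "cannot start server process:", err)
		os.Exit(1)
	}
	fmt.Println("started server process with pid", cmd.Process.Pid)
	// Deliberately no cmd.Wait(): the parent may exit or restart on its own.
}
```

The point of the design is that a goroutine dies with the snapshotter process, while a separate process holding the FUSE connection can keep serving reads while the snapshotter restarts. How a restarted snapshotter would reconnect to such a process is left out of this sketch.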

Jan 15 '21 00:01 ktock