clair
clair copied to clipboard
WIP: fix graceful shutdown, add SIGTERM handler
An attempt to address issue https://github.com/quay/clair/issues/1946
👍 Adds signal handling for SIGTERM 👍 Allows graceful shutdown to complete (instead of HTTP requests in-flight being terminated suddenly)
👎 doesn't allow time for in-flight requests to finish. The context is cancelled for each in-flight request as a side-effect of the way servers shutdown and a parent context is cancelled. I can't see a way to get this to work as I'd like, so opening this draft PR for suggestions or discussion.
I believe the code flow is now:
- signal received
- signal handling runs
down.Shutdown(tctx), causing each server to stop listening for new requests, and immediately returnhttp.ErrServerClosedfrom theirServe()orListenAndServe()calls - so
srvs.Wait()immediately completes for the server ErrorGroup, cancelling it's contextsrvctx srvctxis a parent of each http request's context, so each in-flight request begins context-cancellation processing- process exits when all in-flight requests end (which is unnecessarily quickly, as they're all cancelled instead of getting a chance to complete)
I tried not having all request contexts children of srvctx, but got lots of context-related warnings logged from pkg/ctxlock/ctxlock.go during graceful shutdown
In practice this changes the behaviour seen when clair does a graceful shutdown -
From:
- in-flight requests are terminated suddenly, with no content (or perhaps partial content if sending of the body had started)
- clair graceful shutdown code doesn't properly run to completion
To:
- in-flight requests have their context cancelled, processing of that has 10sec to complete. Probably a http error response and body is received
- clair graceful shutdown code runs to completion (but causes in-flight requests to be context-cancelled rather than allowing a grace period to finish successfully)
Can one of the admins verify this patch?
Example running this code
- run a clair:
go run ./cmd/clair/ -conf ./local-dev/clair/config-frostmar-local.yaml -mode indexer - start a longer-running request:
curl -H "Content-Type: application/json" http://localhost:6060/indexer/api/v1/index_report -d ...etc... - CTRL+C (SIGINT) clair
- request response is a cancellation, (instead of terminated with no body)
{"code":"internal-error","message":"failed to start scan: failed to fetch layers: encountered error while fetching a layer: context canceled"}
Clair logs:
2:59PM INF index request start component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM DBG locking attempt component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM DBG locking OK component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM INF starting scan component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM INF manifest to be scanned component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=CheckManifest
2:59PM INF layers fetch start component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM DBG fetching layers component=indexer/controller/Controller.Index count=1 manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM DBG layer fetch start arena=/var/folders/fh/03vck1t10pb340wx73411k7w0000gn/T/ component=libindex/fetchArena.realizeLayer layer=sha256:60e3fbbba55b85dbc675e512ce2c3c8dd658550cb516ac64cdc74b75f90805cd manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers uri=http://localhost:8087/blobs/sha256:60e3fbbba55b85dbc675e512ce2c3c8dd658550cb516ac64cdc74b75f90805cd
^C
2:59PM INF Received signal 'interrupt', gracefully shutting down component=main
2:59PM WRN layers fetch failure error="encountered error while fetching a layer: context canceled" component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM INF layers fetch done component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM INF index request done component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM DBG http error response error="failed to start scan: failed to fetch layers: encountered error while fetching a layer: context canceled" code=500 component=httptransport/New request_id=ea8395ae99607f8b
2:59PM INF handled HTTP request component=httptransport/New duration=3817.390542 method=POST remote_addr=[::1]:60510 request_id=ea8395ae99607f8b request_uri=/indexer/api/v1/index_report status=500
2:59PM INF successfully shut down servers component=main
Should be fixed via #2015