temporal
Support log-less graceful shutdown without "Error looking up host for shardID" errors
Is your feature request related to a problem? Please describe.
Today, shutting down a single-binary server produces log output like this:
2023-05-11T14:45:49.683-0700 ERROR Error looking up host for shardID {"component": "shard-controller", "address": "127.0.0.1:33401", "error": "Not enough hosts to serve the request", "operation-result": "OperationFailed", "shard-id": 1, "logging-call-at": "controller_impl.go:387"}
go.temporal.io/server/common/log.(*zapLogger).Error
/home/runner/go/pkg/mod/go.temporal.io/[email protected]/common/log/zap_logger.go:150
go.temporal.io/server/service/history/shard.(*ControllerImpl).acquireShards.func2
/home/runner/go/pkg/mod/go.temporal.io/[email protected]/service/history/shard/controller_impl.go:387
go.temporal.io/server/service/history/shard.(*ControllerImpl).acquireShards.func3
/home/runner/go/pkg/mod/go.temporal.io/[email protected]/service/history/shard/controller_impl.go:427
2023-05-11T14:45:50.685-0700 WARN Failed to poll for task. {"service": "worker", "Namespace": "temporal-system", "TaskQueue": "temporal-sys-tq-scanner-taskqueue-0", "WorkerID": "431854@monolith@", "WorkerType": "WorkflowWorker", "Error": "error reading from server: EOF", "logging-call-at": "internal_worker_base.go:308"}
2023-05-11T14:45:50.685-0700 WARN Failed to poll for task. {"service": "worker", "Namespace": "temporal-system", "TaskQueue": "temporal-sys-processor-parent-close-policy", "WorkerID": "431854@monolith@", "WorkerType": "WorkflowWorker", "Error": "error reading from server: EOF", "logging-call-at": "internal_worker_base.go:308"}
2023-05-11T14:45:50.685-0700 WARN Failed to poll for task. {"service": "worker", "Namespace": "temporal-system", "TaskQueue": "temporal-sys-history-scanner-taskqueue-0", "WorkerID": "431854@monolith@", "WorkerType": "ActivityWorker", "Error": "error reading from server: EOF", "logging-call-at": "internal_worker_base.go:308"}
2023-05-11T14:45:50.685-0700 WARN Failed to poll for task. {"service": "worker", "Namespace": "temporal-system", "TaskQueue": "temporal-sys-tq-scanner-taskqueue-0", "WorkerID": "431854@monolith@", "WorkerType": "ActivityWorker", "Error": "error reading from server: EOF", "logging-call-at": "internal_worker_base.go:308"}
2023-05-11T14:45:50.685-0700 WARN Failed to poll for task. {"service": "worker", "Namespace": "temporal-system", "TaskQueue": "temporal-sys-batcher-taskqueue", "WorkerID": "431854@monolith@", "WorkerType": "ActivityWorker", "Error": "error reading from server: EOF", "logging-call-at": "internal_worker_base.go:308"}
2023-05-11T14:45:50.685-0700 WARN Failed to poll for task. {"service": "worker", "Namespace": "temporal-system", "TaskQueue": "temporal-sys-history-scanner-taskqueue-0", "WorkerID": "431854@monolith@", "WorkerType": "WorkflowWorker", "Error": "error reading from server: EOF", "logging-call-at": "internal_worker_base.go:308"}
Describe the solution you'd like
Any solution that keeps users from thinking something went wrong when they read the shutdown logs. If this can only be half-done here, with the rest handled at https://github.com/temporalio/cli (shutting down properly, swallowing the errors, or similar), no prob.
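To make the "swallow" option concrete, here is a minimal sketch in Go of one possible approach: a logger wrapper that demotes errors to debug level once shutdown has started. Everything in it is an assumption for illustration. The `Logger` interface, `ShutdownAwareLogger`, `MarkStopping`, and `recordingLogger` are hypothetical names and are not part of the Temporal server or CLI. It also requires Go 1.19 or later for `atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Logger is a hypothetical minimal interface for this sketch; it is not
// Temporal's log.Logger.
type Logger interface {
	Error(msg string)
	Debug(msg string)
}

// ShutdownAwareLogger forwards errors unchanged until MarkStopping is called,
// then reports them at debug level instead.
type ShutdownAwareLogger struct {
	next     Logger
	stopping atomic.Bool
}

// NewShutdownAwareLogger wraps next; the wrapper starts in the running state.
func NewShutdownAwareLogger(next Logger) *ShutdownAwareLogger {
	return &ShutdownAwareLogger{next: next}
}

// MarkStopping records that a graceful shutdown has begun.
func (l *ShutdownAwareLogger) MarkStopping() { l.stopping.Store(true) }

// Error logs at error level while running and at debug level while stopping.
func (l *ShutdownAwareLogger) Error(msg string) {
	if l.stopping.Load() {
		l.next.Debug(msg)
		return
	}
	l.next.Error(msg)
}

// Debug always passes through unchanged.
func (l *ShutdownAwareLogger) Debug(msg string) { l.next.Debug(msg) }

// recordingLogger collects lines in memory so the behavior is easy to inspect.
type recordingLogger struct{ lines []string }

func (r *recordingLogger) Error(msg string) { r.lines = append(r.lines, "ERROR "+msg) }
func (r *recordingLogger) Debug(msg string) { r.lines = append(r.lines, "DEBUG "+msg) }

func main() {
	rec := &recordingLogger{}
	logger := NewShutdownAwareLogger(rec)

	logger.Error("Error looking up host for shardID") // before shutdown: stays ERROR
	logger.MarkStopping()
	logger.Error("Error looking up host for shardID") // during shutdown: becomes DEBUG

	for _, line := range rec.lines {
		fmt.Println(line)
	}
}
```

Demoting rather than dropping keeps the messages available to anyone debugging a shutdown that really did go wrong. A real fix might instead stop the components in an order that avoids the errors in the first place.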
Also, sometimes you get:
2023-10-05T11:58:24.496+0300 ERROR Unable to call matching.PollActivityTaskQueue. {"service": "frontend", "wf-task-queue-name": "/_sys/temporal-sys-history-scanner-taskqueue-0/2", "timeout": "1m9.999694s", "error": "error reading from server: EOF", "logging-call-at": "workflow_handler.go:1133"}
This one comes with a full trace.
And by "full trace", @cretz means a V E R Y full trace 😅