nomad icon indicating copy to clipboard operation
nomad copied to clipboard

Nomad autoscaler deployment loses canary status and gets stuck

Open sluebbert opened this issue 3 months ago • 2 comments

Nomad version

1.7.3

Operating system and Environment details

Ubuntu 22.04.2 LTS

Issue

From time to time auto scaling results in a deployment that partially fails. The deployment never completes and gets hung in a state that seems corrupt.

In the example below, the autoscaler is dropping an instance count from 5 to 2. We came across this instance an hour after it started. Given that the deployment is auto promoting and auto reverting, and that there is a tight progress deadline of 31 seconds, I don't understand why the deployment is still running and stuck hours later. The deployment starts at 08:54:13 but with events and deadlines smeared across about 4 minutes also doesn't make sense to me. I'm not sure what it was doing.

The UI overview tab shows a summary that does not include canary information: image

The UI deployment tab does show canary information: Deployments

The chart showing auto scaling behavior: image

The change recorded for the deployment: image

A request to the API to get the deployment returns:

{
        "ID": "fdf207df-adb5-ec23-52ce-fbada3fe495a",
        "Namespace": "default",
        "JobID": "invoice",
        "JobVersion": 12847,
        "JobModifyIndex": 23689764,
        "JobSpecModifyIndex": 23689764,
        "JobCreateIndex": 6028898,
        "IsMultiregion": false,
        "TaskGroups": {
            "app": {
                "AutoRevert": true,
                "AutoPromote": true,
                "ProgressDeadline": 31000000000,
                "RequireProgressBy": "2024-04-04T08:58:57.851083477-05:00",
                "Promoted": false,
                "PlacedCanaries": [
                    "638d875b-b842-10cb-34b2-3439e9753929"
                ],
                "DesiredCanaries": 1,
                "DesiredTotal": 2,
                "PlacedAllocs": 3,
                "HealthyAllocs": 2,
                "UnhealthyAllocs": 1
            }
        },
        "Status": "running",
        "StatusDescription": "Deployment is running pending automatic promotion",
        "EvalPriority": 50,
        "CreateIndex": 23689767,
        "ModifyIndex": 23693324
    }

A request to the API to get the canary allocation returns:

{
        "ID": "638d875b-b842-10cb-34b2-3439e9753929",
        "EvalID": "41fff724-6644-dfb8-5fce-43fc3b04c4c2",
        "Name": "invoice.app[0]",
        "Namespace": "default",
        "NodeID": "7098fdeb-2541-4122-eee4-ae393c4b5d38",
        "NodeName": "p-mwsv-nc04",
        "JobID": "invoice",
        "JobType": "service",
        "JobVersion": 12847,
        "TaskGroup": "app",
        "AllocatedResources": null,
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "Tasks are running",
        "DesiredTransition": {
            "Migrate": null,
            "Reschedule": true,
            "ForceReschedule": null,
            "NoShutdownDelay": null
        },
        "TaskStates": {
            "sidecar-check": {
                "State": "dead",
                "Failed": false,
                "Restarts": 0,
                "LastRestart": null,
                "StartedAt": "2024-04-04T13:55:20.546726093Z",
                "FinishedAt": "2024-04-04T13:55:53.799774611Z",
                "Events": [
                    {
                        "Type": "Received",
                        "Time": 1712238853931674722,
                        "Message": "",
                        "DisplayMessage": "Task received by client",
                        "Details": {},
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Task Setup",
                        "Time": 1712238920329798253,
                        "Message": "Building Task Directory",
                        "DisplayMessage": "Building Task Directory",
                        "Details": {
                            "message": "Building Task Directory"
                        },
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Started",
                        "Time": 1712238920546681038,
                        "Message": "",
                        "DisplayMessage": "Task started by client",
                        "Details": {},
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Terminated",
                        "Time": 1712238953778944828,
                        "Message": "",
                        "DisplayMessage": "Exit Code: 0",
                        "Details": {
                            "signal": "0",
                            "oom_killed": "false",
                            "exit_code": "0"
                        },
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    }
                ],
                "TaskHandle": null
            },
            "service-invoice": {
                "State": "running",
                "Failed": false,
                "Restarts": 0,
                "LastRestart": null,
                "StartedAt": "2024-04-04T13:55:54.06487478Z",
                "FinishedAt": null,
                "Events": [
                    {
                        "Type": "Received",
                        "Time": 1712238853932189345,
                        "Message": "",
                        "DisplayMessage": "Task received by client",
                        "Details": {},
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Task Setup",
                        "Time": 1712238953802419893,
                        "Message": "Building Task Directory",
                        "DisplayMessage": "Building Task Directory",
                        "Details": {
                            "message": "Building Task Directory"
                        },
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Started",
                        "Time": 1712238954064811992,
                        "Message": "",
                        "DisplayMessage": "Task started by client",
                        "Details": {},
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    }
                ],
                "TaskHandle": null
            },
            "connect-proxy-invoice": {
                "State": "running",
                "Failed": false,
                "Restarts": 0,
                "LastRestart": null,
                "StartedAt": "2024-04-04T13:55:53.086854811Z",
                "FinishedAt": null,
                "Events": [
                    {
                        "Type": "Received",
                        "Time": 1712238853932586579,
                        "Message": "",
                        "DisplayMessage": "Task received by client",
                        "Details": {},
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Task Setup",
                        "Time": 1712238920328470417,
                        "Message": "Building Task Directory",
                        "DisplayMessage": "Building Task Directory",
                        "Details": {
                            "message": "Building Task Directory"
                        },
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    },
                    {
                        "Type": "Started",
                        "Time": 1712238953086798075,
                        "Message": "",
                        "DisplayMessage": "Task started by client",
                        "Details": {},
                        "FailsTask": false,
                        "RestartReason": "",
                        "SetupError": "",
                        "DriverError": "",
                        "ExitCode": 0,
                        "Signal": 0,
                        "KillTimeout": 0,
                        "KillError": "",
                        "KillReason": "",
                        "StartDelay": 0,
                        "DownloadError": "",
                        "ValidationError": "",
                        "DiskLimit": 0,
                        "FailedSibling": "",
                        "VaultError": "",
                        "TaskSignalReason": "",
                        "TaskSignal": "",
                        "DriverMessage": "",
                        "GenericSource": ""
                    }
                ],
                "TaskHandle": null
            }
        },
        "DeploymentStatus": {
            "Healthy": false,
            "Timestamp": "2024-04-04T08:54:43.935741087-05:00",
            "Canary": true,
            "ModifyIndex": 23690280
        },
        "FollowupEvalID": "",
        "NextAllocation": "",
        "RescheduleTracker": null,
        "PreemptedAllocations": null,
        "PreemptedByAllocation": "",
        "CreateIndex": 23689767,
        "ModifyIndex": 23691359,
        "CreateTime": 1712238853910271848,
        "ModifyTime": 1712238954658221350
    }

Consul shows all 3 allocations are healthy.

Reproduction steps

We have 100s of successful auto scaling deployments every day across many services. I haven't been able to recreate on demand.

Expected Result

  • Ignoring the failure since we are scaling down.
    • I'm not sure why a canary was involved in a scaling down operation. Based on spot checking, it seems like this usually isn't the case?
  • After the progress deadline has been passed, revert the deployment and mark it as failed.

Actual Result

  • A deployment that is stuck as running indefinitely blocking out future autoscaler requests and interfering with agent drains.

Nomad Client logs

Consul logs for canary alloc:

Apr 04 08:55:20 p-mwsv-nc04 consul[1051]: {"@level":"error","@message":"Request error","@module":"agent.http","@timestamp":"2024-04-04T08:55:20.485001-05:00","error":"unknown service ID: _nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy","from":"@","method":"GET","url":"/v1/agent/service/_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 08:55:22 p-mwsv-nc04 consul[1051]: {"@level":"error","@message":"Request error","@module":"agent.http","@timestamp":"2024-04-04T08:55:22.019648-05:00","error":"unknown service ID: _nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy","from":"@","method":"GET","url":"/v1/agent/service/_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 08:55:24 p-mwsv-nc04 consul[1051]: {"@level":"error","@message":"Request error","@module":"agent.http","@timestamp":"2024-04-04T08:55:24.198084-05:00","error":"unknown service ID: _nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy","from":"@","method":"GET","url":"/v1/agent/service/_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 08:55:28 p-mwsv-nc04 consul[1051]: {"@level":"error","@message":"Request error","@module":"agent.http","@timestamp":"2024-04-04T08:55:28.583497-05:00","error":"unknown service ID: _nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy","from":"@","method":"GET","url":"/v1/agent/service/_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 08:55:36 p-mwsv-nc04 consul[1051]: {"@level":"error","@message":"Request error","@module":"agent.http","@timestamp":"2024-04-04T08:55:36.674509-05:00","error":"unknown service ID: _nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy","from":"@","method":"GET","url":"/v1/agent/service/_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 08:55:40 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2024-04-04T08:55:40.296370-05:00","service":"_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000"}
Apr 04 08:55:40 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2024-04-04T08:55:40.309195-05:00","service":"_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 08:55:40 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced check","@module":"agent","@timestamp":"2024-04-04T08:55:40.316722-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:1"}
Apr 04 08:55:41 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced check","@module":"agent","@timestamp":"2024-04-04T08:55:41.695784-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:1"}
Apr 04 08:55:44 p-mwsv-nc04 consul[1051]: {"@level":"warn","@message":"Check TCP connection failed","@module":"agent","@timestamp":"2024-04-04T08:55:44.736246-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:2","error":"dial tcp 10.20.50.159:22623: connect: connection refused"}
Apr 04 08:55:44 p-mwsv-nc04 consul[1051]: {"@level":"warn","@message":"Check is now critical","@module":"agent","@timestamp":"2024-04-04T08:55:44.736293-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:2"}  
Apr 04 08:55:44 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced check","@module":"agent","@timestamp":"2024-04-04T08:55:44.790898-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:2"}
Apr 04 08:55:55 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced check","@module":"agent","@timestamp":"2024-04-04T08:55:55.348619-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:1"}
Apr 04 08:55:55 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced check","@module":"agent","@timestamp":"2024-04-04T08:55:55.506208-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:1"}
Apr 04 08:55:57 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced check","@module":"agent","@timestamp":"2024-04-04T08:55:57.284271-05:00","check":"service:_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy:2"}
Apr 04 08:59:25 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2024-04-04T08:59:25.440198-05:00","service":"_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 09:01:22 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2024-04-04T09:01:22.973376-05:00","service":"_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 09:03:23 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2024-04-04T09:03:23.474654-05:00","service":"_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
Apr 04 09:06:15 p-mwsv-nc04 consul[1051]: {"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2024-04-04T09:06:15.079710-05:00","service":"_nomad-task-638d875b-b842-10cb-34b2-3439e9753929-group-app-invoice-8000-sidecar-proxy"}
...

Nomad logs don't turn up anything interesting on that alloc.

sluebbert avatar Apr 04 '24 16:04 sluebbert

Hi @sluebbert and thanks for raising this issue. I am wondering if the autoscaling is a trigger or the cause here and if this has uncovered a problem within deployments themselves. It might be hard for us to reproduce, so if you uncover any additional information, please let us know as it will be very useful. If you had the autoscaler logs from around the time this scaling action was triggered, that would also be useful and help create a reproduction in the future.

jrasell avatar Apr 05 '24 07:04 jrasell

It looks like autoscaling is just a trigger. We encountered the same issue with another job this morning, but due to migrations from draining a node.

Something just seems out of sync. Here are some screenshots taken now of a "deployment" that started a little over 30 minutes ago.

Overview UI: image

Deployments UI: image

Versions UI: image

We were draining and re-enabling nodes every few minutes 30 minutes ago for other purposes which may have help get us into this situation here.

sluebbert avatar Apr 09 '24 16:04 sluebbert