
Health Check Scripts can sometimes become hung indefinitely on a tentacle, blocking deployments from running.

Open LukeButters opened this issue 2 years ago • 0 comments

Severity

No response

Version

Seen on 2023.2.3264

Latest Version

None

What happened?

Very occasionally, health check PowerShell scripts hang and make no progress, preventing deployments to that target until the health check is manually cancelled.

Reproduction

Not possible

Error and Stacktrace

07:19:03   Verbose  |     Performing health check on machine
07:19:03   Verbose  |     Acquiring isolation mutex RunningScript with NoIsolation in ServerTasks-81966
07:19:03   Verbose  |     Executable directory is C:\Windows\system32\WindowsPowershell\v1.0
07:19:03   Verbose  |     Executable name or full path: C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe
07:19:03   Verbose  |     No user context provided. Running as current user.
07:19:03   Verbose  |     Starting C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe in working directory 'D:\Octopus\Work\123' using 'redacted' encoding running as 'redacted' with the same environment variables as the launching process
07:19:07   Verbose  |     Process C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe in D:\Octopus\Work\123 exited with code 0
07:19:07   Verbose  |     Exit code: 0
07:19:07   Verbose  |     Acquiring isolation mutex RunningScript with NoIsolation in ServerTasks-81966
07:19:07   Verbose  |     Executable directory is C:\Windows\system32\WindowsPowershell\v1.0
07:19:07   Verbose  |     Executable name or full path: C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe
07:19:07   Verbose  |     No user context provided. Running as current user.
07:19:07   Verbose  |     Starting C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe in working directory 'D:\Octopus\Work\124' using 'redacted' encoding running as 'redacted' with the same environment variables as the launching process
                    |   

No further output was seen for 2 days, at which point the health check was cancelled.

More Information

The issue here is that:

  • The hung health check blocks deployment to the target.
  • The default health check script typically takes 1s or less to run, yet it is allowed to run for days.

Also worth noting:

  • The default health check is idempotent, making it a good candidate for retrying the script when it hangs.
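
To illustrate the retry idea, here is a minimal Python sketch of running an idempotent script with a per-attempt timeout and re-running it if it hangs. This is not Octopus Server or Tentacle code; the function name `run_with_retry`, its parameters, and the timeout and attempt values are all hypothetical and chosen only for illustration.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_with_retry(
    command: Sequence[str],
    timeout_seconds: float,
    max_attempts: int,
) -> Tuple[int, int]:
    """Run an idempotent command, killing and re-running it if it hangs.

    Returns (exit_code, attempts_used). Raises TimeoutError if every
    attempt exceeds timeout_seconds. All names here are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires.
            completed = subprocess.run(command, timeout=timeout_seconds)
            return completed.returncode, attempt
        except subprocess.TimeoutExpired:
            # Retrying is only safe because the command is idempotent.
            continue
    raise TimeoutError(f"command hung on all {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in for a health check script: a trivial command that exits at once.
    exit_code, attempts = run_with_retry(
        [sys.executable, "-c", "pass"], timeout_seconds=30, max_attempts=3
    )
    print(exit_code, attempts)
```

In this sketch a hang on every attempt is raised as a distinct error rather than folded into the exit code, so a caller could tell "the script ran and failed" apart from "the script never finished".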

See also

  • SC-68672
  • https://github.com/OctopusDeploy/Issues/issues/8581
  • https://github.com/OctopusDeploy/Issues/issues/8118 Note that the fix in this ticket has some drawbacks:
    • Customers get alerts about "failed health checks" when really the target was healthy and we were able to communicate with it.
    • The error only occurs very occasionally, so retrying would probably give a more accurate picture of the state of the tentacle.
    • Because it cancels the health check, we don't find out whether the tentacle is in fact in a state in which it cannot run scripts, so an unhealthy tentacle may be left marked as healthy.

Workaround

Manually cancel the health check.

LukeButters avatar Jan 25 '24 00:01 LukeButters