sdk-typescript
[Bug] Worker crashes with "async hook stack has become corrupted" on Workflow Task timeout
What are you really trying to do?
Tell us at a high level what you're doing, to avoid the XY problem (https://en.wikipedia.org/wiki/XY_problem).

We had a workflow that was still under active development. An older version of the workflow was still running when we changed the workflow code and deployed it, and we saw a crash in the system.
Describe the bug
Error: async hook stack has become corrupted (actual: 112970, expected: 3)
 1: 0x10229dab8 node::AsyncHooks::FailWithCorruptedAsyncStack(double) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 2: 0x102232164 node::AsyncHooks::pop_async_context(double) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 3: 0x102231d8c node::InternalCallbackScope::Close() [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 4: 0x10223266e node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>, node::async_context) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 5: 0x10224a238 node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 6: 0x102346af6 node::worker::MessagePort::OnMessage(node::worker::MessagePort::MessageProcessingMode) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 7: 0x102cd275b uv__async_io [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 8: 0x102ce654b uv__io_poll [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x102cd2ce1 uv_run [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
10: 0x102232f9f node::SpinEventLoop(node::Environment*) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
11: 0x1023a9cae node::worker::Worker::Run() [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
12: 0x1023ad6a2 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
13: 0x7ff81c0e84e1 _pthread_start [/usr/lib/system/libsystem_pthread.dylib]
14: 0x7ff81c0e3f6b thread_start [/usr/lib/system/libsystem_pthread.dylib]
[nodemon] app crashed - waiting for file changes before starting...
^C
Interesting... this looks like a Node.js issue. Can you try running with different Node.js versions?
Actually, I figured out the issue. It was a bug in our workflow code that led to an infinite loop. But I feel this should be handled better by Temporal, with proper error messaging.
My workflow has code like this:
while (true) {
  // FIX: currentState = WorkflowState.COMPLETE;
  await condition(() => currentState !== WorkflowState.COMPLETE);
  if (currentState === WorkflowState.REVIEWED_AND_SAVED) {
    // call an activity and
    return response;
  }
}
The commented-out line marked // FIX: is the fix.
The problem here is:
- currentState is updated by a Signal.
- When the Signal sets currentState to anything other than WorkflowState.REVIEWED_AND_SAVED or WorkflowState.COMPLETE, the condition predicate stays true forever, so condition() resolves immediately and the if branch is never taken.
- The loop then spins continuously without ever waiting, which ties up the worker (see the sketch after this list).
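For concreteness, here is a hypothetical reconstruction of the workflow shape described above (the enum values, signal name, and function names are my assumptions, not the original code); it shows how a Signal value outside the two checked states turns the loop into a busy-wait:

import { condition, defineSignal, setHandler } from '@temporalio/workflow';

enum WorkflowState {
  COMPLETE = 'COMPLETE',
  IN_REVIEW = 'IN_REVIEW', // hypothetical intermediate state
  REVIEWED_AND_SAVED = 'REVIEWED_AND_SAVED',
}

export const setStateSignal = defineSignal<[WorkflowState]>('setState');

export async function reviewWorkflow(): Promise<string> {
  let currentState = WorkflowState.COMPLETE;
  setHandler(setStateSignal, (state) => {
    currentState = state;
  });

  while (true) {
    // Without resetting currentState to COMPLETE here (the // FIX: line above),
    // a Signal that sets it to IN_REVIEW makes this predicate permanently true,
    // so condition() resolves immediately on every iteration and the loop never
    // yields.
    await condition(() => currentState !== WorkflowState.COMPLETE);
    if (currentState === WorkflowState.REVIEWED_AND_SAVED) {
      // call an activity and return its response
      return 'done';
    }
  }
}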
I don't know how much we can handle this since it looks like the error happens deep in Node internals. I can see why this happens though. Thanks for the report.
The SDK does protect you from tight loops by setting a timeout on workflow activations, but it looks like something else is going on here.
It looks like async_hooks and the vm context execution timeout clash. We should try to reproduce this and see if there's anything the SDK can do to mitigate it, but I suspect the issue is an unhandled edge case in Node.js.
Here is a repro without condition:
import { CancellationScope } from '@temporalio/workflow';

// Fails with corrupted async stack
export async function testWorkflow(): Promise<void> {
  while (true) {
    await new CancellationScope({ cancellable: true }).run(() =>
      Promise.resolve(true)
    );
  }
}
This is obviously an endless loop. I'd expect the Workflow Task to end with "Error: Script execution timed out after 10000ms", but instead I get the async hook stack corruption described above.
My investigation narrowed it down to an issue in Node.js involving the combination of vm + microtaskMode: 'afterEvaluate' + runInContext with a timeout + AsyncLocalStorage. I know for sure that it happens after the runInContext timeout fires. It seems like either the AsyncLocalStorage internal state is not properly cleaned up on context timeout, or there are still some live functions (probably native code) attached to that context's AsyncLocalStorage that try to access the storage after it has been cleaned up.
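As a rough illustration of that combination (this is not the SDK's actual sandbox code, and not a verified repro; whether it triggers the crash likely depends on Node version and timing), the ingredients look roughly like this:

import { AsyncLocalStorage } from 'node:async_hooks';
import * as vm from 'node:vm';

const als = new AsyncLocalStorage<{ id: number }>();

// A context that drains microtasks after each evaluation, like the workflow sandbox.
const context = vm.createContext(
  {
    // Expose a function that enters the AsyncLocalStorage context from inside the vm.
    withStore: (fn: () => void) => als.run({ id: 1 }, fn),
  },
  { microtaskMode: 'afterEvaluate' }
);

// Run a script with a timeout; the problematic situation is the timeout firing
// while AsyncLocalStorage state is still associated with the context.
try {
  vm.runInContext(
    `
    withStore(() => Promise.resolve().then(() => {}));
    while (true) {}
    `,
    context,
    { timeout: 100 }
  );
} catch (err) {
  // Expected: "Script execution timed out after 100ms"
  console.error(err);
}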
The proposed solution is to replace Node's built-in but experimental implementation of AsyncLocalStorage with our own implementation.
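A minimal sketch of that direction (my assumption of what such a replacement could look like, not the SDK's actual implementation): since workflow code runs single-threaded and deterministically inside the sandbox, a stack-based store can cover the synchronous parts of run(); propagating the store across awaits would additionally require hooking promise creation, which the sandbox controls.

class SimpleAsyncLocalStorage<T> {
  // Stack of active stores; the top entry is the current store.
  private readonly stack: T[] = [];

  run<R>(store: T, fn: () => R): R {
    this.stack.push(store);
    try {
      return fn();
    } finally {
      this.stack.pop();
    }
  }

  getStore(): T | undefined {
    return this.stack[this.stack.length - 1];
  }
}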