sdk-typescript
[Bug] Worker crashes with "async hook stack has become corrupted" on Workflow Task timeout
What are you really trying to do?
Tell us at a high level what you're doing, to avoid the XY problem (https://en.wikipedia.org/wiki/XY_problem).

We had a workflow that was still under active development. An older version of the workflow was still running when we changed the workflow code and deployed it, and we saw a crash in the system.
Describe the bug
Error: async hook stack has become corrupted (actual: 112970, expected: 3)
 1: 0x10229dab8 node::AsyncHooks::FailWithCorruptedAsyncStack(double) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 2: 0x102232164 node::AsyncHooks::pop_async_context(double) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 3: 0x102231d8c node::InternalCallbackScope::Close() [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 4: 0x10223266e node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>, node::async_context) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 5: 0x10224a238 node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 6: 0x102346af6 node::worker::MessagePort::OnMessage(node::worker::MessagePort::MessageProcessingMode) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 7: 0x102cd275b uv__async_io [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 8: 0x102ce654b uv__io_poll [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x102cd2ce1 uv_run [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
10: 0x102232f9f node::SpinEventLoop(node::Environment*) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
11: 0x1023a9cae node::worker::Worker::Run() [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
12: 0x1023ad6a2 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) [/Users/random/.nvm/versions/node/v16.16.0/bin/node]
13: 0x7ff81c0e84e1 _pthread_start [/usr/lib/system/libsystem_pthread.dylib]
14: 0x7ff81c0e3f6b thread_start [/usr/lib/system/libsystem_pthread.dylib]
[nodemon] app crashed - waiting for file changes before starting...
^C
Interesting... this looks like a Node.js issue. Can you try running with different Node.js versions?
Actually, I figured out the issue. It was a bug in our workflow code that led to an infinite loop. But I feel this should be handled better by Temporal, with proper error messaging.
My workflow has code like this:
while (true) {
  // FIX: currentState = WorkflowState.COMPLETE;
  await condition(() => currentState !== WorkflowState.COMPLETE);
  if (currentState === WorkflowState.REVIEWED_AND_SAVED) {
    // call an activity and
    return response;
  }
}
The commented-out line marked // FIX: is the fix.
The problem here is:
- currentState is updated by a Signal.
- When the Signal sets currentState to anything other than WorkflowState.REVIEWED_AND_SAVED or WorkflowState.COMPLETE, the condition predicate stays true forever, so condition() resolves immediately and the if branch is never taken.
- The loop then spins continuously without ever waiting, which ties up the worker (see the sketch after this list).
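For concreteness, here is a hypothetical reconstruction of the workflow shape described above (the enum values, signal name, and function names are my assumptions, not the original code); it shows how a Signal value outside the two checked states turns the loop into a busy-wait:

import { condition, defineSignal, setHandler } from '@temporalio/workflow';

enum WorkflowState {
  COMPLETE = 'COMPLETE',
  IN_REVIEW = 'IN_REVIEW', // hypothetical intermediate state
  REVIEWED_AND_SAVED = 'REVIEWED_AND_SAVED',
}

export const setStateSignal = defineSignal<[WorkflowState]>('setState');

export async function reviewWorkflow(): Promise<string> {
  let currentState = WorkflowState.COMPLETE;
  setHandler(setStateSignal, (state) => {
    currentState = state;
  });

  while (true) {
    // Without resetting currentState to COMPLETE here (the // FIX: line above),
    // a Signal that sets it to IN_REVIEW makes this predicate permanently true,
    // so condition() resolves immediately on every iteration and the loop never
    // yields.
    await condition(() => currentState !== WorkflowState.COMPLETE);
    if (currentState === WorkflowState.REVIEWED_AND_SAVED) {
      // call an activity and return its response
      return 'done';
    }
  }
}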
I don't know how much we can handle this since it looks like the error happens deep in Node internals. I can see why this happens though. Thanks for the report.
The SDK does protect you from tight loops by setting a timeout on workflow activations, but it looks like something else is going on here.
It looks like async_hooks and the vm context execution timeout clash. We should try to reproduce this and see if there's anything the SDK can do to mitigate it, but I suspect the issue is an unhandled edge case in Node.js.
Here is a repro without condition:
import { CancellationScope } from '@temporalio/workflow';

// Fails with corrupted async stack
export async function testWorkflow(): Promise<void> {
  while (true) {
    await new CancellationScope({ cancellable: true }).run(() =>
      Promise.resolve(true)
    );
  }
}
This is obviously an endless loop. I'd expect the Workflow Task to end with "Error: Script execution timed out after 10000ms", but instead I get the async hook stack corruption described above.
My investigation narrowed it down to an issue in Node.js involving the combination of vm + microtaskMode: 'afterEvaluate' + runInContext with a timeout + AsyncLocalStorage. I know for sure that it happens after the runInContext timeout fires. It seems like either the AsyncLocalStorage internal state is not properly cleaned up on context timeout, or there are still some live functions (probably native code) attached to that context's AsyncLocalStorage that try to access the storage after it has been cleaned up.
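As a rough illustration of that combination (this is not the SDK's actual sandbox code, and not a verified repro; whether it triggers the crash likely depends on Node version and timing), the ingredients look roughly like this:

import { AsyncLocalStorage } from 'node:async_hooks';
import * as vm from 'node:vm';

const als = new AsyncLocalStorage<{ id: number }>();

// A context that drains microtasks after each evaluation, like the workflow sandbox.
const context = vm.createContext(
  {
    // Expose a function that enters the AsyncLocalStorage context from inside the vm.
    withStore: (fn: () => void) => als.run({ id: 1 }, fn),
  },
  { microtaskMode: 'afterEvaluate' }
);

// Run a script with a timeout; the problematic situation is the timeout firing
// while AsyncLocalStorage state is still associated with the context.
try {
  vm.runInContext(
    `
    withStore(() => Promise.resolve().then(() => {}));
    while (true) {}
    `,
    context,
    { timeout: 100 }
  );
} catch (err) {
  // Expected: "Script execution timed out after 100ms"
  console.error(err);
}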
The proposed solution is to replace Node's built-in but experimental implementation of AsyncLocalStorage with our own implementation.
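A minimal sketch of that direction (my assumption of what such a replacement could look like, not the SDK's actual implementation): since workflow code runs single-threaded and deterministically inside the sandbox, a stack-based store can cover the synchronous parts of run(); propagating the store across awaits would additionally require hooking promise creation, which the sandbox controls.

class SimpleAsyncLocalStorage<T> {
  // Stack of active stores; the top entry is the current store.
  private readonly stack: T[] = [];

  run<R>(store: T, fn: () => R): R {
    this.stack.push(store);
    try {
      return fn();
    } finally {
      this.stack.pop();
    }
  }

  getStore(): T | undefined {
    return this.stack[this.stack.length - 1];
  }
}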