
Aws::Crt::Io::ClientBootstrap destructor may launch thread at process exit, and crash

Open pitrou opened this issue 4 years ago • 12 comments


Describe the bug: We're getting a user report of a crash, seemingly at process shutdown, on Windows: https://github.com/conda-forge/arrow-cpp-feedstock/issues/567

Apparently the ClientBootstrap destructor can indirectly trigger the launch of a new thread using aws_thread_launch. The thread launch fails at process shutdown, at least on Windows, triggering an assertion error and therefore a process crash.

SDK version number: 1.9.120

Platform/OS/Hardware/Device: Windows/10.0.17763 (also reported on CentOS 8 and Ubuntu: https://issues.apache.org/jira/browse/ARROW-15141)

To Reproduce (observed behavior): Basically https://github.com/conda-forge/arrow-cpp-feedstock/issues/567#issue-1047929850, but I'm not sure what the exact steps are (I'm not the original reporter).

Expected behavior: Failing to launch a thread at process shutdown should probably not crash the process.
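
To illustrate the expected behavior, here is a minimal C++ sketch of one possible fallback: try to run the cleanup on a new thread, and if the thread cannot be created, run the cleanup inline instead of aborting. This is not how aws-c-io implements its event loop shutdown. `run_cleanup_with_fallback` and the `simulate_launch_failure` flag are hypothetical names introduced only for this illustration.

```cpp
#include <functional>
#include <system_error>
#include <thread>

// Hypothetical helper, not part of aws-c-io or aws-sdk-cpp.
// Tries to run `cleanup` on a new thread. If the thread cannot be created
// (std::thread reports this by throwing std::system_error), the cleanup is
// run on the calling thread instead of crashing the process.
// Returns true if a thread was used, false if the inline fallback ran.
inline bool run_cleanup_with_fallback(const std::function<void()>& cleanup,
                                      bool simulate_launch_failure = false)
{
    std::thread worker;
    try {
        if (simulate_launch_failure) {
            // Stand-in for the OS refusing to create a thread.
            throw std::system_error(
                std::make_error_code(std::errc::resource_unavailable_try_again));
        }
        worker = std::thread(cleanup);
    } catch (const std::system_error&) {
        cleanup();  // Degrade gracefully: do the work on the calling thread.
        return false;
    }
    worker.join();  // Wait until the cleanup has finished.
    return true;
}
```

Whether running the cleanup inline is actually safe depends on what the cleanup does, so this only shows the shape of a non-fatal failure path.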

pitrou avatar Nov 09 '21 15:11 pitrou

@xhochy

pitrou avatar Nov 09 '21 16:11 pitrou

Does not happen with SDK version 1.8.186.

kylekeppler avatar Nov 10 '21 16:11 kylekeppler

In case it could help narrow down the source of the bug, I tested a few different versions of aws-sdk-cpp on CentOS 7:

  • No issue: 1.8.186
  • Bug: 1.9.120, 1.9.140

jdblischak avatar Dec 03 '21 16:12 jdblischak

Hi @jdblischak, can you share how you are reproducing this? In the post linked by the op there seems to be a fix provided by conda-forge so I'm wondering if this is on their side rather than the sdk?

KaibaLopez avatar Dec 15 '21 18:12 KaibaLopez

The fix on the conda-forge side was to revert back to the 1.8.186 SDK version. With the current issue, we cannot use a newer SDK on Windows.

xhochy avatar Dec 15 '21 18:12 xhochy

@KaibaLopez Thanks for following up.

> In the post linked by the op there seems to be a fix provided by conda-forge so I'm wondering if this is on their side rather than the sdk?

As @xhochy commented, the conda-forge workaround is to pin to an older version of aws-sdk-cpp. Personally I fixed it by specifying aws-sdk-cpp=1.8.186=h9ad65fb_2 for my conda env.

> Can you share how you are reproducing this?

I was able to reproduce the bug using the code below:

mamba create -n test-aws python=3.9 pandas=1.2 pyarrow=2.0 aws-sdk-cpp=1.9.120
conda activate test-aws
python test-arrow.py

where test-arrow.py is the reproducible example script copied from https://github.com/conda-forge/arrow-cpp-feedstock/issues/567

import numpy as np
import pandas as pd

def test_error():

    df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

    df.to_parquet('test.parquet')

if __name__ == '__main__':
    test_error()

Here is the full error and traceback that I observe:

% python test-arrow.py
Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x2aaac1581f19]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x2aaac1573098]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x2aaac17bca43]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x2aaac1583fad]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x2aaac17ba35a]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x2aaac1583fad]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x2aaac1526f5a]
~/mambaforge/envs/test-aws/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x2aaac0faa570]
/lib64/libc.so.6(+0x39c99) [0x2aaaab835c99]
/lib64/libc.so.6(+0x39ce7) [0x2aaaab835ce7]
/lib64/libc.so.6(__libc_start_main+0xfc) [0x2aaaab81e50c]
python(+0x20aa51) [0x55555575ea51]
Aborted

jdblischak avatar Dec 15 '21 21:12 jdblischak

Update: this error has been reported on Ubuntu and CentOS as well: https://issues.apache.org/jira/browse/ARROW-15141

pitrou avatar Dec 16 '21 22:12 pitrou

@ihnorton have you run into something similar for tiledb?

jeroen avatar Jan 06 '22 17:01 jeroen

No, we are still on 1.8, and that backtrace does not ring a bell.

ihnorton avatar Jan 06 '22 19:01 ihnorton

Hi. Just to say we have hit this exact same issue using v1.9.72 within our in-house build at MathWorks.

asp200 avatar Jan 17 '22 14:01 asp200

We also see this on Windows now (while updating).

  • @KaibaLopez the code in question is here:

https://github.com/awslabs/aws-c-io/blob/b5cad3d21018e84a5084d6e191661fa604b49f0c/source/event_loop.c#L73-L75

  • aws_thread_launch uses the Win32 CreateThread API:

https://github.com/awslabs/aws-c-common/blob/cba230815132f53206c501874e03a286765fb225/source/windows/thread.c#L258-L259

  • it is documented here that CreateThread is not valid when the process is exiting:

> The ExitProcess, ExitThread, CreateThread, CreateRemoteThread functions, and a process that is starting (as the result of a call by CreateProcess) are serialized between each other within a process. Only one of these events can happen in an address space at a time. This means that the following restrictions hold:

  • you can see a backtrace here where this error message happens after ExitProcess has been called

ihnorton avatar Jul 26 '22 02:07 ihnorton

Was this resolved by https://github.com/awslabs/aws-c-io/pull/515?

kkraus14 avatar Oct 17 '22 15:10 kkraus14

> Was this resolved by awslabs/aws-c-io#515?

I don't know this repo, but from looking around, it seems that:

  • the referenced PR is part of aws-c-io 0.13.5
  • latest aws-sdk-cpp release seems to use aws-c-io 0.10.20

The last update of the aws-c-io version used in this repo was in June (NB: at which point 5 newer releases of aws-c-io would already have been available). Perhaps @sdavtaker (author of the last update) can shed light on what's necessary to update the respective dependencies.

This issue is biting us quite hard in conda-forge, made worse by the fact that aws-sdk-cpp 1.8 does not seem compatible anymore with current versions of the rest of the aws-c-* stack (which we need to unbundle for several reasons).

I also noticed that this problem still comes up regularly in other repos, e.g. https://github.com/huggingface/datasets/issues/3310

Furthermore: This bug also happens outside pyarrow, I incorporate AWS in a standalone Windows C-program and that crashes during exit.

So it would be really good if we could upgrade aws-c-io here and then determine if that actually fixes things...

h-vetinari avatar Oct 25 '22 01:10 h-vetinari

pyarrow 10.0.1 was just released in conda-forge, which is the first release where we're building against aws-sdk-cpp 1.9.* again after more than a year. Since we cannot test the failure reported here on our infra, I'd be very grateful if someone could verify that the problem does or doesn't reappear. 🙃

conda install -c conda-forge pyarrow=10

Edit: if things are fine, I'm happy to backport this to arrow 6.x-9.x.

h-vetinari avatar Dec 04 '22 09:12 h-vetinari

Confirmed. Thanks @h-vetinari! See reproducible example at https://github.com/conda-forge/arrow-cpp-feedstock/issues/567#issuecomment-1344764356

jdblischak avatar Dec 09 '22 20:12 jdblischak

In case someone else is still facing it...

I had the same issue, but it was caused by Aws::ShutdownAPI not being called correctly.

https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/basic-use.html

cardinotGV avatar May 12 '23 19:05 cardinotGV

As @cardinotGV said, please make sure that you are calling InitAPI and ShutdownAPI correctly:

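#include <aws/core/Aws.h>

int main(int argc, char** argv)
{
   Aws::SDKOptions options;
   Aws::InitAPI(options);  // must run before any other SDK call
   {
      // Make your SDK calls here. Keeping them inside this inner scope
      // ensures all SDK objects are destroyed before ShutdownAPI runs.
   }
   Aws::ShutdownAPI(options);  // must run after all SDK objects are gone
   return 0;
}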
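
One way to make that ordering harder to get wrong is a small scope guard. The sketch below is only an illustration: so that it compiles without the SDK, `fake_init_api`, `fake_shutdown_api`, `FakeClient`, `ApiGuard`, `event_log` and `run_with_sdk` are all hypothetical stand-ins, with the two `fake_*` functions taking the place of Aws::InitAPI(options) and Aws::ShutdownAPI(options).

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins so the sketch compiles without the SDK. In real
// code these would be Aws::InitAPI(options) and Aws::ShutdownAPI(options).
inline std::vector<std::string>& event_log()
{
    static std::vector<std::string> log;
    return log;
}
inline void fake_init_api() { event_log().push_back("init"); }
inline void fake_shutdown_api() { event_log().push_back("shutdown"); }

// Scope guard: initializes on construction, shuts down on destruction.
class ApiGuard {
public:
    ApiGuard() { fake_init_api(); }
    ~ApiGuard() { fake_shutdown_api(); }
    ApiGuard(const ApiGuard&) = delete;
    ApiGuard& operator=(const ApiGuard&) = delete;
};

// Stand-in for an SDK client; records when it is destroyed.
struct FakeClient {
    ~FakeClient() { event_log().push_back("client destroyed"); }
};

inline void run_with_sdk()
{
    ApiGuard guard;     // Declared first, so it is destroyed last.
    FakeClient client;  // Destroyed before the guard, i.e. before shutdown.
    event_log().push_back("work");
}
```

Because local objects are destroyed in reverse order of declaration, anything declared after the guard is gone before the shutdown call runs. The guard does not help with SDK objects that have static lifetime, since those are destroyed after main returns.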

If you are still running into any crashes at process exit, please let me know.

jmklix avatar Sep 18 '23 23:09 jmklix
