aws-sdk-cpp icon indicating copy to clipboard operation
aws-sdk-cpp copied to clipboard

InitAPI+ShutdownAPI+InitAPI crashes

Open jrosdahl opened this issue 3 years ago • 3 comments

Describe the bug

I have a test suite where each test calls Aws::InitAPI, then executes a test and finally calls Aws::ShutdownAPI. The second test then crashes in Aws::InitAPI. This problem started occurring when upgrading aws-sdk-cpp from version 1.8.186 to 1.9.352.

This looks similar to #2102, but that issue is closed and I don't quite understand what the reporter's solution mentioned in https://github.com/aws/aws-sdk-cpp/issues/2102#issuecomment-1254440532 would be.

Expected Behavior

A second Aws::InitAPI does not crash after Aws::InitAPI + Aws::ShutdownAPI.

Current Behavior

A second Aws::InitAPI crashes after Aws::InitAPI + Aws::ShutdownAPI.

Reproduction Steps

Compile and run the following program, linking against aws-sdk-cpp 1.9.352:

#include <aws/core/Aws.h>

int main()
{
  Aws::SDKOptions options;

  Aws::InitAPI(options);
  Aws::ShutdownAPI(options);

  Aws::InitAPI(options); // line 10
  Aws::ShutdownAPI(options);
}

The program will crash at line 10.

GDB backtrace:

(gdb) bt
#0  0x00007ffff791166f in s2n_x509_trust_store_wipe (store=0x118) at crt/aws-crt-cpp/crt/s2n/tls/s2n_x509_validator.c:135
#1  0x00007ffff78e0988 in s2n_config_cleanup (config=0x0) at crt/aws-crt-cpp/crt/s2n/tls/s2n_config.c:125
#2  0x00007ffff78e1373 in s2n_config_free (config=0x0) at crt/aws-crt-cpp/crt/s2n/tls/s2n_config.c:360
#3  0x00007ffff78bfa84 in s_s2n_ctx_destroy (s2n_ctx=0x82ba20) at crt/aws-crt-cpp/crt/aws-c-io/source/s2n/s2n_tls_channel_handler.c:1292
#4  0x00007ffff78c0bd1 in s_tls_ctx_new (alloc=0x7ffff7dd1620 <Aws::get_aws_allocator()::wrapper>, options=0x7fffffffbb68, mode=S2N_CLIENT) at crt/aws-crt-cpp/crt/aws-c-io/source/s2n/s2n_tls_channel_handler.c:1613
#5  0x00007ffff78c0c3c in aws_tls_client_ctx_new (alloc=0x7ffff7dd1620 <Aws::get_aws_allocator()::wrapper>, options=0x7fffffffbb68) at crt/aws-crt-cpp/crt/aws-c-io/source/s2n/s2n_tls_channel_handler.c:1625
#6  0x00007ffff783b758 in Aws::Crt::Io::TlsContext::TlsContext (this=0x7fffffffbcd0, options=..., mode=Aws::Crt::Io::TlsMode::CLIENT, allocator=0x7ffff7dd1620 <Aws::get_aws_allocator()::wrapper>) at crt/aws-crt-cpp/source/io/TlsOptions.cpp:423
#7  0x00007ffff773ebe2 in Aws::InitAPI (options=...) at aws-cpp-sdk-core/source/Aws.cpp:79
#8  0x0000000000400a2a in main () at reproduce.cpp:10

Valgrind output:

==176222== Memcheck, a memory error detector
==176222== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==176222== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==176222== Command: ./reproduce
==176222== 
==176222== Invalid read of size 8
==176222==    at 0x52E166F: s2n_x509_trust_store_wipe (s2n_x509_validator.c:135)
==176222==    by 0x52B0987: s2n_config_cleanup (s2n_config.c:125)
==176222==    by 0x52B1372: s2n_config_free (s2n_config.c:360)
==176222==    by 0x528FA83: s_s2n_ctx_destroy (s2n_tls_channel_handler.c:1292)
==176222==    by 0x5290BD0: s_tls_ctx_new (s2n_tls_channel_handler.c:1613)
==176222==    by 0x5290C3B: aws_tls_client_ctx_new (s2n_tls_channel_handler.c:1625)
==176222==    by 0x520B757: Aws::Crt::Io::TlsContext::TlsContext(Aws::Crt::Io::TlsContextOptions&, Aws::Crt::Io::TlsMode, aws_allocator*) (TlsOptions.cpp:423)
==176222==    by 0x510EBE1: Aws::InitAPI(Aws::SDKOptions const&) (Aws.cpp:79)
==176222==    by 0x400A29: main (reproduce.cpp:10)
==176222==  Address 0x118 is not stack'd, malloc'd or (recently) free'd
==176222== 
==176222== 
==176222== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==176222==  Access not within mapped region at address 0x118
==176222==    at 0x52E166F: s2n_x509_trust_store_wipe (s2n_x509_validator.c:135)
==176222==    by 0x52B0987: s2n_config_cleanup (s2n_config.c:125)
==176222==    by 0x52B1372: s2n_config_free (s2n_config.c:360)
==176222==    by 0x528FA83: s_s2n_ctx_destroy (s2n_tls_channel_handler.c:1292)
==176222==    by 0x5290BD0: s_tls_ctx_new (s2n_tls_channel_handler.c:1613)
==176222==    by 0x5290C3B: aws_tls_client_ctx_new (s2n_tls_channel_handler.c:1625)
==176222==    by 0x520B757: Aws::Crt::Io::TlsContext::TlsContext(Aws::Crt::Io::TlsContextOptions&, Aws::Crt::Io::TlsMode, aws_allocator*) (TlsOptions.cpp:423)
==176222==    by 0x510EBE1: Aws::InitAPI(Aws::SDKOptions const&) (Aws.cpp:79)
==176222==    by 0x400A29: main (reproduce.cpp:10)

Possible Solution

From the backtrace, it seems incorrect for s_s2n_ctx_destroy (frame #⁠3) to pass a null config to s2n_config_free, given that s2n_config_free doesn't perform a null check. But how come the config is null and the init code path is calling cleanup functions?

As far as I can tell, this is what happens:

  1. The first Aws::InitAPI calls Aws::InitializeCrtAws::New<Aws::Crt::ApiHandle>Aws::Crt::ApiHandle::ApiHandleAws::Crt::s_initApiaws_mqtt_library_initaws_io_library_initaws_tls_init_static_states2n_disable_atexit, which sets atexit_cleanup = false.
  2. aws_tls_init_static_state then calls s2n_init, which sets initialized = true.
  3. The first Aws::ShutdownAPI calls Aws::CleanupCrtAws::Delete<Aws::Crt::ApiHandle>Aws::Crt::ApiHandle::~ApiHandleaws_s3_library_clean_upaws_http_library_clean_upaws_io_library_clean_upaws_tls_clean_up_static_states2n_cleanup. initialized is still true and atexit_cleanup is still false.
  4. The second Aws::InitAPI calls Aws::InitializeCrtAws::Crt::ApiHandle::ApiHandleAws::Crt::s_initApiaws_mqtt_library_initaws_io_library_initaws_tls_init_static_states2n_disable_atexit, which fails since initialized is true.
  5. aws_tls_init_static_state sets s_s2n_initialized_externally = true and therefore doesn't call s2n_init.
  6. Aws::InitAPIAws::Crt::Io::TlsContext::TlsContextaws_tls_client_ctx_news_tls_ctx_news2n_config_new fails since s2n_init hasn't been called.
  7. s_s2n_ctx_destroy is called due to the above failure.
  8. s2n_config_free is not prepared for freeing a half-initialized s2n_ctx.

I haven't verified this, but from code inspection I think that the regression might have been introduced in https://github.com/awslabs/aws-c-io/commit/4e462a0687d00d18cfdece3cd7a427d8a9b2e13d since before that change the return value of s2n_disable_atexit did not affect whether aws_tls_init_static_state calls s2n_init.

To me it looks like the problem is that Aws::ShutdownAPI doesn't fully clean up the s2n state, so the fix I came up with is to let s2n_cleanup_atexit_impl reset initialized and atexit_cleanup to their original values:

--- a/crt/aws-crt-cpp/crt/s2n/utils/s2n_init.c
+++ b/crt/aws-crt-cpp/crt/s2n/utils/s2n_init.c
@@ -88,6 +88,9 @@ static bool s2n_cleanup_atexit_impl(void)
     /* the configs need to be wiped before resetting the memory callbacks */
     s2n_wipe_static_configs();
 
+    initialized = false;
+    atexit_cleanup = true;
+
     return s2n_result_is_ok(s2n_rand_cleanup_thread()) &&
            s2n_result_is_ok(s2n_rand_cleanup()) &&
            s2n_result_is_ok(s2n_locking_cleanup()) &&

Additional Information/Context

No response

AWS CPP SDK version used

1.9.352

Compiler and Version used

gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)

Operating System and version

CentOS 7.9.2009

jrosdahl avatar Sep 28 '22 08:09 jrosdahl

Got the very same problem, was just going to open an issue

Same backtraces, same conclusions on investigation: problem is initialized is true in s2n_init.c at the moment of re-initialization. Didn't try to fix it though

EgorkaZ avatar Sep 28 '22 08:09 EgorkaZ

Trying to reproduce this, but I'm not seeing any crash on Mac. @EgorkaZ which OS and SDK version are you seeing this on?

jmklix avatar Sep 29 '22 00:09 jmklix

OS: Ubuntu 20.04 SDK: 1.9.342

Can try it on Mac. AFAIR the code, it should reproduce there too. @jmklix

EgorkaZ avatar Sep 29 '22 15:09 EgorkaZ

We have fixed some memory leaks and changed the build process since this was opened. Can you confirm if you are still seeing this crash?

jmklix avatar Mar 23 '23 21:03 jmklix

We have fixed some memory leaks and changed the build process since this was opened. Can you confirm if you are still seeing this crash?

I see no crash with version 1.10.49.

jrosdahl avatar Mar 24 '23 08:03 jrosdahl

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] avatar Mar 24 '23 17:03 github-actions[bot]