aws-sdk-cpp
Occasional segfaults in event_loop teardown due to dead logger
Describe the bug
I am using distributed and pyarrow; the latter links to this SDK for S3 file access.
Using version 1.11.156, I see (intermittently, so presumably a race condition of some kind) a segfault in the "EvntLoopCleanup" thread that the AWS SDK sets up.
See backtrace below.
Expected Behavior
No segfault
Current Behavior
Backtrace:
(gdb) info thread
Id Target Id Frame
1 Thread 0x7ffff7fe8740 (LWP 54918) "python" _dl_close_worker (map=map@entry=0x5555c05d20a0,
force=force@entry=false) at dl-close.c:794
3 Thread 0x7fff809ff700 (LWP 54931) "jemalloc_bg_thd" 0x00007ffff7bc2065 in futex_abstimed_wait_cancelable (
private=<optimized out>, abstime=0x7fff809fed70, expected=0, futex_word=0x7fff81016634)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
5 Thread 0x7fff15d9e700 (LWP 54934) "cuda-EvtHandlr" 0x00007ffff7132bb9 in __GI___poll (fds=0x55555b07c900, nfds=2,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
10 Thread 0x7fff1559d700 (LWP 54941) "AwsEventLoop 1" 0x00007ffff713f947 in epoll_wait (epfd=37, events=0x7fff1559c950,
maxevents=100, timeout=100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
11 Thread 0x7ffed9cd2700 (LWP 54942) "cuda-EvtHandlr" 0x00007ffff7132bb9 in __GI___poll (fds=0x7ffed4000bd0, nfds=10,
timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
12 Thread 0x7ffed9270700 (LWP 54943) "python" 0x00007ffff7bc2065 in futex_abstimed_wait_cancelable (
private=<optimized out>, abstime=0x7ffed926fe80, expected=0, futex_word=0x55555b156768)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
55 Thread 0x7ffe91dda700 (LWP 54988) "jemalloc_bg_thd" 0x00007ffff7bc1ad3 in futex_wait_cancelable (
private=<optimized out>, expected=0, futex_word=0x7fff81016704) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
57 Thread 0x7ffe8f728700 (LWP 54991) "jemalloc_bg_thd" 0x00007ffff7bc1ad3 in futex_wait_cancelable (
private=<optimized out>, expected=0, futex_word=0x7fff810167d4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
* 84 Thread 0x7fff8eb84700 (LWP 55056) "EvntLoopCleanup" 0x00007fff82829b90 in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level (logger=0x7fff8290b2f0 <Aws::Utils::Logging::s_sdkCrtLogger>, subject=1025)
at /usr/local/src/conda/aws-sdk-cpp-1.11.156/src/aws-cpp-sdk-core/source/utils/logging/CRTLogging.cpp:73
(gdb) bt
#0 0x00007fff82829b90 in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level (
logger=0x7fff8290b2f0 <Aws::Utils::Logging::s_sdkCrtLogger>, subject=1025)
at /usr/local/src/conda/aws-sdk-cpp-1.11.156/src/aws-cpp-sdk-core/source/utils/logging/CRTLogging.cpp:73
#1 0x00007fff81e2e7d3 in s_destroy ()
from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#2 0x00007fff81e28786 in s_aws_event_loop_group_shutdown_sync ()
from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#3 0x00007fff81e287d9 in s_event_loop_destroy_async_thread_fn ()
from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#4 0x00007fff81fa9359 in thread_fn ()
from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1
#5 0x00007ffff7bbb6db in start_thread (arg=0x7fff8eb84700) at pthread_create.c:463
#6 0x00007ffff713f61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p *CRTLogSystem._M_ptr
$8 = {_vptr.CRTLogSystemInterface = 0x0}
So it looks like, in attempting to log something, the (static?) CRTLogSystem has already been torn down.
(I can attempt to build libaws-c-io with symbols as well to get more information.)
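For context on the shape of the race, here is a minimal sketch of a cleanup thread logging through a static logger that has already been destroyed during shutdown. This is not the SDK's actual code; the names (LogSystem, g_log_system, current_log_level) are made up for illustration only:

// Illustration only: a static-destruction-order race of the suspected shape.
// None of these names are real SDK symbols; this just mimics a cleanup thread
// logging through a logger whose owning static has already been destroyed.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>

struct LogSystem {
    virtual ~LogSystem() = default;
    virtual int GetLogLevel() const { return 3; }
};

// Owning static; its destructor runs during exit-time teardown.
static std::shared_ptr<LogSystem> g_log_system = std::make_shared<LogSystem>();

int current_log_level() {
    // If this runs after g_log_system's destructor, the virtual call goes
    // through a dangling pointer, the same shape as the frame above where
    // CRTLogSystem._M_ptr points at an object with a null vtable.
    return g_log_system->GetLogLevel();
}

int main() {
    std::thread cleanup([] {
        // Stand-in for the "EvntLoopCleanup" thread: keeps logging while the
        // process shuts down.
        for (;;) {
            std::printf("log level = %d\n", current_log_level());
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });
    cleanup.detach();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    // main returns while the detached thread is still running; static
    // destructors (including g_log_system's) race against its next call to
    // current_log_level().
    return 0;
}

Whether this sketch actually crashes depends on timing, which is consistent with the intermittent nature of the failure described above.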
Reproduction Steps
This is not that easy (I have not yet managed to minimise it at all, sorry!). It needs an NVIDIA GPU and a CUDA toolkit in the host environment, plus the NVIDIA Container Toolkit.
$ git clone https://github.com/dask/distributed
$ docker run --rm -it --gpus device=1 --network=host --volume $PWD:/workspace --workdir /workspace --entrypoint /bin/bash docker.io/gpuci/distributed:23.10-cuda11.5-devel-ubuntu18.04-py3.9
# in container
$ export PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH
$ export PARALLEL_LEVEL=${PARALLEL_LEVEL:-4}
$ cd /workspace
$ export DASK_DISTRIBUTED__ADMIN__SYSTEM_MONITOR__GIL__ENABLED=False
$ . /opt/conda/etc/profile.d/conda.sh
$ conda activate dask
$ python -m pip install -e .
$ python -m pip install git+https://github.com/dask/dask
# this might need to be run a few times
$ pytest distributed -v -m gpu --runslow -k 'ping_pong or transpose'
Possible Solution
No response
Additional Information/Context
SDK version 1.10.x doesn't exhibit this issue.
AWS CPP SDK version used
1.11.156
Compiler and Version used
g++ (conda-forge gcc 12.3.0-2) 12.3.0
Operating System and version
Ubuntu 18.04
Can you reproduce this on something other than an NVIDIA GPU? If so, can you provide the detailed reproduction steps for that?
I will give it a go and get back to you (it may take a little while), thanks!
@wence- has been busy with other tasks, but I finally succeeded in coming up with a reproducer that doesn't require GPUs. I'm not entirely sure the bug is the same, as the stack I see is somewhat different from the one in the original description:
(gdb) bt
#0 aws_mem_release (allocator=0x55c315d103b9, ptr=0x55c649b598d0) at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/crt/aws-c-common/source/allocator.c:210
#1 0x00007fbbf0accc55 in Aws::Crt::StlAllocator<char>::deallocate (this=0x55c649b598a0, this=0x55c649b598a0, p=<optimized out>)
at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/include/aws/crt/StlAllocator.h:56
#2 std::allocator_traits<Aws::Crt::StlAllocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/alloc_traits.h:345
#3 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_destroy (__size=<optimized out>, this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:245
#4 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:240
#5 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:237
#6 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::~basic_string (this=0x55c649b598a0, __in_chrg=<optimized out>)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:672
#7 std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > >::operator() (this=<optimized out>,
__ptr=0x55c649b598a0) at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#8 std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > > >::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#9 0x00007fbe99af3495 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fbe99af3610 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007fbe99ad7d97 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fbe99ad7e40 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x000055c646edb421 in _start ()
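This second trace appears to be hit during normal exit() teardown rather than during library unload: a string allocated through Aws::Crt::StlAllocator<char> is being destroyed after the CRT allocator it must return memory to has (apparently) already been shut down, so aws_mem_release is handed a stale allocator pointer. Conceptually it looks like the same teardown-ordering problem, hitting the allocator instead of the logger. A minimal sketch of that ordering, using made-up names (BackingAllocator, CrtLikeAllocator, g_cached_value) rather than the SDK's real ones:

// Illustration only: a static std::basic_string built on a custom allocator
// whose backing global state is shut down before the string's destructor
// runs at exit(). This mirrors the frames above, where
// StlAllocator<char>::deallocate hands a no-longer-valid allocator to
// aws_mem_release. None of these names are real SDK symbols.
#include <cstdlib>
#include <string>

struct BackingAllocator {
    std::size_t live_bytes = 0;
    void* acquire(std::size_t n) { live_bytes += n; return std::malloc(n); }
    void release(void* p, std::size_t n) { live_bytes -= n; std::free(p); }
};

// Global backing state; torn down by the atexit handler registered in main().
static BackingAllocator* g_backing = new BackingAllocator();

template <typename T>
struct CrtLikeAllocator {
    using value_type = T;
    CrtLikeAllocator() = default;
    template <typename U> CrtLikeAllocator(const CrtLikeAllocator<U>&) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(g_backing->acquire(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) {
        // By the time a static string is destroyed, g_backing may already be
        // gone; the equivalent of calling aws_mem_release on a dead allocator.
        g_backing->release(p, n * sizeof(T));
    }
};
template <typename T, typename U>
bool operator==(const CrtLikeAllocator<T>&, const CrtLikeAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CrtLikeAllocator<T>&, const CrtLikeAllocator<U>&) { return false; }

using CrtString = std::basic_string<char, std::char_traits<char>, CrtLikeAllocator<char>>;

// Static string long enough to force a heap allocation; its destructor runs
// late in exit-time teardown because it was constructed before main().
static CrtString g_cached_value(64, 'x');

int main() {
    // Handlers registered during main() run before destructors of statics
    // constructed before main(), so the backing allocator goes away first
    // and the string's deallocate typically crashes.
    std::atexit([] {
        delete g_backing;
        g_backing = nullptr;
    });
    return 0;
}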
To run the reproducer, a system with Docker available is required. First, save the snippet below as aws-segfault.dockerfile:
aws-segfault.dockerfile
FROM mambaorg/micromamba:jammy
# Install runtime, build and debug dependencies
RUN micromamba create -c conda-forge -n aws-segfault \
python=3.10 \
cuda-version=11.8 \
pyarrow=12.0.1 \
aws-sdk-cpp=1.11.156 \
pytest \
pytest-cov \
pytest-timeout \
cmake \
c-compiler \
cxx-compiler \
gcc_linux-64=11.* \
make \
gdb \
git
SHELL ["micromamba", "run", "-n", "aws-segfault", "/bin/bash", "-c"]
# Install distributed from source in developer mode
RUN cd ${HOME} \
&& git clone https://github.com/dask/distributed \
&& cd distributed \
&& git checkout 2023.9.2 \
&& pip install -e .
# Build and install aws-sdk-cpp from source with debug config
# Note: we must unset SCCACHE_BUCKET that is predefined by the base image
RUN unset SCCACHE_BUCKET \
&& cd ${HOME} \
&& git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp \
&& cd aws-sdk-cpp \
&& git checkout 1.11.181 \
&& mkdir build \
&& cd build \
&& cmake .. \
-DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_PREFIX_PATH=$CONDA_PREFIX \
-DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \
-DBUILD_ONLY="s3" \
-DAUTORUN_UNIT_TESTS=OFF \
-DENABLE_TESTING=OFF \
&& cmake --build . --config=Debug \
&& cmake --install . --config=Debug
COPY test_aws_segfault.py /home/mambauser/distributed/test_aws_segfault.py
WORKDIR /home/mambauser/distributed
And save the following as test_aws_segfault.py in the same local directory:
test_aws_segfault.py
import pytest

pytestmark = pytest.mark.gpu

from distributed import Client
from distributed.deploy.local import LocalCluster
from distributed.utils_test import gen_test


@gen_test()
async def test_aws_segfault():
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    da = pytest.importorskip("dask.array")
    async with LocalCluster(
        protocol="tcp", n_workers=2, threads_per_worker=2, dashboard_address=":0", asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True):
            assert cluster.scheduler_address.startswith("tcp://")
            x = da.ones((10000, 10000), chunks=(1000, 1000)).persist()
            await x
            y = (x + x.T).sum()
            await y
With both files present in the local directory, build the docker image:
docker build -t aws-segfault:latest -f aws-segfault.dockerfile .
Then start the container:
docker run --rm -ti aws-segfault:latest /bin/bash
Inside the container run:
micromamba activate aws-segfault
python -m pytest -vs test_aws_segfault.py
# Or run with gdb
# gdb --args python -m pytest -vs test_aws_segfault.py
# run
With the instructions above, I consistently got a segfault and did not need to run multiple times to reproduce it.
This should be fixed by these two PRs:
- https://github.com/aws/aws-sdk-cpp/pull/2859
- https://github.com/aws/aws-sdk-cpp/pull/2885
Please let us know if you're still running into this error.
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.
Thanks, we'll keep an eye out.