curl_share_cleanup() not checked for errors
Describe the bug
On Apache Arrow we're getting Valgrind failures that happen in our Azure filesystem layer, which relies on the Azure SDK for C++.
According to Valgrind, the leak is in memory allocated by curl_share_init.
The Azure SDK does wrap the pointer returned by curl_share_init it in a smart pointer-like class CURLSHWrapper. However, the CURLSHWrapper destructor does not account for the fact that curl_share_cleanup can fail: not only according to the docs, but it can concretely return CURLSHE_IN_USE if the shared handle is still in use (by some ongoing operation perhaps?).
I don't know if that's concretely the problem in our case, but it seems the Azure SDK should at least check for errors returned by curl_share_cleanup and log an error somewhere.
Exception or Stack Trace
I'm pasting one stack trace for reference, but the other ones look similar:
==17420== 39,976 (928 direct, 39,048 indirect) bytes in 4 blocks are definitely lost in loss record 43 of 43
==17420== at 0x49C6B9F: calloc (vg_replace_malloc.c:1675)
==17420== by 0x7173253: curl_share_init (in /opt/conda/envs/arrow/lib/libcurl.so.4.8.0)
==17420== by 0x5EE4A33: Azure::Core::Http::CurlConnection::CurlConnection(Azure::Core::Http::Request&, Azure::Core::Http::CurlTransportOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::chrono::duration<long, std::ratio<1l, 1000l> >) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5EE729A: Azure::Core::Http::_detail::CurlConnectionPool::ExtractOrCreateCurlConnection(Azure::Core::Http::Request&, Azure::Core::Http::CurlTransportOptions const&, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5EE7B03: Azure::Core::Http::CurlTransport::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F12F6D: Azure::Core::Http::Policies::_internal::TransportPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F06E1D: Azure::Core::Http::Policies::_internal::LogPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F0DCAB: Azure::Core::Http::Policies::_internal::RequestActivityPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x4BB254D: Azure::Storage::_internal::SharedKeyPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-storage-files-datalake.so)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x64BE701: Azure::Storage::_internal::StoragePerRetryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-storage-common.so.12.10.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x64BF864: Azure::Storage::_internal::StorageSwitchToSecondaryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-storage-common.so.12.10.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F11250: Azure::Core::Http::Policies::_internal::RetryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F11D3B: Azure::Core::Http::Policies::_internal::TelemetryPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x4BB2816: Azure::Core::Http::Policies::_internal::RequestIdPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-storage-files-datalake.so)
==17420== by 0x5F07BDA: Azure::Core::Http::Policies::NextHttpPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-core.so.1.16.0)
==17420== by 0x4BB18AF: Azure::Storage::_internal::StorageServiceVersionPolicy::Send(Azure::Core::Http::Request&, Azure::Core::Http::Policies::NextHttpPolicy, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-storage-files-datalake.so)
==17420== by 0x4BFDFA3: Azure::Storage::Files::DataLake::_detail::PathClient::GetAccessControlList(Azure::Core::Http::_internal::HttpPipeline&, Azure::Core::Url const&, Azure::Storage::Files::DataLake::_detail::PathClient::GetPathAccessControlListOptions const&, Azure::Core::Context const&) (in /opt/conda/envs/arrow/lib/libazure-storage-files-datalake.so)
==17420== by 0x4BC45F9: Azure::Storage::Files::DataLake::DataLakePathClient::GetAccessControlList(Azure::Storage::Files::DataLake::GetPathAccessControlListOptions const&, Azure::Core::Context const&) const (in /opt/conda/envs/arrow/lib/libazure-storage-files-datalake.so)
==17420== by 0x5BA501B: arrow::fs::internal::CheckIfHierarchicalNamespaceIsEnabled(Azure::Storage::Files::DataLake::DataLakeFileSystemClient const&, arrow::fs::AzureOptions const&) (azurefs.cc:1363)
==17420== by 0x40B6E76: arrow::fs::TestAzuriteFileSystem_CheckIfHierarchicalNamespaceIsEnabledTransportError_Test::TestBody() (azurefs_test.cc:2426)
==17420== by 0x5E9A14F: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2653)
==17420== by 0x5EA8B45: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2689)
==17420== by 0x5EA8D65: testing::Test::Run() (gtest.cc:2728)
==17420== by 0x5EA90BB: testing::TestInfo::Run() (gtest.cc:2874)
==17420== by 0x5EB1336: testing::TestSuite::Run() (gtest.cc:3052)
==17420== by 0x5EB3D8C: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:6004)
==17420== by 0x5E9A6A3: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2653)
==17420== by 0x5EA932C: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2689)
==17420== by 0x5EA94F7: testing::UnitTest::Run() (gtest.cc:5583)
==17420== by 0x4B61841: RUN_ALL_TESTS() (gtest.h:2334)
==17420== by 0x4B6187E: main (gtest_main.cc:64)
==17420==
Setup (please complete the following information):
- Azure SDK 1.16.0
Additional context Add any other context about the problem here.
Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report
- [x] Bug Description Added
- [ ] Repro Steps Added
- [x] Setup information Added
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @antkmsft @LarryOsterman @rickwinter.
I've stepped through curl_share_cleanup using gdb, and I've determined that curl_share_cleanup does indeed return CURLSHE_IN_USE in one of the affected code paths (haven't looked at the couple others). So the Azure SDK tries to close and deallocate a CURLSH that's still in use.
Note that some of the affected code paths are quite simple. The one in the stack trace above is calling Azure::Storage::Files::DataLake::DataLakePathClient::GetAccessControlList on an unresponding endpoint and just catches the Azure::Core::Http::TransportException that is thrown.