
[CI][C++] Fix arrow-s3fs-test timeouts on macOS C++ job

Open amoeba opened this issue 1 year ago • 5 comments

Describe the bug, including details regarding any error messages, version, and platform.

The arrow-s3fs-test has been failing for a while on the "AMD64 macOS 12 C++" job, apparently due to a timeout.

A recent example is: https://github.com/apache/arrow/actions/runs/8179809257/job/22366589254?pr=40373#step:13:265 and the relevant output is:

2024-03-06T23:05:01.0379130Z 97/97 Test #73: arrow-s3fs-test ..............................***Timeout 300.04 sec
2024-03-06T23:05:01.0381760Z Running arrow-s3fs-test, redirecting output into /Users/runner/work/arrow/arrow/build/cpp/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
2024-03-06T23:05:01.0383120Z 
2024-03-06T23:05:01.0383360Z       Start 73: arrow-s3fs-test
2024-03-06T23:10:01.0805520Z     Test #73: arrow-s3fs-test ..............................***Timeout 300.04 sec
2024-03-06T23:10:01.0808220Z Running arrow-s3fs-test, redirecting output into /Users/runner/work/arrow/arrow/build/cpp/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
2024-03-06T23:10:01.0810650Z 
2024-03-06T23:10:01.0810890Z       Start 73: arrow-s3fs-test
2024-03-06T23:15:01.1167650Z     Test #73: arrow-s3fs-test ..............................***Timeout 300.04 sec
2024-03-06T23:15:01.1171300Z Running arrow-s3fs-test, redirecting output into /Users/runner/work/arrow/arrow/build/cpp/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)
2024-03-06T23:15:01.1172580Z 
2024-03-06T23:15:01.1175960Z 
2024-03-06T23:15:01.1176500Z 99% tests passed, 1 tests failed out of 97
2024-03-06T23:15:01.1176970Z 
2024-03-06T23:15:01.1177690Z Label Time Summary:
2024-03-06T23:15:01.1178690Z arrow-compute-tests    =  89.33 sec*proc (13 tests)
2024-03-06T23:15:01.1179450Z arrow-tests            = 449.67 sec*proc (41 tests)
2024-03-06T23:15:01.1183940Z arrow_acero            = 148.43 sec*proc (13 tests)
2024-03-06T23:15:01.1184610Z arrow_dataset          = 101.76 sec*proc (14 tests)
2024-03-06T23:15:01.1185230Z Errors while running CTest
2024-03-06T23:15:01.1185750Z arrow_flight           =  14.60 sec*proc (2 tests)
2024-03-06T23:15:01.1186400Z arrow_substrait        =   1.58 sec*proc (1 test)
2024-03-06T23:15:01.1187050Z filesystem             = 334.44 sec*proc (5 tests)
2024-03-06T23:15:01.1187830Z gandiva-tests          =  28.20 sec*proc (4 tests)
2024-03-06T23:15:01.1188560Z parquet-tests          =  59.08 sec*proc (9 tests)
2024-03-06T23:15:01.1189210Z unittest               = 892.66 sec*proc (97 tests)
2024-03-06T23:15:01.1189650Z 
2024-03-06T23:15:01.1189820Z Total Test time (real) = 1008.26 sec
2024-03-06T23:15:01.1190170Z 
2024-03-06T23:15:01.1190330Z The following tests FAILED:
2024-03-06T23:15:01.1190830Z 	 73 - arrow-s3fs-test (Timeout)

gist for posterity.

I expect this causes a lot of noise for maintainers, so figuring out why this test is timing out would be good.

Component(s)

C++, Continuous Integration

amoeba commented Mar 07 '24 21:03

When running this test locally on my non-AMD64 Mac, I actually get three failures:

[ RUN      ] TestS3FS.CreateDir
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
Failed
Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
[  FAILED  ] TestS3FS.CreateDir (37 ms)
[ RUN      ] TestS3FSGeneric.CopyFile
Warning: The standard parity is set to 0. This can lead to data loss.
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/test_util.cc:570: Failure
Failed
Expected 'fs->CopyFile("AB/abc", "def/mno")' to fail with IOError, but got OK
[  FAILED  ] TestS3FSGeneric.CopyFile (66 ms)
[ RUN      ] TestS3FSGeneric.MoveFile
Warning: The standard parity is set to 0. This can lead to data loss.
/Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/test_util.cc:448: Failure
Failed
Expected 'fs->Move("AB/pqr", "xxx/mno")' to fail with IOError, but got OK
[  FAILED  ] TestS3FSGeneric.MoveFile (81 ms)

I see that in this job we run ctest with:

ctest --label-regex unittest --output-on-failure --parallel 4 --repeat until-pass:3 --timeout 300

My hunch is that the test is failing and ctest is retrying. My local machine is fast enough to get all 3 repeats in before the timeout, but it's possible GitHub's runner is slow enough that the timeout happens first.
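
To check that hunch, one could run just this test locally with the same retry and timeout settings (a sketch, assuming it is run from the cpp build directory; the test name regex may need adjusting for the local build):

ctest --tests-regex arrow-s3fs-test --output-on-failure --repeat until-pass:3 --timeout 300

Timing a failing run under these settings should show how the retries and the per-attempt 300-second timeout interact on a given machine.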

Another thing I noticed is that, when ctest automatically re-attempts the test as requested, it still prints "attempt 1/1" each time:

Running arrow-s3fs-test, redirecting output into /Users/bryce/src/apache/arrow/cpp/build/build/test-logs/arrow-s3fs-test.txt (attempt 1/1)

amoeba commented Mar 07 '24 21:03

I reverted my minio and mc binaries to the ones we pin for the tests and the above failures went away, so I don't think that's what's going on. Though the above errors may be something to be aware of if we ever need to upgrade minio/mc.

amoeba commented Mar 07 '24 22:03

Relevant link: https://github.com/apache/arrow/pull/34671

felipecrv commented Mar 07 '24 22:03

I haven't been able to reproduce locally (without modifying anything), but I have been able to reproduce it once on CI with a stripped-down workflow in this run: https://github.com/amoeba/arrow/actions/runs/8194916281/job/22416431654. So I'm going to try debugging that with tmate.

amoeba commented Mar 08 '24 00:03

Good luck with that; my attempts at reproducing while logged in via tmate actually failed here: https://github.com/apache/arrow/pull/40401

pitrou commented Mar 10 '24 10:03

I'm having success reproducing the timeout by SSH'ing into GHA, so this has now turned into figuring out how to get useful information out of the hanging test with dtrace. I did a bit of searching to find out how to get thread stack traces from a running process and ended up trying sudo dtrace -p $PID -n 'profile-997 /pid == '$PID'/ { @[ustack(100)] = count();}' -o somefile, which I think is close but not quite what I'm after. It behaves like it captures just a single stack trace, so I'm going to keep experimenting.

But so far, it produces this more often than not, which is interesting:

libsystem_kernel.dylib`__pipe+0xa
libcurl.4.dylib`curl_easy_perform+0x77
libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const+0xda6
libaws-cpp-sdk-core.dylib`std::__1::__function::__func<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24, std::__1::allocator<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24>, std::__1::shared_ptr<Aws::Http::HttpResponse> ()>::operator()()+0x26
libaws-cpp-sdk-core.dylib`std::__1::shared_ptr<Aws::Http::HttpResponse> smithy::components::tracing::TracingUtils::MakeCallWithTiming<std::__1::shared_ptr<Aws::Http::HttpResponse> >(std::__1::function<std::__1::shared_ptr<Aws::Http::HttpResponse> ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x48
libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const+0x779
libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const+0x43b
libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const+0x42
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(Aws::S3::Model::HeadBucketRequest const&)+0xf2
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0xb8
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegionUncached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0xda
libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x12c
libarrow.1600.0.0.dylib`arrow::fs::ResolveS3BucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+0x265a
libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x
libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x6a
libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run()+0x
libarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run()+0x104
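
For what it's worth, a time-boxed variant of the dtrace one-liner above that keeps aggregating stack samples and only prints them on exit might look like this (a sketch; $PID is the hanging test's process id and stacks.txt is an arbitrary output file):

sudo dtrace -p $PID -n 'profile-997 /pid == $target/ { @[ustack(100)] = count(); } tick-30s { exit(0); }' -o stacks.txt

Here $target is dtrace's built-in macro for the process grabbed with -p, so the pid doesn't need to be spliced into the script with shell quoting.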

amoeba commented Mar 14 '24 19:03

More easily perhaps, you can use gdb -p to take control of a running process and debug it. lldb has something similar.
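
For example, a non-interactive attach that dumps all thread stacks and then detaches could look like this (a sketch; $PID is the hanging test's process id):

lldb -p $PID --batch -o 'bt all' -o 'detach'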

pitrou commented Mar 14 '24 21:03

Thanks. bt all gives something similar. To catch the test as it approached the timeout, I first got a baseline for how long it normally takes to complete, which was pretty consistently ~25s. I then ran the test by itself until a run took longer than that, waited until ~60s, attached using lldb -p, and ran bt all (a rough sketch of this loop is included after the backtrace):

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8179a40aa libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007ff8179e0c95 libsystem_pthread.dylib`_pthread_join + 358
    frame #2: 0x00007ff82c299cda libcurl.4.dylib`Curl_thread_join + 23
    frame #3: 0x00007ff82c28c45c libcurl.4.dylib`Curl_resolver_kill + 41
    frame #4: 0x00007ff82c2c21a0 libcurl.4.dylib`multi_done + 75
    frame #5: 0x00007ff82c2c5876 libcurl.4.dylib`multi_handle_timeout + 291
    frame #6: 0x00007ff82c2c4254 libcurl.4.dylib`multi_runsingle + 3542
    frame #7: 0x00007ff82c2c3391 libcurl.4.dylib`curl_multi_perform + 212
    frame #8: 0x00007ff82c29e7eb libcurl.4.dylib`curl_easy_perform + 299
    frame #9: 0x0000000102139df2 libaws-cpp-sdk-core.dylib`Aws::Http::CurlHttpClient::MakeRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const + 3494
    frame #10: 0x000000010212b61c libaws-cpp-sdk-core.dylib`std::__1::__function::__func<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24, std::__1::allocator<Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const::$_24>, std::__1::shared_ptr<Aws::Http::HttpResponse> ()>::operator()() + 38
    frame #11: 0x0000000102116e78 libaws-cpp-sdk-core.dylib`std::__1::shared_ptr<Aws::Http::HttpResponse> smithy::components::tracing::TracingUtils::MakeCallWithTiming<std::__1::shared_ptr<Aws::Http::HttpResponse> >(std::__1::function<std::__1::shared_ptr<Aws::Http::HttpResponse> ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 72
    frame #12: 0x0000000102113883 libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const + 1913
    frame #13: 0x00000001021116af libaws-cpp-sdk-core.dylib`Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 1083
    frame #14: 0x000000010211c612 libaws-cpp-sdk-core.dylib`Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const + 66
    frame #15: 0x0000000120bbb4d2 libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(Aws::S3::Model::HeadBucketRequest const&) + 242
    frame #16: 0x0000000120bbb368 libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::S3Client::GetBucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 184
    frame #17: 0x0000000120bbac9a libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegionUncached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 218
    frame #18: 0x0000000120b56d3c libarrow.1600.0.0.dylib`arrow::fs::(anonymous namespace)::RegionResolver::ResolveRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 300
    frame #19: 0x0000000120b49ab5 libarrow.1600.0.0.dylib`arrow::fs::ResolveS3BucketRegion(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 613
    frame #20: 0x0000000100c7cfd4 arrow-s3fs-test`arrow::fs::S3RegionResolutionTest_NonExistentBucket_Test::TestBody() + 68
    frame #21: 0x00000001026db47b libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 123
    frame #22: 0x000000010269dcba libarrow_gtestd.1.11.0.dylib`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 106
    frame #23: 0x000000010269dc03 libarrow_gtestd.1.11.0.dylib`testing::Test::Run() + 195
    frame #24: 0x000000010269ed81 libarrow_gtestd.1.11.0.dylib`testing::TestInfo::Run() + 241
    frame #25: 0x000000010269fe24 libarrow_gtestd.1.11.0.dylib`testing::TestSuite::Run() + 260
    frame #26: 0x00000001026af16b libarrow_gtestd.1.11.0.dylib`testing::internal::UnitTestImpl::RunAllTests() + 987
    frame #27: 0x00000001026e04bb libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 123
    frame #28: 0x00000001026aeb2a libarrow_gtestd.1.11.0.dylib`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 106
    frame #29: 0x00000001026aea15 libarrow_gtestd.1.11.0.dylib`testing::UnitTest::Run() + 197
    frame #30: 0x00000001011e0f21 libarrow_gtest_maind.1.11.0.dylib`RUN_ALL_TESTS() + 17
    frame #31: 0x00000001011e0efd libarrow_gtest_maind.1.11.0.dylib`main + 61
    frame #32: 0x000000010afe352e dyld`start + 462
  thread #2
    frame #0: 0x00007ff8179a401a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007ff8179dc034 libsystem_pthread.dylib`_pthread_wqthread + 426
    frame #2: 0x00007ff8179daf57 libsystem_pthread.dylib`start_wqthread + 15
  thread #3, name = 'AwsEventLoop 1'
    frame #0: 0x00007ff8179a730e libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000101fe99ec libaws-c-io.1.0.0.dylib`aws_event_loop_thread + 404
    frame #2: 0x00000001024ffd3d libaws-c-common.1.0.0.dylib`thread_fn + 345
    frame #3: 0x00007ff8179df4e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #4: 0x00007ff8179daf6b libsystem_pthread.dylib`thread_start + 15
  thread #4
    frame #0: 0x00007ff8179a730e libsystem_kernel.dylib`kevent + 10
    frame #1: 0x00007ff817a052ee libsystem_info.dylib`_mdns_search_ex + 1217
    frame #2: 0x00007ff817a07c1c libsystem_info.dylib`mdns_addrinfo + 483
    frame #3: 0x00007ff817a079e4 libsystem_info.dylib`search_addrinfo + 164
    frame #4: 0x00007ff817a00659 libsystem_info.dylib`si_addrinfo + 1811
    frame #5: 0x00007ff8179ffeac libsystem_info.dylib`getaddrinfo + 176
    frame #6: 0x00007ff82c296186 libcurl.4.dylib`Curl_getaddrinfo_ex + 36
    frame #7: 0x00007ff82c28ca24 libcurl.4.dylib`getaddrinfo_thread + 83
    frame #8: 0x00007ff82c299c8d libcurl.4.dylib`curl_thread_create_thunk + 29
    frame #9: 0x00007ff8179df4e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #10: 0x00007ff8179daf6b libsystem_pthread.dylib`thread_start + 15
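
For anyone who wants to try the same thing, a rough sketch of that run-until-it-hangs loop might be (the test binary path is hypothetical and depends on the local build layout; 60s is just comfortably past the ~25s baseline):

# Keep launching arrow-s3fs-test; if a run is still alive well past the baseline,
# stop looping so the hung process can be inspected with lldb -p.
while true; do
  ./debug/arrow-s3fs-test > /dev/null 2>&1 &
  pid=$!
  sleep 60
  if kill -0 "$pid" 2>/dev/null; then
    echo "PID $pid still running after 60s; attach with: lldb -p $pid"
    break
  fi
  wait "$pid"   # reap the finished run before starting the next one
done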

amoeba commented Mar 14 '24 23:03

Thanks a lot @amoeba. It therefore appears that it is timing out while trying to join a thread in Curl_thread_join. This is quite unexpected, though there seems to be a similar issue reported on the Internet: https://github.com/nim-lang/Nim/issues/15019#issuecomment-661208069

Edit: on second look, the sentence above is incorrect. There is a thread pending in Curl_getaddrinfo_ex, which means the DNS request is still ongoing. Presumably DNS requests on macOS can take a long time to time out?

pitrou commented Mar 15 '24 12:03

It would be nice if you could try again to catch another backtrace, so that we can see if it's always the same test timing out. If it is, we could just disable that test on macOS...

pitrou commented Mar 15 '24 12:03

Even better: can you try getting a backtrace later in the hang? It seems the DNS resolution timeout on macOS is 30 seconds.
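
One low-effort way to do that might be to wait until a hung run is well past the ~30-second mark and then sample every thread again, e.g. (a sketch; pgrep -n picks the newest matching process, and the duration and output file are arbitrary):

sample $(pgrep -n arrow-s3fs-test) 10 -file late-stacks.txt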

pitrou commented Mar 15 '24 13:03