foundationdb icon indicating copy to clipboard operation
foundationdb copied to clipboard

Segfault in 7.1 with BUILD_AZURE_BACKUP turned on, need help to troubleshoot further

Open harikb opened this issue 2 years ago • 7 comments

Hello,

I built a version of foundationdb 7.1 branch to enable backup to Azure blobstore. I took the latest in the 7.1 branch and followed steps outlined here

Upon running fdbbackup, it segfaults. However, I am unable to obtain a meaningful location/trace. Looking for advice on how I should get more troubleshooting details.

NOTE: I replaced the binaries on top of a 7.12 release that was installed via dpkg. I am also looking for instructions to build a package from the compiled output (instead of copying over the binaries).

It was built with the following cmake options.

$ cmake -DNDEBUG=0 -DBUILD_AZURE_BACKUP=ON -S src/foundationdb -B build_output -G Ninja

// binaries installed and foundationdb restarted
// set sample data via fdbcli - db seems to work for trivial queries (no extensive testing done)

$ export AZURE_KEY= xxxxxxxxxxx
$ fdbbackup start --destcontainer "azure://............./fdbtest02/testing/" --log

SIGNAL: Segmentation fault (11)
Trace: addr2line -e fdbbackup.debug -p -C -f -i 0x7f7fec5bf420
Segmentation fault (core dumped)
$ addr2line -e $(which fdbbackup)  -p -C -f -i 0x7f7fec5bf420
?? ??:0

Last few lines of the local trace file

$ tail -n 4 trace.127.0.0.1.32356.1657571034.0ZQNi5.0.1.xml
<Event Severity="10" Time="1657571034.860529" DateTime="2022-07-11T20:23:54Z" Type="ConnectionExchangingConnectPacket" ID="87d456bd7f6fff03" SuppressedEventCount="0" PeerAddr="127.0.0.1:4500" ThreadID="4733494822528056024" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />
<Event Severity="10" Time="1657571034.861009" DateTime="2022-07-11T20:23:54Z" Type="ConnectionEstablished" ID="87d456bd7f6fff03" SuppressedEventCount="0" Peer="127.0.0.1:4500" ConnectionId="1" ThreadID="4733494822528056024" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />
<Event Severity="10" Time="1657571034.861009" DateTime="2022-07-11T20:23:54Z" Type="ConnectedOutgoing" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="127.0.0.1:4500" ThreadID="4733494822528056024" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />
<Event Severity="40" ErrorKind="BugDetected" Time="1657571035.184829" DateTime="2022-07-11T20:23:55Z" Type="Crash" ID="0000000000000000" Signal="11" Name="Segmentation fault" Trace="addr2line -e fdbbackup.debug -p -C -f -i 0x7f7fec5bf420" ThreadID="16319990330548784926" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xf310ac 0xf2fcc0 0xf300be 0xf05873 0x7f7fec5bf420" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />

$ file $(which fdbbackup)
/usr/bin/fdbbackup: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=2e6cc0a000f57ab5bfc6deeff93941b133386cb5, with debug_info, not stripped

harikb avatar Jul 11 '22 20:07 harikb

Here is some more info from my (probably misguided) attempt to run it with gdb

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6a9f700 (LWP 294882)]
[New Thread 0x7ffff625e700 (LWP 294883)]
[New Thread 0x7ffff5a3a700 (LWP 294884)]
[New Thread 0x7ffff5239700 (LWP 294885)]
[Thread 0x7ffff5239700 (LWP 294885) exited]

Thread 3 "fdbbackup" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff625e700 (LWP 294883)]
__GI___libc_free (mem=0x1) at malloc.c:3102
3102    malloc.c: No such file or directory.
(gdb) where
#0  __GI___libc_free (mem=0x1) at malloc.c:3102
#1  0x00000000010551aa in SSL_SESSION_free ()
#2  0x000000000104f899 in SSL_free ()
#3  0x00007ffff7f8b7cd in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#4  0x00007ffff7f8b825 in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#5  0x00007ffff7f498ce in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#6  0x00007ffff7f5e102 in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#7  0x00007ffff7f5e7fe in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#8  0x00007ffff7f5fd51 in curl_multi_perform () from /lib/x86_64-linux-gnu/libcurl.so.4
#9  0x00007ffff7f561bb in curl_easy_perform () from /lib/x86_64-linux-gnu/libcurl.so.4
#10 0x00000000010357dc in azure::storage_lite::CurlEasyRequest::perform (this=this@entry=0x7fffec000bd0)
    at fdbclient/azurestorage-src/src/http/libcurl_http_client.cpp:73
#11 0x0000000001036dc2 in azure::storage_lite::CurlEasyRequest::submit(std::function<void (int, azure::storage_lite::storage_istream, CURLcode)>, std::chrono::duration<long, std::ratio<1l, 1l> >) (this=this@entry=0x7fffec000bd0, cb=..., interval=...,
    interval@entry=...) at fdbclient/azurestorage-src/include/http/libcurl_http_client.h:97
#12 0x0000000000fee037 in azure::storage_lite::async_executor<void>::submit_helper (promise=..., outcome=..., account=...,
    request=..., http=..., context=..., retry=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:87
#13 0x0000000000feff16 in azure::storage_lite::async_executor<void>::submit (account=..., request=..., http=..., context=...)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/atomicity.h:96
#14 0x0000000000fe1309 in azure::storage_lite::blob_client::create_container (this=<optimized out>, container=...)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/atomicity.h:96
#15 0x0000000000d8c45a in BackupContainerAzureBlobStore::<lambda()>::operator()(void) const (__closure=0x1b65c70)
    at /tmp/src/foundationdb/fdbclient/BackupContainerAzureBlobStore.actor.cpp:338
#16 0x0000000000d8ccb0 in AsyncTaskThread::<lambda()>::operator() (this=0x1b65c68)
    at /tmp/src/foundationdb/fdbclient/AsyncTaskThread.h:46
#17 AsyncTask<AsyncTaskThread::execAsync(const F&, TaskPriority) [with F = BackupContainerAzureBlobStore::create()::<lambda()>; decltype (func()) = Void]::<lambda()> >::operator()(void) (this=0x1b65c60) at /tmp/src/foundationdb/fdbclient/AsyncTaskThread.h:46
#18 0x0000000000dc4df4 in AsyncTaskThread::run (self=0x1af31a8) at /tmp/src/foundationdb/fdbclient/AsyncTaskThread.actor.cpp:88
#19 0x000000000122060f in execute_native_thread_routine ()
#20 0x00007ffff7dac609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007ffff7ccf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

harikb avatar Jul 11 '22 20:07 harikb

Well, that was quick.... I discovered the segfault was caused by my using a path that had two components - "fdbtest02/testing/" - just putting "fdbtest03" seems to have created at least that part. I can confirm it is able to talk to Azure as I can see a directory created there. The backup still failed with a different error, but let me dig this some more..

Details:
 LogBytes written - 0
 RangeBytes written - 0
 Last complete log version and timestamp        - 1456970928727, 2022/07/11.20:40:47+0000
 Last complete snapshot version and timestamp   - N/A, N/A
 Current Snapshot start version and timestamp   - 1456970928727, 2022/07/11.20:40:47+0000
 Expected snapshot end version and timestamp    - 1456970928727, 2022/07/11.20:40:47+0000
 Backup supposed to stop at next snapshot completion - Yes
Recent Errors (since initialization)
23 second(s) ago : 'Backup Container URL invalid' on 'file_backup_write_logs_5.2'

harikb avatar Jul 11 '22 20:07 harikb

Error seem to be coming from the message - invalid URL prefix and it appears as per code that can only happen if BUILD_AZURE_BACKUP is not defined. Quite possible some binary/library hasn't been updated/replaced with the custom build. I have replaced all executable binaries and .so files, but something is still missing.

Coincidently, fdbbackup status also seem to want the AZURE_KEY even if I am merely trying to find the status of the backup.

    $ fdbbackup status
    [snip]
    BackupUID: f20c3287a008daf81d6c7067ac6b603c
    BackupURL: azure://.....[snip]...../fdbtest05
    Snapshot interval is 864000 seconds.  The initial snapshot is still running.
    
    Details:
     LogBytes written - 0
     RangeBytes written - 0
     Last complete log version and timestamp        - 1461196288219, 2022/07/11.22:27:00+0000
     Last complete snapshot version and timestamp   - N/A, N/A
     Current Snapshot start version and timestamp   - 1461196288219, 2022/07/11.22:27:00+0000
     Expected snapshot end version and timestamp    - 1461196288219, 2022/07/11.22:27:00+0000
     Backup supposed to stop at next snapshot completion - Yes
    Recent Errors (since initialization)
    47 second(s) ago : 'Backup Container URL invalid' on 'file_backup_write_logs_5.2'
    
    $ unset AZURE_KEY
    $ fdbbackup status
    ERROR: Backup error
    Fatal Error: Backup error

The trace for the above error, since it is unrelated, is logged here. We don't need to delve into it here.

harikb avatar Jul 11 '22 22:07 harikb

It seems not all code is enabled - there should ideally be a #define here. I will make that change and see if it makes progress.

harikb avatar Jul 11 '22 22:07 harikb

It seems the Segfault is caused at BackupContainerAzureBlobStore.actor.cpp:336 when the container already exists. However, the trace is well inside azure::storage_lite::CurlEasyRequest::perform . I also seeing other folks reporting similar bugs

https://github.com/Azure/azure-storage-cpplite/issues/87 https://github.com/Azure/azure-storage-cpplite/issues/114 https://github.com/Azure/azure-storage-cpplite/issues/115

harikb avatar Jul 15 '22 23:07 harikb

@harikb Looks like you fixed the issue, shall we close this?

sfc-gh-clin avatar Jul 28 '22 21:07 sfc-gh-clin

@sfc-gh-clin - I still see a repeat of the issue in a different environment. I am still debugging.

harikb avatar Jul 29 '22 15:07 harikb