Go (golang bindings): 8 database connections at once causes CGO signal and panic
go version go1.23.3 linux/amd64
I'm given to understand from reading the FoundationDB documentation[1][2] that the key to getting acceptable write throughput is to write through many connections to the database in parallel.
However: when I try to do this, I saw increasing throughput going from 1 to 2, and from 2 to 4 connections; but I get consistent panics of the "CGO received a signal" type when I try with 8 database connections.
See the crash logs linked below and the repo's kvfile/ directory for a reproducer (run the tests).
https://github.com/glycerine/fdb_demo_golang https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.macos.txt https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.txt https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log2.txt
run go test -v in the ./kvfile directory of the fdb_demo_golang repo to reproduce the panic. The Test002 test in particular; you can change the database pool size here: https://github.com/glycerine/fdb_demo_golang/blob/master/kvfile/db_test.go#L97
Thanks, Jason
@johscheuer because other Go issues here mention you (but apologies and please re-direct if this is not your area now)
references [1] https://apple.github.io/foundationdb/features.html#concurrent-connections
"FoundationDB is able to handle large numbers of concurrent client connections. Because it uses a threadless communications and concurrency model, FoundationDB does not have to create a thread per connection. This allows full performance even with hundreds of thousands of in-flight requests."
[2] https://apple.github.io/foundationdb/performance.html
"Its asynchronous design allows it to handle very high concurrency, and for a typical workload with 90% reads and 10% writes, maximum throughput is reached at about 200 concurrent operations. This number of operations was achieved with 20 concurrent transactions per FoundationDB process each running 10 operations with 16 byte keys and values between 8 and 100 bytes."
"The implication of this [Little's Law] relation is that, at a given latency, we can maximize throughput only by concurrently submitting enough outstanding requests. A FoundationDB cluster might have a commit latency of 2 ms and yet be capable of far more than 500 commits per second. In fact, tens of thousands of commits per second are easily achievable. To achieve this rate, there must be hundreds of requests happening concurrently. Not having enough pending requests is the single biggest reason for low performance."
details of my fdb setup: It is version 7.3.57, the latest, running on just one node. The client (Go program) and server are co-located on the same node.
$ fdbcli --version
FoundationDB CLI 7.3 (v7.3.57)
source version 63035b5c3ecb65a615126cf3d62e36df58a9c994
protocol fdb00b073000000
jaten@rog ~/go/src/github.com/glycerine/fdb_demo_golang (master) $ fdbcli --exec status
fdbcli --exec status
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - single
Storage engine - ssd-redwood-1
Log engine - ssd-2
Encryption at-rest - disabled
Coordinators - 1
Desired Commit Proxies - 3
Desired GRV Proxies - 1
Desired Resolvers - 1
Desired Logs - 3
Usable Regions - 1
Cluster:
FoundationDB processes - 1
Zones - 1
Machines - 1
Memory availability - 8.0 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 12/31/24 21:12:00
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 315 MB
Disk space used - 5.184 GB
Operating space:
Storage server - 749.0 GB free on most full server
Log server - 748.8 GB free on most full server
Workload:
Read rate - 21 Hz
Write rate - 0 Hz
Transactions started - 6 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Client time: 12/31/24 21:12:00
The crash seems to be happening on a finalizer for a tenant, as the trace shows tenant.go:182, being
func (t *tenant) destroy() {
if t.ptr == nil {
return
}
C.fdb_tenant_destroy(t.ptr) // line 182 of tenant.go; we are in this call when we segfault
}
// file version: tenant.go<[email protected]>
db.go:272 2024-12-31T21:52:53.052-06:00 goro 2 done
SIGSEGV: segmentation violation
PC=0x7f1ac7820e03 m=12 sigcode=1 addr=0x40
signal arrived during cgo execution
goroutine 5 gp=0xc000007c00 m=12 mp=0xc000500008 [syscall]:
runtime.cgocall(0x580510, 0xc0000b45a8)
/usr/local/go1.23.3/src/runtime/cgocall.go:167 +0x4b fp=0xc0000b4580 sp=0xc0000b4548 pc=0x46ba0b
github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_tenant_destroy(0x7f1a74006bc0)
_cgo_gotypes.go:507 +0x3f fp=0xc0000b45a8 sp=0xc0000b4580 pc=0x5573ff
github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy.func1(0xc0000b4690?)
/home/jaten/go/pkg/mod/github.com/apple/foundationdb/bindings/[email protected]/src/fdb/tenant.go:182 +0x3e fp=0xc0000b45e8 sp=0xc0000b45a8 pc=0x55ab3e
github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy(0x0?)
/home/jaten/go/pkg/mod/github.com/apple/foundationdb/bindings/[email protected]/src/fdb/tenant.go:182 +0x19 fp=0xc0000b4600 sp=0xc0000b45e8 pc=0x55aad9
runtime.call16(0x0, 0x5db2f8, 0xc0002c0000, 0x10, 0x10, 0x10, 0xc0000b4690)
/usr/local/go1.23.3/src/runtime/asm_amd64.s:775 +0x43 fp=0xc0000b4620 sp=0xc0000b4600 pc=0x477663
runtime.runfinq()
/usr/local/go1.23.3/src/runtime/mfinal.go:255 +0x3f1 fp=0xc0000b47e0 sp=0xc0000b4620 pc=0x419c71
runtime.goexit({})
running under gdb gives the following stacktrace, indicating the problem is in the fdb_transaction_set_option() C function.
...
db.go:266 2024-12-31T22:10:12.636-06:00 goro 2 done
db.go:266 2024-12-31T22:10:12.637-06:00 goro 1 done
Thread 3 "kvfile.test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffae0e1700 (LWP 27756)]
0x0000000000000000 in ?? ()
(gdb) bt
bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff77abe06 in fdb_transaction_set_option () from /usr/lib/libfdb_c.so
#2 0x00007ffff77abefe in fdb_transaction_set_option () from /usr/lib/libfdb_c.so
#3 0x00007ffff6ea327a in fdb_tenant_destroy () from /usr/lib/libfdb_c.so
#4 0x0000000000478da4 in runtime.asmcgocall () at /usr/local/go1.23.3/src/runtime/asm_amd64.s:923
#5 0x000000c000184700 in ?? ()
#6 0x0000000000000000 in ?? ()
(gdb)
To the best I can figure, this is the code being called
fdb_c.cpp:914 at git tag 7.3.57:
extern "C" DLLEXPORT void fdb_tenant_destroy(FDBTenant* tenant) {
try {
TENANT(tenant)->delref();
} catch (...) {
}
}
and then somehow, for some odd reason, a transaction option is being invoked??? I'm not sure how the delref() gets us to fdb_transaction_set_option_impl, so I'm mystified. Maybe the stack was corrupted and the backtrace is wrong?
fdb_c.cpp:1196 at git tag 7.3.57:
fdb_error_t fdb_transaction_set_option_impl(FDBTransaction* tr,
FDBTransactionOption option,
uint8_t const* value,
int value_length) {
CATCH_AND_RETURN(TXN(tr)->setOption((FDBTransactionOptions::Option)option,
value ? StringRef(value, value_length) : Optional<StringRef>()););
}
running on macos under lldb (the only place I could get the library to build to turn on debug mode build), I see
...
db.go:266 2025-01-01T11:13:15.670Z goro 2 done
db.go:266 2025-01-01T11:13:15.699Z goro 6 done
Process 48043 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x5fbbbdb0bc00)
frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x000060000218bbc0) at FastRef.h:95:7
92
93 template <class P>
94 void delref(P* ptr) {
-> 95 ptr->delref();
96 }
97
98 template <class P>
Target 0: (kvfile.test) stopped.
(lldb)
so maybe it is simply that the cast to ITenant is bad? Still not clear to me.
the backtrace from there:
(lldb) (lldb) bt
bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x5fbbbdb\
0bc00)
* frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x000060000218bbc\
0) at FastRef.h:95:7
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x0000\
6000034ac850) at FastRef.h:126:4
frame #2: 0x00000001070f82f5 libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x0000\
6000034ac850) at FastRef.h:124:15
frame #3: 0x00000001070fcebf libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x0000600003\
4ac840) at ThreadSafeTransaction.cpp:357:1
frame #4: 0x00000001070fcfb5 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x0000600003\
4ac840) at ThreadSafeTransaction.cpp:353:39
frame #5: 0x00000001070fcfd9 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x0000600003\
4ac840) at ThreadSafeTransaction.cpp:353:39
frame #6: 0x0000000107104ca6 libfdb_c.dylib`ThreadSafeReferenceCounted<ThreadSafeTenant>::delref(\
this=0x00006000034ac848) const at FastRef.h:51:4
frame #7: 0x0000000107103d29 libfdb_c.dylib`ThreadSafeTenant::delref(this=0x00006000034ac840) at \
ThreadSafeTransaction.h:115:73
frame #8: 0x0000000105f67579 libfdb_c.dylib`fdb_tenant_destroy(tenant=0x00006000034ac840) at fdb_\
c.cpp:916:19
frame #9: 0x0000000100077f44 kvfile.test`runtime.asmcgocall.abi0 + 100
frame #10: 0x000000010006a555 kvfile.test`runtime.cgocall + 117
frame #11: 0x000000010015055f kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb._Cfun\
c_fdb_tenant_destroy.abi0 + 63
frame #12: 0x0000000100153c9e kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy.func1 + 62
frame #13: 0x0000000100153c39 kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy + 25
frame #14: 0x00000001000767c3 kvfile.test`runtime.call16.abi0 + 67
frame #15: 0x0000000100018f11 kvfile.test`runtime.runfinq + 1009
frame #16: 0x00000001000782a1 kvfile.test`runtime.goexit.abi0 + 1
(lldb)
going up one stack frame from the halt shows
(lldb) up
up
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:126:4
123
124 ~Reference() {
125 if (ptr)
-> 126 delref(ptr);
127 }
128 Reference& operator=(const Reference& r) {
129 P* oldPtr = ptr;
(lldb) (lldb)
and then up again
(lldb) (lldb) up
up
frame #2: 0x00000001070f82f5 libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:124:15
121 r.setPtrUnsafe(nullptr);
122 }
123
-> 124 ~Reference() {
125 if (ptr)
126 delref(ptr);
127 }
(lldb) (lldb)
and again...
(lldb) (lldb) up
up
frame #3: 0x00000001070fcebf libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x00006000034ac8\
40) at ThreadSafeTransaction.cpp:357:1
354 Tenant* t = this->tenant;
355 if (t)
356 onMainThreadVoid([t]() { t->delref(); });
-> 357 }
358
359 ThreadSafeTransaction::ThreadSafeTransaction(DatabaseContext* cx,
360 ISingleThreadTransaction::Type type,
(lldb) (lldb)
(lldb) (lldb) up
up
frame #4: 0x00000001070fcfb5 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x00006000034ac8\
40) at ThreadSafeTransaction.cpp:353:39
350 });
351 }
352
-> 353 ThreadSafeTenant::~ThreadSafeTenant() {
354 Tenant* t = this->tenant;
355 if (t)
356 onMainThreadVoid([t]() { t->delref(); });
(lldb) (lldb)
(lldb) (lldb) up
up
frame #5: 0x00000001070fcfd9 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x00006000034ac8\
40) at ThreadSafeTransaction.cpp:353:39
350 });
351 }
352
-> 353 ThreadSafeTenant::~ThreadSafeTenant() {
354 Tenant* t = this->tenant;
355 if (t)
356 onMainThreadVoid([t]() { t->delref(); });
(lldb) (lldb) up
up
frame #6: 0x0000000107104ca6 libfdb_c.dylib`ThreadSafeReferenceCounted<ThreadSafeTenant>::delref(this\
=0x00006000034ac848) const at FastRef.h:51:4
48 }
49 void delref() const {
50 if (delref_no_destroy())
-> 51 delete (Subclass*)this;
52 }
53 void setrefCountUnsafe(int32_t count) const { referenceCount.store(count); }
54 int32_t debugGetReferenceCount() const { return referenceCount.load(); }
(lldb) (lldb) up
frame #7: 0x0000000107103d29 libfdb_c.dylib`ThreadSafeTenant::delref(this=0x00006000034ac840) at Thre\
adSafeTransaction.h:115:73
112 ThreadFuture<bool> flushBlobRange(const KeyRangeRef& keyRange, bool compact, Optional\
<Version> version) override;
113
114 void addref() override { ThreadSafeReferenceCounted<ThreadSafeTenant>::addref(); }
-> 115 void delref() override { ThreadSafeReferenceCounted<ThreadSafeTenant>::delref(); }
116
117 private:
118 Reference<ThreadSafeDatabase> db;
(lldb) up
frame #8: 0x0000000105f67579 libfdb_c.dylib`fdb_tenant_destroy(tenant=0x00006000034ac840) at fdb_c.cp\
p:916:19
913
914 extern "C" DLLEXPORT void fdb_tenant_destroy(FDBTenant* tenant) {
915 try {
-> 916 TENANT(tenant)->delref();
917 } catch (...) {
918 }
919 }
(lldb) up
frame #9: 0x0000000100077f44 kvfile.test`runtime.asmcgocall.abi0 + 100
kvfile.test`runtime.asmcgocall.abi0:
-> 0x100077f44 <+100>: movq 0x8(%rsp), %rdi
0x100077f49 <+105>: movq 0x8(%rdi), %rsi
0x100077f4d <+109>: subq (%rsp), %rsi
0x100077f51 <+113>: movq %rdi, %gs:0x30
(lldb) (lldb) up
up
frame #10: 0x000000010006a555 kvfile.test`runtime.cgocall + 117
kvfile.test`runtime.cgocall:
-> 0x10006a555 <+117>: xorps %xmm15, %xmm15
0x10006a559 <+121>: movq %gs:0x30, %r14
0x10006a562 <+130>: movl 0x10(%rsp), %eax
0x10006a566 <+134>: movl %eax, 0x1c(%rsp)
(lldb) (lldb) up
up
frame #11: 0x000000010015055f kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fd\
b_tenant_destroy.abi0 + 63
kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_tenant_destroy.abi0:
-> 0x10015055f <+63>: cmpb $0x0, 0x1f61e8(%rip) ; runtime.cgoAlwaysFalse
0x100150566 <+70>: je 0x10015059b ; <+123>
0x100150568 <+72>: movq 0x28(%rsp), %rax
0x10015056d <+77>: xorps %xmm15, %xmm15
(lldb)
going back down the stack and printing this or ptr as we go reveals that it is actually not the Tenant but a ThreadSafeDatabase* that is having the problem:
At the lowest frame (where lldb stopped us on the segfault / EXC_BAD_ACCESS):
(lldb) (lldb) down
down
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:126:4
123
124 ~Reference() {
125 if (ptr)
-> 126 delref(ptr);
127 }
128 Reference& operator=(const Reference& r) {
129 P* oldPtr = ptr;
(lldb) (lldb) p this
p this
(Reference<ThreadSafeDatabase> *) 0x00006000034ac850
(lldb) (lldb) p ptr
p ptr
(ThreadSafeDatabase *) 0x000060000218bbc0
(lldb) (lldb)
still on the lowest frame, lets list the surrounding source code for more context
(lldb) source list -l 80 -c 50
source list -l 80 -c 50
80 };
81
82 #if FLOW_THREAD_SAFE
83 #define ReferenceCounted ThreadSafeReferenceCounted
84 #else
85 #define ReferenceCounted ThreadUnsafeReferenceCounted
86 #endif
87
88 template <class P>
89 void addref(P* ptr) {
90 ptr->addref();
91 }
92
93 template <class P>
94 void delref(P* ptr) {
95 ptr->delref();
96 }
97
98 template <class P>
99 class Reference {
100 public:
101 Reference() : ptr(nullptr) {}
102 explicit Reference(P* ptr) : ptr(ptr) {}
103 static Reference<P> addRef(P* ptr) {
104 ptr->addref();
105 return Reference(ptr);
106 }
107
108 Reference(const Reference& r) : ptr(r.getPtr()) {
109 if (ptr)
110 addref(ptr);
111 }
112 Reference(Reference&& r) noexcept : ptr(r.getPtr()) { r.ptr = nullptr; }
113
114 template <class Q>
115 Reference(const Reference<Q>& r) : ptr(r.getPtr()) {
116 if (ptr)
117 addref(ptr);
118 }
119 template <class Q>
120 Reference(Reference<Q>&& r) : ptr(r.getPtr()) {
121 r.setPtrUnsafe(nullptr);
122 }
123
124 ~Reference() {
125 if (ptr)
-> 126 delref(ptr);
127 }
128 Reference& operator=(const Reference& r) {
129 P* oldPtr = ptr;
(lldb) f
f
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:126:4
123
124 ~Reference() {
125 if (ptr)
-> 126 delref(ptr);
127 }
128 Reference& operator=(const Reference& r) {
129 P* oldPtr = ptr;
(lldb) p *ptr
p *ptr
(ThreadSafeDatabase) {
ThreadSafeReferenceCounted<ThreadSafeDatabase> = {
referenceCount = {
Value = 441
}
}
isConfigDB = false
db = nullptr
}
(lldb) (lldb) p *this
p *this
(Reference<ThreadSafeDatabase>) {
ptr = 0x000060000218bbc0
}
(lldb) (lldb) p ptr
p ptr
(ThreadSafeDatabase *) 0x000060000218bbc0
(lldb) (lldb) p this
p this
(Reference<ThreadSafeDatabase> *) 0x00006000034ac850
(lldb)
The only thing that catches is my eye is that db is nullptr. I wonder what the exact definition of delref is here.
(lldb) disas
disas
libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference:
0x107108dd0 <+0>: pushq %rbp
0x107108dd1 <+1>: movq %rsp, %rbp
0x107108dd4 <+4>: subq $0x10, %rsp
0x107108dd8 <+8>: movq %rdi, -0x8(%rbp)
0x107108ddc <+12>: movq -0x8(%rbp), %rax
0x107108de0 <+16>: movq %rax, -0x10(%rbp)
0x107108de4 <+20>: cmpq $0x0, (%rax)
0x107108de8 <+24>: je 0x107108e04 ; <+52> at FastRef.h:127:2
0x107108dee <+30>: movq -0x10(%rbp), %rax
0x107108df2 <+34>: movq (%rax), %rdi
0x107108df5 <+37>: callq 0x107108e20 ; delref<ThreadSafeDatabase> at FastRef.h:94
-> 0x107108dfa <+42>: jmp 0x107108dff ; <+47> at FastRef.h:126:4
0x107108dff <+47>: jmp 0x107108e04 ; <+52> at FastRef.h:127:2
0x107108e04 <+52>: addq $0x10, %rsp
0x107108e08 <+56>: popq %rbp
0x107108e09 <+57>: retq
0x107108e0a <+58>: movq %rax, %rdi
0x107108e0d <+61>: callq 0x105f5e1a0 ; __clang_call_terminate
(lldb) c
Process 48043 resuming
Process 48043 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
address=0x5fbbbdb0bc00)
frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x000060000218bbc\
0) at FastRef.h:95:7
92
93 template <class P>
94 void delref(P* ptr) {
-> 95 ptr->delref();
96 }
97
98 template <class P>
Target 0: (kvfile.test) stopped.
(lldb)
It appears... that the address of delref has gotten messed up some how???
Lets run it again, since I killed it by continuing, and look at the assembly where it halts:
db.go:266 2025-01-01T11:51:02.484Z goro 2 done
db.go:266 2025-01-01T11:51:02.484Z goro 6 done
Process 48581 stopped
* thread #10, stop reason = EXC_BAD_ACCESS (code=1, address=0xbcb37c51cc0)
frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x0000600003771c8\
0) at FastRef.h:95:7
92
93 template <class P>
94 void delref(P* ptr) {
-> 95 ptr->delref();
96 }
97
98 template <class P>
Target 0: (kvfile.test) stopped.
(lldb) (lldb) disas
disas
libfdb_c.dylib`delref<ThreadSafeDatabase>:
0x107108e20 <+0>: pushq %rbp
0x107108e21 <+1>: movq %rsp, %rbp
0x107108e24 <+4>: subq $0x10, %rsp
0x107108e28 <+8>: movq %rdi, -0x8(%rbp)
0x107108e2c <+12>: movq -0x8(%rbp), %rdi
0x107108e30 <+16>: movq (%rdi), %rax
-> 0x107108e33 <+19>: callq *0x40(%rax)
0x107108e36 <+22>: addq $0x10, %rsp
0x107108e3a <+26>: popq %rbp
0x107108e3b <+27>: retq
(lldb) (lldb) register read
register read
General Purpose Registers:
rax = 0x00000bcb37c51c80
rbx = 0x000000c0000585a8
rcx = 0x0000000100861080
rdx = 0x0000000000000005
rdi = 0x0000600003771c80
rsi = 0x0000000000000004
rbp = 0x000070000b502de0
rsp = 0x000070000b502dd0
r8 = 0x000000c000400008
r9 = 0x0000000000000000
r10 = 0x0000000000000000
r11 = 0x0000000000000000
r12 = 0x0000000100153c20 kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb.(*tena\
nt).destroy
r13 = 0x0000000000000000
r14 = 0x000000c000007c00
r15 = 0x0000000000000000
rip = 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ThreadSafeDatabase*) \
+ 19 at FastRef.h:95:7
rflags = 0x0000000000010202
cs = 0x000000000000002b
fs = 0x0000000000000000
gs = 0x0000000000000000
(lldb)
(lldb) memory read -f x $rax
memory read -f x $rax
error: memory read failed for 0xbcb37c51c00
(lldb)
I'm not an assembly expert, so I'll consult with an AI here...
AI: From this disassembly and register state, we can see that the segfault is happening during a virtual function call, specifically at:
-> 0x107108e33 <+19>: callq *0x40(%rax)
The issue is likely one of these:
- Invalid vtable pointer:
- The code is trying to call a virtual function through the vtable
- It's accessing offset 0x40 (64 bytes) into what should be the vtable
- rax = 0x00000bcb37c51c80 - this is the vtable pointer
- The pointer might be corrupted or the object might have been deleted
- Common causes:
- Use-after-free: Object was deleted but code still tries to use it
- Memory corruption: vtable pointer was overwritten
- Uninitialized pointer: Object was never properly constructed
To debug further, you could:
# Check what's at the vtable address
(lldb) memory read -f x $rax
# Look at the object pointer (in rdi)
(lldb) register read rdi
# Examine the stack trace
(lldb) bt
This looks like a classic use-after-free or memory corruption issue where the virtual function table pointer is invalid.
Me: when I try the first suggestion, I get:
(lldb) memory read -f x $rax
memory read -f x $rax
error: memory read failed for 0xbcb37c51c00
(lldb)
AI:
Ah, this confirms the diagnosis! The fact that the memory read failed means we're trying to access an invalid memory address (0xbcb37c51c80). This is almost certainly a use-after-free bug where:
- The object was allocated and had a valid vtable
- The object was deleted/freed
- Code is still trying to use the object through a dangling pointer
- The memory address is no longer valid/accessible
To track this down:
# Get backtrace to see how we got here
(lldb) bt
# If you have source code, set a breakpoint earlier and watch the pointer
(lldb) b delref<ThreadSafeDatabase>
(lldb) run
# When stopped, watch the object pointer
(lldb) watchpoint set expression -- ptr # replace 'ptr' with actual pointer name
Common patterns that lead to this:
- Reference counting errors
- Storing raw pointers to deleted objects
- Async/threading issues where object lifetime isn't properly managed
- Double-delete scenarios
The fact this is happening in a delref function suggests this might be related to reference counting - possibly the object is being deleted twice or accessed after its reference count hits zero.
Fascinating: I notice that I must run both tests in the db_test.go for the use-after-delete crash to occur. If I run only one test (either) then there is no crash. Test001... always runs before Test002.
Since it is a finalizer (on a tenant structure that itself has a reference to the database) that is causing the crash, and they can run at any time or never... this suggests... perhaps the problem is a delayed finalizer from the first test that gets run during the second test.
"Note the many caveats listed in the runtime.SetFinalizer documentation. You should never depend on or expect a finalizer to ever get run. Instead, always make sure to call the appropriate cleanup function(s) (e.g. defer file.Close() right after checking the errors of os.Open)." -- from https://stackoverflow.com/questions/8595457/which-objects-are-finalized-in-go-by-default-and-what-are-some-of-the-pitfalls-o
So a fundamental design flaw in the FDB Go bindings seems to be using finalizers at all... and having them decrease refcounts (in this case on a database that has already been deleted... <- current hypothesis, not final)
reading the docs in the link above on finalizers, these things stand out:
"If a cyclic structure includes a block with a finalizer, that cycle is not guaranteed to be garbage collected and the finalizer is not guaranteed to run, because there is no ordering that respects the dependencies."
"The finalizer is scheduled to run at some arbitrary time after the program can no longer reach the object to which obj points."
"There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program. "
"A finalizer may run as soon as an object becomes unreachable. In order to use finalizers correctly, the program must ensure that the object is reachable until it is no longer required. "
Ah hah. We have the smoking gun.
See https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.3.with.refcounts.and.this.addresses.txt#L269
where I instrumented the C++ ref-counts and log the this-pointer of the ThreadSafeDatabase in fdbclient/include/fdbclient/ThreadSafeTransaction.h
We can clearly see a database from the first test (0x600003e1c000 is its "this" pointer) having its ref-count decremented during Test002 (line 269 of crash.log.3.with.refcounts.and.this.addresses.txt). I've added this log to the repo for reference and inspection.
Attaching the C++ diff/patch-like file (actually a git diff) if anyone wishes to reproduce in the C++. git.diffs.txt
This is an interesting issue, although the problem is that tenant feature people are currently in Snowflake rather than Apple. I am familiar with the refcount part, but I have never thought go programming using go.
But
"There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program. "
might be the root of the pain.
might be the root of the pain.
Indeed.
There are so many issues with the Go bindings. I've fixed a bunch of bugs and races, but there are still more to do. It hasn't gotten much love in the last five years; I can't imagine it is widely used. (Update: main does have some updates, but apparently v7.3.57 the latest release is much older). Still, I'm trying to keep backwards-compatibility even though this is painful.
I'll submit a pull request when I'm done. I'm not rushing it because I'd like to get rid of all of the finalizers first.
I was about to mention my branch, but it is based on 7.3.57, while trunk main appears to be substantially different; for instance there is no tenant.go file at all on main; and the finalizer for the database has already been deleted. However finalizers for memory reclamation for Futures and Transactions are still in use, which is bad; time bombs /bugs waiting to happen.
I'll need to read more of the history and try and understand what's going on between the branches.
@xis19 Is tenant support on main, or only on a Snowflake branch? I thought v7.3.57, the latest released, was a non-Snowflake branch, but it has tenant support, and main (at least the Go bindings, I don't know about the server) does not, so I'm not sure what is going on. Any context would be helpful.
Update: for example, the most recent common ancestor between 7.3.57 and main is 21 months old; suggesting serious divergence.
$ git merge-base main 7.3.57
warning: refname '7.3.57' is ambiguous.
c5b8908828d7cd1e209d08a54c7c751a482cfa3b
$ git log c5b8908828d7cd1e209d08a54c7c751a482cfa3b
commit c5b8908828d7cd1e209d08a54c7c751a482cfa3b
Author: Jingyu Zhou <[email protected]>
Date: Tue Apr 4 14:31:21 2023 -0700
Update 7.1.30 release notes (#9887)
To my understanding, tenant is not being used by us and we are not supporting it (@jzhou77 please correct me if I am wrong), the full tenant support is available in Snowflake private repository, which is maintained by Snowflake people. So we may not want to touch it unless it breaks something on our side.
There is a similar story in Python, where the "destructor" __del__ is not guaranteed to be called. But we did not have this kind of problem though. Not sure if this information would help or not.
I'd like to get rid of all of the finalizers first.
I second this; there's no clean way to map C++ destructors to Go finalizers and obtain reliable behavior out of it. Let me know if you need help with this.
I was about to mention my branch, but it is based on 7.3.57, while trunk main appears to be substantially different
I fixed already a few bugs related to finalizers, but there are still occasional issues related to network thread and destruction on C++ side which is not coordinated with the Go side, since I occasionally still see some SIGSEGV in tests.
I fixed already a few bugs related to finalizers, but there are still occasional issues related to network thread and destruction on C++ side which is not coordinated with the Go side, since I occasionally still see some SIGSEGV in tests.
@gm42 ouch! that is painful.
I'm blocked on doing anything here because I don't understand the versioning structure in use here, so I would not know on which branch to base any fixes. Do you comprehend it? Usually patches go on top of master/main, but the latest release 7.3.57 appears to have nothing to do with main.
Usually patches go on top of master/main, but the latest release 7.3.57 appears to have nothing to do with main.
I do not understand what do you mean by "have nothing to do withmain". But if your focus is on Go part, I would think you can start with 7.3.57. We can deal with the merging issue.
I do not understand what do you mean by "have nothing to do with main". But if your focus is on Go part, I would think you can start with 7.3.57. We can deal with the merging issue.
Hi @xis19 - Thanks your your help. I was referring to my observations, above, in https://github.com/apple/foundationdb/issues/11856#issuecomment-2567266811, where I note alot of divergence which was confusing to me. The Go part on main seems more up-to-date; so would it not be better to start there? Is main due to be released as 7.3.58 at some point? How is 7.1 different from 7.3, as I've seen it implied (however right or wrong) that both are being released from?
[my branch] is based on 7.3.57, while trunk main appears to be substantially different; for instance there is no tenant.go file at all on main; and the finalizer for the database has already been deleted. However finalizers for memory reclamation for Futures and Transactions are still in use, which is bad; time bombs /bugs waiting to happen.
I'll need to read more of the history and try and understand what's going on between the branches.
...
Update: for example, the most recent common ancestor between 7.3.57 and main is 21 months old; suggesting serious divergence.
$ git merge-base main 7.3.57 warning: refname '7.3.57' is ambiguous. c5b8908828d7cd1e209d08a54c7c751a482cfa3b
$ git log c5b8908828d7cd1e209d08a54c7c751a482cfa3b commit c5b8908828d7cd1e209d08a54c7c751a482cfa3b Author: Jingyu Zhou [email protected] Date: Tue Apr 4 14:31:21 2023 -0700
Update 7.1.30 release notes (#9887)
@glycerine I think it's better to fix on main branch first. if needed, we can cherrypick to 7.3 branch. Since we are going to cut 8.0 release in March, maybe we don't have to cherrypick to 7.3.
@glycerine if it can be of any help, I have submitted all my PRs to main. I think that needs to be done anyways before any backport on release branches.
@gm42 makes perfect sense, thanks.
Just to update: I'm not actively working on at a patch at the moment. There were other issues besides this one that caused me to reconsider if FoundationDB is the right choice for my current project. So @gm42 Giuseppe, if you or someone else wants to dig into a fix for this and/or other Go issues, we won't be duplicating effort. Feel free to take the lead. I can assist with testing and review/discussion if you do, but I'm not actively working towards sending a PR at this point.
I have created a PR #12052 to start discussing the removal of finalizers; feel free to chip in the discussion there.