foundationdb
foundationdb copied to clipboard
Using DNS entries in cluster file can cause SIGSEV
As part of this PR: https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1932 I was looking into a failure that looked like a race condition on the first glance but it's not. When I run the same test without using DNS entries everything behaves correct. I can reproduce this issue with the following steps:
- Create a cluster that uses DNS entries in the cluster file
- Stop the operator
- Delete all Pods of the cluster
- Restart the operator
Now the operator is in a crash loop and cannot recover. It seems the issue is that none of the DNS entries can be resolved, which causes the following line to throw an error: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/AutoPublicAddress.cpp#L51. The method is called in the createDatabase method of the NativeAPI: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/NativeAPI.actor.cpp#L2213. In the go bindings the createDatabase is called here: https://github.com/apple/foundationdb/blob/release-7.1/bindings/go/src/fdb/fdb.go#L332-L338 but it seems like the error is not correctly passed up all layers.
I only checked the go bindings for this error, but it might be the case that other bindings are also affected.
Stack trace:
Error determining public address.
fatal error: unexpected signal during runtime execution
Error determining public address.
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x7f8c78ded03a]
goroutine 338 [syscall]:
runtime.cgocall(0x15c3670, 0xc0005ba788)
/usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc0005ba760 sp=0xc0005ba728 pc=0x40887c
github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_run_network()
_cgo_gotypes.go:378 +0x48 fp=0xc0005ba788 sp=0xc0005ba760 pc=0x139f468
github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork.func1.1()
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/fdb.go:216 +0x2d fp=0xc0005ba7e0 sp=0xc0005ba788 pc=0x13a266d
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0005ba7e8 sp=0xc0005ba7e0 pc=0x470141
created by github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork.func1
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/fdb.go:215 +0x113
...
goroutine 311 [sync.Mutex.Lock]:
runtime.gopark(0xc000966528?, 0x46b37b?, 0x60?, 0x92?, 0x200000003?)
/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0009664e8 sp=0xc0009664c8 pc=0x43f476
runtime.goparkunlock(...)
/usr/local/go/src/runtime/proc.go:387
runtime.semacquire1(0xc00065e07c, 0x0?, 0x3, 0x1, 0xce?)
/usr/local/go/src/runtime/sema.go:160 +0x20f fp=0xc000966550 sp=0xc0009664e8 pc=0x45022f
sync.runtime_SemacquireMutex(0xc0009665c8?, 0xc5?, 0x7f8bf4010ec0?)
/usr/local/go/src/runtime/sema.go:77 +0x26 fp=0xc000966588 sp=0xc000966550 pc=0x46c266
sync.(*Mutex).lockSlow(0xc00065e078)
/usr/local/go/src/sync/mutex.go:171 +0x165 fp=0xc0009665d8 sp=0xc000966588 pc=0x48b565
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:90
github.com/apple/foundationdb/bindings/go/src/fdb.fdb_future_block_until_ready(0x0?)
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:99 +0x8c fp=0xc000966608 sp=0xc0009665d8 pc=0x13a2d2c
github.com/apple/foundationdb/bindings/go/src/fdb.(*future).BlockUntilReady(0x1691820?)
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:104 +0x6c fp=0xc000966648 sp=0xc000966608 pc=0x13a2eec
github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).Get.func1()
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:151 +0xb3 fp=0xc0009666c0 sp=0xc000966648 pc=0x13a33d3
sync.(*Once).doSlow(0xc000966760?, 0x13a6505?)
/usr/local/go/src/sync/once.go:74 +0xc2 fp=0xc000966720 sp=0xc0009666c0 pc=0x48b902
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).Get(0xc0004740c0?)
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:144 +0x4a fp=0xc000966750 sp=0xc000966720 pc=0x13a32ca
github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).MustGet(0xc00065e050?)
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:169 +0x19 fp=0xc000966770 sp=0xc000966750 pc=0x13a3739
github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*realFdbLibClient).getValueFromDBUsingKey.func2({0x41f080?})
/workspace/fdbclient/fdb_client.go:65 +0xee fp=0xc0009667d8 sp=0xc000966770 pc=0x15b7bee
github.com/apple/foundationdb/bindings/go/src/fdb.Database.Transact.func1()
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/database.go:134 +0x72 fp=0xc000966838 sp=0xc0009667d8 pc=0x13a13f2
github.com/apple/foundationdb/bindings/go/src/fdb.retryable(0xc000966880, 0xc000966870)
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/database.go:85 +0x31 fp=0xc000966860 sp=0xc000966838 pc=0x13a1231
github.com/apple/foundationdb/bindings/go/src/fdb.Database.Transact({0x1b7b5c8?}, 0xc000966968)
/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/database.go:143 +0x8e fp=0xc0009668a8 sp=0xc000966860 pc=0x13a134e
github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*realFdbLibClient).getValueFromDBUsingKey(0xc00049a780, {0x191ab19, 0xe}, 0x9502f9000)
/workspace/fdbclient/fdb_client.go:55 +0x1e7 fp=0xc0009669a0 sp=0xc0009668a8 pc=0x15b77c7
github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.getConnectionStringFromDB(...)
/workspace/fdbclient/common.go:134
github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).GetConnectionString(0x1b7b5c8?)
/workspace/fdbclient/admin_client.go:489 +0x53 fp=0xc000966a40 sp=0xc0009669a0 pc=0x15b4ff3
github.com/FoundationDB/fdb-kubernetes-operator/controllers.tryConnectionOptions({{0x1b7b5c8?, 0xc0003fd800?}, 0x0?}, 0xc000144000, 0xc0000d25a0)
/workspace/controllers/update_status.go:329 +0x63b fp=0xc000966cb8 sp=0xc000966a40 pc=0x15a10bb
github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).getStatusFromClusterOrDummyStatus(0xc0000d25a0, {{0x1b7b5c8?, 0xc0003fd800?}, 0x195e937?}, 0xc000144000)
/workspace/controllers/cluster_controller.go:464 +0xa6 fp=0xc000966da0 sp=0xc000966cb8 pc=0x15859c6
github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile(0xc0000d25a0, {0x1b79270, 0xc0003fd3e0}, {{{0xc0005bcc40?, 0x0?}, {0xc000737050?, 0x412507?}}})
/workspace/controllers/cluster_controller.go:138 +0x715 fp=0xc000967cd0 sp=0xc000966da0 pc=0x15828d5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1b79270?, {0x1b79270?, 0xc0003fd3e0?}, {{{0xc0005bcc40?, 0x16a8820?}, {0xc000737050?, 0x413fc6?}}})
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122 +0xc8 fp=0xc000967d70 sp=0xc000967cd0 pc=0x14744c8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003df540, {0x1b791c8, 0xc0004d3400}, {0x1780880?, 0xc0003c3940?})
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323 +0x377 fp=0xc000967ee0 sp=0xc000967d70 pc=0x1476617
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003df540, {0x1b791c8, 0xc0004d3400})
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274 +0x1d9 fp=0xc000967f80 sp=0xc000967ee0 pc=0x1475e19
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x85 fp=0xc000967fe0 sp=0xc000967f80 pc=0x1475865
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000967fe8 sp=0xc000967fe0 pc=0x470141
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0x587
I did a bit more debugging and it seems like the error code is not returned by the create database method and instead of a nil/null pointer an empty struct is returned, logged the value of outdb before and after the fdb_create_database method was called.
Before fdb_create_database outdb: <nil>
After fdb_create_database outdb: &{{{}}} error 0
I have to make some changes to the code next week as I currently don't have any trace events as the code throws the error before creating the trace file here: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/NativeAPI.actor.cpp#L2213. It seems like the error is swallowed here: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/ThreadSafeTransaction.cpp#L145-L164
If you observed an issue and used DNS entries in the cluster file you might have hit: apple/foundationdb#11222. Especially in the case when the coordinators are not reachable.
Commenting in reply from https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1949#issuecomment-1963836243
The issue was happening also without the DNS names feature, and even when using a single-process in-memory FDB. It is fairly reproducible on my end; I mitigated it by adding logic to use a wait group to track all the C.fdb_run_network() goroutines created, and then in the Go finalizer for the client connection wait for such wait group after calling C.fdb_stop_network(), but I do not exclude that better/more elegant solutions can be found.
If you observed an issue and used DNS entries in the cluster file you might have hit: apple/foundationdb#11222. Especially in the case when the coordinators are not reachable.
Commenting in reply from FoundationDB/fdb-kubernetes-operator#1949 (comment) The issue was happening also without the DNS names feature, and even when using a single-process in-memory FDB. It is fairly reproducible on my end; I mitigated it by adding logic to use a wait group to track all the
C.fdb_run_network()goroutines created, and then in the Go finalizer for the client connection wait for such wait group after callingC.fdb_stop_network(), but I do not exclude that better/more elegant solutions can be found.
Do you mind to share the code? From what I see fdb_stop_network is not actively called in the FBD go bindings.
Sure! that function fdb_stop_network is indeed not actively called from the Go binding, but perhaps a destructor on the C binding calls it? I cannot explain this, but since the issue was happening only in test suites (which are opening many FDB client connections) what worked is the following diff and calling fdb.StopNetwork() in the TestMain of each test package, in order to guarantee a serialization of these calls:
--- a/foundationdb/bindings/go/src/fdb/fdb.go
+++ b/foundationdb/bindings/go/src/fdb/fdb.go
@@ -196,6 +196,7 @@ var networkStarted bool
var networkMutex sync.Mutex
var openDatabases map[string]Database
+var runningNetwork sync.WaitGroup
func init() {
openDatabases = make(map[string]Database)
@@ -205,9 +206,10 @@ func startNetwork() error {
if e := C.fdb_setup_network(); e != 0 {
return Error{int(e)}
}
-
+ runningNetwork.Add(1)
go func() {
e := C.fdb_run_network()
+ runningNetwork.Done()
if e != 0 {
log.Printf("Unhandled error in FoundationDB network thread: %v (%v)\n", C.GoString(C.fdb_get_error(e)), e)
}
@@ -232,6 +234,25 @@ func StartNetwork() error {
return startNetwork()
}
+// StopNetwork stops the FoundationDB client global network thread and waits for the related goroutine to exit.
+// After stopping the network the networkStarted flag is left as true, so that network cannot be started again.
+// StopNetwork my be called more then once.
+// See also: https://github.com/apple/foundationdb/issues/3015
+func StopNetwork() {
+ waitForNetwork := false
+ {
+ networkMutex.Lock()
+ defer networkMutex.Unlock()
+ if networkStarted {
+ C.fdb_stop_network()
+ waitForNetwork = true
+ }
+ }
+ if waitForNetwork {
+ runningNetwork.Wait()
+ }
+}
+
// DefaultClusterFile should be passed to fdb.Open to allow the FoundationDB C
// library to select the platform-appropriate default cluster file on the current machine.
const DefaultClusterFile string = ""
Related issue: https://github.com/apple/foundationdb/issues/3015
Also related: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/1950
NOTE: the simplification I did here does not solve the issue, but finding a way to serialize fdb_run_network()/fdb_stop_network() will solve it.
I was busy with some other things. Thanks for providing that snippet, I'll give it a try. My plan is to get a core dump and see where things are crashing and how to make the bindings more reliable.
Easiest way you can reproduce it is probably by making a few separate Go packages, and start 20-30 FDB client connections from the tests in each of them.
After doing some further debugging and testing it seems the root cause of the issue was/is that the operator is currently build with the fdb bindings and header files for 6.2 and before starting the operator we are copying the 7.1 library into the operator container and that causes this behaviour. When compiling the operator directly with 7.1 I don't see this issue anymore.