foundationdb icon indicating copy to clipboard operation
foundationdb copied to clipboard

Using DNS entries in cluster file can cause SIGSEV

Open johscheuer opened this issue 1 year ago • 3 comments

As part of this PR: https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1932 I was looking into a failure that looked like a race condition on the first glance but it's not. When I run the same test without using DNS entries everything behaves correct. I can reproduce this issue with the following steps:

  1. Create a cluster that uses DNS entries in the cluster file
  2. Stop the operator
  3. Delete all Pods of the cluster
  4. Restart the operator

Now the operator is in a crash loop and cannot recover. It seems the issue is that none of the DNS entries can be resolved, which causes the following line to throw an error: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/AutoPublicAddress.cpp#L51. The method is called in the createDatabase method of the NativeAPI: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/NativeAPI.actor.cpp#L2213. In the go bindings the createDatabase is called here: https://github.com/apple/foundationdb/blob/release-7.1/bindings/go/src/fdb/fdb.go#L332-L338 but it seems like the error is not correctly passed up all layers.

I only checked the go bindings for this error, but it might be the case that other bindings are also affected.

Stack trace:

Error determining public address.
fatal error: unexpected signal during runtime execution
Error determining public address.
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x7f8c78ded03a]


goroutine 338 [syscall]:
runtime.cgocall(0x15c3670, 0xc0005ba788)
	/usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc0005ba760 sp=0xc0005ba728 pc=0x40887c
github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_run_network()
	_cgo_gotypes.go:378 +0x48 fp=0xc0005ba788 sp=0xc0005ba760 pc=0x139f468
github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork.func1.1()
	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/fdb.go:216 +0x2d fp=0xc0005ba7e0 sp=0xc0005ba788 pc=0x13a266d
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0005ba7e8 sp=0xc0005ba7e0 pc=0x470141
created by github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork.func1
	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/fdb.go:215 +0x113

...

  goroutine 311 [sync.Mutex.Lock]:
  runtime.gopark(0xc000966528?, 0x46b37b?, 0x60?, 0x92?, 0x200000003?)
  	/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0009664e8 sp=0xc0009664c8 pc=0x43f476
  runtime.goparkunlock(...)
  	/usr/local/go/src/runtime/proc.go:387
  runtime.semacquire1(0xc00065e07c, 0x0?, 0x3, 0x1, 0xce?)
  	/usr/local/go/src/runtime/sema.go:160 +0x20f fp=0xc000966550 sp=0xc0009664e8 pc=0x45022f
  sync.runtime_SemacquireMutex(0xc0009665c8?, 0xc5?, 0x7f8bf4010ec0?)
  	/usr/local/go/src/runtime/sema.go:77 +0x26 fp=0xc000966588 sp=0xc000966550 pc=0x46c266
  sync.(*Mutex).lockSlow(0xc00065e078)
  	/usr/local/go/src/sync/mutex.go:171 +0x165 fp=0xc0009665d8 sp=0xc000966588 pc=0x48b565
  sync.(*Mutex).Lock(...)
  	/usr/local/go/src/sync/mutex.go:90
  github.com/apple/foundationdb/bindings/go/src/fdb.fdb_future_block_until_ready(0x0?)
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:99 +0x8c fp=0xc000966608 sp=0xc0009665d8 pc=0x13a2d2c
  github.com/apple/foundationdb/bindings/go/src/fdb.(*future).BlockUntilReady(0x1691820?)
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:104 +0x6c fp=0xc000966648 sp=0xc000966608 pc=0x13a2eec
  github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).Get.func1()
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:151 +0xb3 fp=0xc0009666c0 sp=0xc000966648 pc=0x13a33d3
  sync.(*Once).doSlow(0xc000966760?, 0x13a6505?)
  	/usr/local/go/src/sync/once.go:74 +0xc2 fp=0xc000966720 sp=0xc0009666c0 pc=0x48b902
  sync.(*Once).Do(...)
  	/usr/local/go/src/sync/once.go:65
  github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).Get(0xc0004740c0?)
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:144 +0x4a fp=0xc000966750 sp=0xc000966720 pc=0x13a32ca
  github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).MustGet(0xc00065e050?)
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/futures.go:169 +0x19 fp=0xc000966770 sp=0xc000966750 pc=0x13a3739
  github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*realFdbLibClient).getValueFromDBUsingKey.func2({0x41f080?})
  	/workspace/fdbclient/fdb_client.go:65 +0xee fp=0xc0009667d8 sp=0xc000966770 pc=0x15b7bee
  github.com/apple/foundationdb/bindings/go/src/fdb.Database.Transact.func1()
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/database.go:134 +0x72 fp=0xc000966838 sp=0xc0009667d8 pc=0x13a13f2
  github.com/apple/foundationdb/bindings/go/src/fdb.retryable(0xc000966880, 0xc000966870)
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/database.go:85 +0x31 fp=0xc000966860 sp=0xc000966838 pc=0x13a1231
  github.com/apple/foundationdb/bindings/go/src/fdb.Database.Transact({0x1b7b5c8?}, 0xc000966968)
  	/go/pkg/mod/github.com/johscheuer/foundationdb/bindings/[email protected]/src/fdb/database.go:143 +0x8e fp=0xc0009668a8 sp=0xc000966860 pc=0x13a134e
  github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*realFdbLibClient).getValueFromDBUsingKey(0xc00049a780, {0x191ab19, 0xe}, 0x9502f9000)
  	/workspace/fdbclient/fdb_client.go:55 +0x1e7 fp=0xc0009669a0 sp=0xc0009668a8 pc=0x15b77c7
  github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.getConnectionStringFromDB(...)
  	/workspace/fdbclient/common.go:134
  github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).GetConnectionString(0x1b7b5c8?)
  	/workspace/fdbclient/admin_client.go:489 +0x53 fp=0xc000966a40 sp=0xc0009669a0 pc=0x15b4ff3
  github.com/FoundationDB/fdb-kubernetes-operator/controllers.tryConnectionOptions({{0x1b7b5c8?, 0xc0003fd800?}, 0x0?}, 0xc000144000, 0xc0000d25a0)
  	/workspace/controllers/update_status.go:329 +0x63b fp=0xc000966cb8 sp=0xc000966a40 pc=0x15a10bb
  github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).getStatusFromClusterOrDummyStatus(0xc0000d25a0, {{0x1b7b5c8?, 0xc0003fd800?}, 0x195e937?}, 0xc000144000)
  	/workspace/controllers/cluster_controller.go:464 +0xa6 fp=0xc000966da0 sp=0xc000966cb8 pc=0x15859c6
  github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile(0xc0000d25a0, {0x1b79270, 0xc0003fd3e0}, {{{0xc0005bcc40?, 0x0?}, {0xc000737050?, 0x412507?}}})
  	/workspace/controllers/cluster_controller.go:138 +0x715 fp=0xc000967cd0 sp=0xc000966da0 pc=0x15828d5
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1b79270?, {0x1b79270?, 0xc0003fd3e0?}, {{{0xc0005bcc40?, 0x16a8820?}, {0xc000737050?, 0x413fc6?}}})
  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122 +0xc8 fp=0xc000967d70 sp=0xc000967cd0 pc=0x14744c8
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003df540, {0x1b791c8, 0xc0004d3400}, {0x1780880?, 0xc0003c3940?})
  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323 +0x377 fp=0xc000967ee0 sp=0xc000967d70 pc=0x1476617
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003df540, {0x1b791c8, 0xc0004d3400})
  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274 +0x1d9 fp=0xc000967f80 sp=0xc000967ee0 pc=0x1475e19
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x85 fp=0xc000967fe0 sp=0xc000967f80 pc=0x1475865
  runtime.goexit()
  	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000967fe8 sp=0xc000967fe0 pc=0x470141
  created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0x587

johscheuer avatar Feb 23 '24 14:02 johscheuer

I did a bit more debugging and it seems like the error code is not returned by the create database method and instead of a nil/null pointer an empty struct is returned, logged the value of outdb before and after the fdb_create_database method was called.

Before fdb_create_database outdb: <nil>
After fdb_create_database outdb: &{{{}}} error 0

I have to make some changes to the code next week as I currently don't have any trace events as the code throws the error before creating the trace file here: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/NativeAPI.actor.cpp#L2213. It seems like the error is swallowed here: https://github.com/apple/foundationdb/blob/release-7.1/fdbclient/ThreadSafeTransaction.cpp#L145-L164

johscheuer avatar Feb 23 '24 16:02 johscheuer

If you observed an issue and used DNS entries in the cluster file you might have hit: apple/foundationdb#11222. Especially in the case when the coordinators are not reachable.

Commenting in reply from https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1949#issuecomment-1963836243 The issue was happening also without the DNS names feature, and even when using a single-process in-memory FDB. It is fairly reproducible on my end; I mitigated it by adding logic to use a wait group to track all the C.fdb_run_network() goroutines created, and then in the Go finalizer for the client connection wait for such wait group after calling C.fdb_stop_network(), but I do not exclude that better/more elegant solutions can be found.

gm42 avatar Feb 26 '24 11:02 gm42

If you observed an issue and used DNS entries in the cluster file you might have hit: apple/foundationdb#11222. Especially in the case when the coordinators are not reachable.

Commenting in reply from FoundationDB/fdb-kubernetes-operator#1949 (comment) The issue was happening also without the DNS names feature, and even when using a single-process in-memory FDB. It is fairly reproducible on my end; I mitigated it by adding logic to use a wait group to track all the C.fdb_run_network() goroutines created, and then in the Go finalizer for the client connection wait for such wait group after calling C.fdb_stop_network(), but I do not exclude that better/more elegant solutions can be found.

Do you mind to share the code? From what I see fdb_stop_network is not actively called in the FBD go bindings.

johscheuer avatar Feb 26 '24 14:02 johscheuer

Sure! that function fdb_stop_network is indeed not actively called from the Go binding, but perhaps a destructor on the C binding calls it? I cannot explain this, but since the issue was happening only in test suites (which are opening many FDB client connections) what worked is the following diff and calling fdb.StopNetwork() in the TestMain of each test package, in order to guarantee a serialization of these calls:

--- a/foundationdb/bindings/go/src/fdb/fdb.go
+++ b/foundationdb/bindings/go/src/fdb/fdb.go
@@ -196,6 +196,7 @@ var networkStarted bool
 var networkMutex sync.Mutex
 
 var openDatabases map[string]Database
+var runningNetwork sync.WaitGroup
 
 func init() {
        openDatabases = make(map[string]Database)
@@ -205,9 +206,10 @@ func startNetwork() error {
        if e := C.fdb_setup_network(); e != 0 {
                return Error{int(e)}
        }
-
+       runningNetwork.Add(1)
        go func() {
                e := C.fdb_run_network()
+               runningNetwork.Done()
                if e != 0 {
                        log.Printf("Unhandled error in FoundationDB network thread: %v (%v)\n", C.GoString(C.fdb_get_error(e)), e)
                }
@@ -232,6 +234,25 @@ func StartNetwork() error {
        return startNetwork()
 }
 
+// StopNetwork stops the FoundationDB client global network thread and waits for the related goroutine to exit.
+// After stopping the network the networkStarted flag is left as true, so that network cannot be started again.
+// StopNetwork my be called more then once.
+// See also: https://github.com/apple/foundationdb/issues/3015
+func StopNetwork() {
+       waitForNetwork := false
+       {
+               networkMutex.Lock()
+               defer networkMutex.Unlock()
+               if networkStarted {
+                       C.fdb_stop_network()
+                       waitForNetwork = true
+               }
+       }
+       if waitForNetwork {
+               runningNetwork.Wait()
+       }
+}
+
 // DefaultClusterFile should be passed to fdb.Open to allow the FoundationDB C
 // library to select the platform-appropriate default cluster file on the current machine.
 const DefaultClusterFile string = ""

Related issue: https://github.com/apple/foundationdb/issues/3015

gm42 avatar Feb 29 '24 08:02 gm42

Also related: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/1950

NOTE: the simplification I did here does not solve the issue, but finding a way to serialize fdb_run_network()/fdb_stop_network() will solve it.

gm42 avatar Feb 29 '24 08:02 gm42

I was busy with some other things. Thanks for providing that snippet, I'll give it a try. My plan is to get a core dump and see where things are crashing and how to make the bindings more reliable.

johscheuer avatar Mar 05 '24 10:03 johscheuer

Easiest way you can reproduce it is probably by making a few separate Go packages, and start 20-30 FDB client connections from the tests in each of them.

gm42 avatar Mar 05 '24 11:03 gm42

After doing some further debugging and testing it seems the root cause of the issue was/is that the operator is currently build with the fdb bindings and header files for 6.2 and before starting the operator we are copying the 7.1 library into the operator container and that causes this behaviour. When compiling the operator directly with 7.1 I don't see this issue anymore.

johscheuer avatar Mar 05 '24 15:03 johscheuer