upjet icon indicating copy to clipboard operation
upjet copied to clipboard

Race condition in terraformPluginSDKExternal causing provider restarts

Open jake-ciolek opened this issue 10 months ago • 4 comments

What happened?

This has been previously reported here and here. Unfortunately, it hasn't been resolved yet.

This issue is prevalent in big environments where the provider manages many resources. The end result is that the Upjet provider pod restarts and affects stability in large Crossplane-managed environments using Upjet generated providers.

How can we reproduce it?

This is observed in a large Upjet-managed environment, for example one with a few hundred SQS Queues.

Root cause

I looked into this a bit and ran a race detector on Upjet itself, it fails with:

==================
WARNING: DATA RACE
Write at 0x000103c39580 by goroutine 82:
  github.com/crossplane/upjet/pkg/resource/fake.(*Observable).SetObservation()
      /Users/[email protected]/25feb/upjet/pkg/resource/fake/terraformed.go:32 +0x3c
  github.com/crossplane/upjet/pkg/resource/fake.(*Terraformed).SetObservation()
      <autogenerated>:1 +0x20
  github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKExternal).Update()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_tfpluginsdk.go:718 +0x530
  github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKAsyncExternal).Update.func1()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_async_tfpluginsdk.go:199 +0x2a0

Previous write at 0x000103c39580 by goroutine 79:
  github.com/crossplane/upjet/pkg/resource/fake.(*Observable).SetObservation()
      /Users/[email protected]/25feb/upjet/pkg/resource/fake/terraformed.go:32 +0x3c
  github.com/crossplane/upjet/pkg/resource/fake.(*Terraformed).SetObservation()
      <autogenerated>:1 +0x20
  github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKExternal).Create()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_tfpluginsdk.go:660 +0x5c0
  github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKAsyncExternal).Create.func1()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_async_tfpluginsdk.go:166 +0x2a0

Goroutine 82 (running) created at:
  github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKAsyncExternal).Update()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_async_tfpluginsdk.go:178 +0x244
  github.com/crossplane/upjet/pkg/controller.TestAsyncTerraformPluginSDKUpdate.func3()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_async_tfpluginsdk_test.go:286 +0x94
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.0/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.0/libexec/src/testing/testing.go:1851 +0x40

Goroutine 79 (finished) created at:
  github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKAsyncExternal).Create()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_async_tfpluginsdk.go:145 +0x244
  github.com/crossplane/upjet/pkg/controller.TestAsyncTerraformPluginSDKCreate.func3()
      /Users/[email protected]/25feb/upjet/pkg/controller/external_async_tfpluginsdk_test.go:242 +0x94
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.0/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.0/libexec/src/testing/testing.go:1851 +0x40
==================
--- FAIL: TestConnect (0.00s)
    testing.go:1490: race detected during execution of test

It looks like it might be the cause of the problem, where the controller's Create and Update call the SetObservation method concurrently on a vanilla Go map which is not thread safe.

Solution

Introduce synchronization or use a thread-safe data structure like sync.Map.

jake-ciolek avatar Feb 19 '25 10:02 jake-ciolek

We've seen another failure related to this code, but now it's about writing to an uninitialized map. This occured in the S3 provider.

See:

2025/02/27 04:40:16 [DEBUG] Waiting for state to become: [success]
panic: assignment to entry in nil map

goroutine 292330486 [running]:
reflect.mapassign0(0xeb14160, 0xc002388020?, 0xc008c7dc38?, 0xc01880bd40?)
	runtime/map.go:1370 +0x1d
reflect.mapassign(0xc01880bd40?, 0xc005853b88?, 0xc005853bd8?, 0x40ffa9?)
	reflect/value.go:3828 +0x8d
github.com/modern-go/reflect2.(*UnsafeMapType).UnsafeSetIndex(...)
	github.com/modern-go/[email protected]/unsafe_map.go:76
github.com/json-iterator/go.(*mapDecoder).Decode(0xc006a17180, 0xc00f5114a0, 0xc01880bd40)
	github.com/json-iterator/[email protected]/reflect_map.go:191 +0x325
github.com/json-iterator/go.(*placeholderDecoder).Decode(0xc00c557f00?, 0x4159eb?, 0xc005853c60?)
	github.com/json-iterator/[email protected]/reflect.go:324 +0x1b
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc00b583640, 0xc00dba2000?, 0xc01880bd40)
	github.com/json-iterator/[email protected]/reflect_struct_decoder.go:1054 +0x50
github.com/json-iterator/go.(*generalStructDecoder).decodeOneField(0xc00b583a80, 0xc004e4ced0?, 0xc01880bd40)
	github.com/json-iterator/[email protected]/reflect_struct_decoder.go:552 +0x29c
github.com/json-iterator/go.(*generalStructDecoder).Decode(0xc00b583a80, 0xdb9d4e0?, 0xc01880bd40)
	github.com/json-iterator/[email protected]/reflect_struct_decoder.go:508 +0x8b
github.com/json-iterator/go.(*Iterator).ReadVal(0xc01880bd40, {0xf00b7c0, 0xc00f511398})
	github.com/json-iterator/[email protected]/reflect.go:79 +0x123
github.com/json-iterator/go.(*frozenConfig).Unmarshal(0xc0002394a0, {0xc00b42c700?, 0xc0173eeb70?, 0x10cb0ae0?}, {0xf00b7c0, 0xc00f511398})
	github.com/json-iterator/[email protected]/config.go:348 +0x99
github.com/upbound/provider-aws/apis/s3/v1beta1.(*Bucket).SetObservation(0xc00f511200, 0xc00e725750?)
	github.com/upbound/provider-aws/apis/s3/v1beta1/zz_generated_terraformed.go:47 +0x7e
github.com/crossplane/upjet/pkg/controller.(*noForkExternal).Update(0xc018819040, {0x12df5968, 0xc010945f80}, {0x12e4c5c0?, 0xc00f511200})
	github.com/crossplane/[email protected]/pkg/controller/external_nofork.go:673 +0x27b
github.com/crossplane/upjet/pkg/controller.(*noForkAsyncExternal).Update.func1()
	github.com/crossplane/[email protected]/pkg/controller/external_async_nofork.go:161 +0x145
created by github.com/crossplane/upjet/pkg/controller.(*noForkAsyncExternal).Update in goroutine 3936
	github.com/crossplane/[email protected]/pkg/controller/external_async_nofork.go:157 +0x168

jake-ciolek avatar Feb 27 '25 08:02 jake-ciolek

I managed to reproduce this by creating 2500 SQS Queues via a composition. Then I created 500 other queues using the same composition and making it target the same downstream resource via external-name, but with different tags. It triggers reliably every 20 minutes or so.

I also replaced json-iter import with the fix mentioned by someone on Slack, but the race detector still detects the same race condition.

If it's helpful, I can share the composition I used. It's likely this can be triggered with raw MR's. This manifests when there are multiple MR's, with differing definitions, mutating tags at the same downstream resource.

jake-ciolek avatar Feb 28 '25 09:02 jake-ciolek

After a deeper-dive, the conclusion is that this is caused by the usage of unsafe reflect2 in json-iterator. See the discussions

The json-iterator fix linked above was a bit of a red herring as it was fixing a different concurrent map write error. It was also not-functional. I fixed it in my own fork in case someone wants to try it out (using it still results in the original error reported).

The race condition found in TestConnect is also unrelated (according to @erhancagirici) and an artifact of object reuse.

For posterity's sake, there is another concurrent map panic, this time in sigs.k8s.io/json (log included at the end).

I will drop this for now. If anyone wants to pick it up, feel free to do so.

fatal error: concurrent map writes

goroutine 4408 [running]: internal/runtime/maps.fatal({0x1b457f6f?, 0x10?}) runtime/panic.go:1053 +0x20 reflect.mapassign_faststr0(0x197f3020, 0xf1634?, {0x40078329b0?, 0x17723940?}, 0x4058871f08) runtime/map_swiss.go:269 +0x28 reflect.mapassign_faststr(0x17723940?, 0x4058871f08?, {0x40078329b0, 0xb}, 0x19?) reflect/value.go:3597 +0xbc reflect.Value.SetMapIndex({0x197f3020?, 0x4029178310?, 0x7b6?}, {0x17d0d500, 0x4068710770, 0x98}, {0x17723940, 0x4058871f08, 0x196}) reflect/map_swiss.go:427 +0x1c4 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).object(0x406a00a410, {0x197f3020?, 0x4029178310?, 0x0?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:909 +0x13c4 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).value(0x406a00a410, {0x197f3020?, 0x4029178310?, 0x4?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:401 +0x40 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).object(0x406a00a410, {0x1b067c80?, 0x4029178288?, 0x19c03560?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:867 +0xe90 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).value(0x406a00a410, {0x1b067c80?, 0x4029178288?, 0x0?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:401 +0x40 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).object(0x406a00a410, {0x19c03560?, 0x4029178268?, 0x1ace?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:867 +0xe90 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).value(0x406a00a410, {0x19c03560?, 0x4029178268?, 0x400836d600?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:401 +0x40 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).object(0x406a00a410, {0x1b1be020?, 0x4029178008?, 0x298a9c?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:867 +0xe90 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).value(0x406a00a410, {0x1b1be020?, 0x4029178008?, 0x2980f0?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:401 +0x40 sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).unmarshal(0x406a00a410, {0x1b1be020?, 0x4029178008?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:188 +0x100 sigs.k8s.io/json/internal/golang/encoding/json.Unmarshal({0x400836d600, 0x1623, 0x1b00}, {0x1b1be020, 0x4029178008}, {0x4008e18c50, 0x2, 0x4008e18c30?}) sigs.k8s.io/[email protected]/internal/golang/encoding/json/decode.go:113 +0x124 sigs.k8s.io/json.UnmarshalCaseSensitivePreserveInts(...) sigs.k8s.io/[email protected]/json.go:62 k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).unmarshal(0x1e719a90?, {0x1e7347e0?, 0x4029178008?}, {0x400836d600?, 0x40061174e0?, 0x12?}, {0x400836d600?, 0x7?, 0x40078320e0?}) k8s.io/[email protected]/pkg/runtime/serializer/json/json.go:258 +0x188 k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode(0x400125f4a0, {0x400836d600, 0x1623, 0x1b00}, 0x0, {0x1e7347e0, 0x4029178008}) k8s.io/[email protected]/pkg/runtime/serializer/json/json.go:206 +0x4d0 k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode({{0x1e6ac6a0?, 0x400125f4a0?}}, {0x400836d600?, 0x199?, 0x1b1be020?}, 0x4029178008?, {0x1e7347e0?, 0x4029178008?}) k8s.io/[email protected]/pkg/runtime/helper.go:256 +0x50 sigs.k8s.io/controller-runtime/pkg/client/apiutil.targetZeroingDecoder.Decode({{0x1e6a5340?, 0x4068710490?}}, {0x400836d600, 0x1623, 0x1b00}, 0x0, {0x1e7347e0, 0x4029178008}) sigs.k8s.io/[email protected]/pkg/client/apiutil/apimachinery.go:215 +0x8c k8s.io/client-go/rest.Result.Into({{0x400836d600, 0x1623, 0x1b00}, {0x0, 0x0, 0x0}, {0x4017333c20, 0x10}, {0x0, 0x0}, ...}, ...) k8s.io/[email protected]/rest/request.go:1373 +0x98 sigs.k8s.io/controller-runtime/pkg/client.(*typedClient).UpdateSubResource(0x4000de0630, {0x1e9a0c00, 0x4062c98460}, {0x1eacd238, 0x4029178008}, {0x1b385d41, 0x6}, {0x0, 0x0, 0x0?}) sigs.k8s.io/[email protected]/pkg/client/typed_client.go:247 +0x464 sigs.k8s.io/controller-runtime/pkg/client.(*subResourceClient).Update(0x4063475b48, {0x1e9a0c00, 0x4062c98460}, {0x1eacd238, 0x4029178008}, {0x0, 0x0, 0x0}) sigs.k8s.io/[email protected]/pkg/client/client.go:590 +0x1e8 github.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile(0x4001842480, {0x1e9a0b58, 0x406bee9b60}, {{{0x0, 0x0}, {0x4002bdb6e0, 0x1e}}}) github.com/crossplane/[email protected]/pkg/reconciler/managed/reconciler.go:1309 +0x7774 github.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile(0x40000d39f0, {0x1e9a0b58, 0x406bee9b60}, {{{0x0?, 0x5?}, {0x4002bdb6e0?, 0x4008e19d08?}}}) github.com/crossplane/[email protected]/pkg/ratelimiter/reconciler.go:54 +0x124 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1e9c8830?, {0x1e9a0b58?, 0x406bee9b60?}, {{{0x0?, 0xb?}, {0x4002bdb6e0?, 0x0?}}}) sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114 +0x80 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x40012b8370, {0x1e9a0b90, 0x400125f9f0}, {0x19cbfea0, 0x40072738a0}) sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311 +0x2c8 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x40012b8370, {0x1e9a0b90, 0x400125f9f0}) sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261 +0x150 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2() sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222 +0x70 created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 3747 sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218 +0x3b4

jake-ciolek avatar Mar 07 '25 13:03 jake-ciolek

fatal error: concurrent map writes ... sigs.k8s.io/controller-runtime/pkg/client.(*subResourceClient).Update(...) sigs.k8s.io/[email protected]/pkg/client/client.go:590 +0x1e8 github.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile(...) github.com/crossplane/[email protected]/pkg/reconciler/managed/reconciler.go:1309 +0x7774

This is at the bottom of Reconcile(..):

   // We've successfully updated our external resource. Per the below issue
   r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

I suspect async upjet and "sync" crossplane-runtime works on the same representation of the managed resource in memory here.

This might mean it's related to https://github.com/crossplane/crossplane-runtime/issues/790

chlunde avatar Mar 07 '25 13:03 chlunde