calico-kube-controllers fix concurrent map writes issue

Open zamog opened this issue 10 months ago • 4 comments

Description

Add a mutex lock to the Set function to prevent a race condition that crashes the process with "fatal error: concurrent map read and map write".
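As a rough illustration of the kind of change described here, the sketch below guards a compare-and-store Set with a sync.Mutex. The resourceCache type, its fields, and the Len helper are hypothetical names chosen for this example; they are not the actual calico-kube-controllers code.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// resourceCache is a hypothetical stand-in for the controller's cache.
// The names are illustrative only, not the real Calico types.
type resourceCache struct {
	mu    sync.Mutex
	items map[string]interface{}
}

func newResourceCache() *resourceCache {
	return &resourceCache{items: make(map[string]interface{})}
}

// Set stores value under key and reports whether the stored value changed.
// The mutex serialises the compare-and-store, so two concurrent Set calls
// never touch the underlying items map at the same time.
func (c *resourceCache) Set(key string, value interface{}) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.items[key]; ok && reflect.DeepEqual(existing, value) {
		return false
	}
	c.items[key] = value
	return true
}

// Len returns the number of cached items.
func (c *resourceCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.items)
}

func main() {
	c := newResourceCache()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each call builds its own labels map, so nothing is shared
			// between goroutines except the cache itself.
			c.Set(fmt.Sprintf("wep-%d", i%10), map[string]string{"app": "demo"})
		}(i)
	}
	wg.Wait()
	fmt.Println(c.Len()) // prints 10
}
```

Note that this lock only protects the cache's own map and the comparison done inside Set. It does not protect a map held inside a cached value if some other goroutine writes to that map outside of Set.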

Related issues/PRs

Fixes #8705

Todos

  • [ ] Tests
  • [ ] Documentation
  • [ ] Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

zamog avatar Apr 09 '24 18:04 zamog

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Apr 09 '24 18:04 CLAassistant

Deploy Preview for calico-v3-25 canceled.

Latest commit: a726a60bbd90c3ff9f00b9fcecc0423855d61fa3
Latest deploy log: https://app.netlify.com/sites/calico-v3-25/deploys/66158e9e3486240008f42c37

netlify[bot] avatar Apr 09 '24 18:04 netlify[bot]

/sem-approve

lwr20 avatar Apr 10 '24 08:04 lwr20

Thanks for the PR.

Could you show where the conflicting write to the resource is happening?

If I understand the issue and the panic correctly, the panic is raised when the reflect package reads the WorkloadEndpointData item (in particular the struct's labels map, I think) while something else is writing to it concurrently.

If the concurrent write is also occurring in the same Set method, your lock should work.

Do we know what component is writing to that labels map while reflect is reading it? And can we confidently say that synchronising calls of the Set method removes that concurrency?
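To make the question concrete, here is a minimal sketch of one possible scenario, offered as an assumption rather than a diagnosis of the actual bug: if the cache stores a value that shares its labels map with the caller, a lock inside Set cannot stop the caller from mutating that map later while another Set call is comparing against it. Storing a private copy removes that sharing. The endpointData and safeCache types below are hypothetical simplifications, not the real WorkloadEndpointData or cache types.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// endpointData is a hypothetical, simplified stand-in for
// WorkloadEndpointData; only the labels map matters here.
type endpointData struct {
	Name   string
	Labels map[string]string
}

// copyOf returns a copy whose Labels map is independent of the original.
func copyOf(d endpointData) endpointData {
	labels := make(map[string]string, len(d.Labels))
	for k, v := range d.Labels {
		labels[k] = v
	}
	return endpointData{Name: d.Name, Labels: labels}
}

// safeCache is an illustrative cache that never shares maps with callers.
type safeCache struct {
	mu    sync.Mutex
	items map[string]endpointData
}

// Set stores a private copy of d. The caller must not write to d.Labels
// while Set is running, but may freely mutate it after Set returns.
func (c *safeCache) Set(key string, d endpointData) bool {
	own := copyOf(d)
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.items[key]; ok && reflect.DeepEqual(existing, own) {
		return false
	}
	c.items[key] = own
	return true
}

// Get returns a copy, so callers cannot mutate the cached labels map.
func (c *safeCache) Get(key string) (endpointData, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	d, ok := c.items[key]
	if !ok {
		return endpointData{}, false
	}
	return copyOf(d), true
}

func main() {
	c := &safeCache{items: make(map[string]endpointData)}
	d := endpointData{Name: "wep-1", Labels: map[string]string{"app": "demo"}}
	c.Set("wep-1", d)

	// Mutating the caller's map after Set does not reach the cached value.
	d.Labels["app"] = "changed"

	got, _ := c.Get("wep-1")
	fmt.Println(got.Labels["app"]) // prints demo
}
```

Whether copying is the right fix depends on where the concurrent write really originates. If the writer mutates the map while Set is still reading it, the sharing has to be removed at the writer's side instead.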

On a side note, can you submit the fix as a PR against the master branch rather than directly against the release branch, to ensure we don't regress in future releases? We can then backport the master patch to 3.26.

aaaaaaaalex avatar Apr 10 '24 11:04 aaaaaaaalex

@zamog I think we'd be open to adding any necessary locking here, but we need to understand how the concurrent writes happen: adding locks without understanding why they are needed is not generally safe. If you can come up with a unit test or a description of how the concurrent writes happen, then we can make a fix.

matthewdupre avatar Jun 28 '24 17:06 matthewdupre