calico-kube-controllers fix concurrent map writes issue
Description
Add a mutex lock to the `Set` function to prevent a race condition that causes the process to panic with `fatal error: concurrent map read and map write`.
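For illustration, here is a minimal sketch of the kind of change being described: a map-backed cache whose `Set` (and `Get`) methods take a mutex. The `ObjectCache` type and method names are hypothetical, not the actual calico-kube-controllers code.

```go
// Hypothetical sketch of a mutex-guarded cache; not the actual
// calico-kube-controllers types, just an illustration of the fix shape.
package cache

import "sync"

type ObjectCache struct {
	mu      sync.Mutex
	objects map[string]interface{}
}

func NewObjectCache() *ObjectCache {
	return &ObjectCache{objects: make(map[string]interface{})}
}

// Set stores value under key. Taking the mutex serialises all map writes,
// avoiding Go's "concurrent map read and map write" fatal error.
func (c *ObjectCache) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[key] = value
}

// Get takes the same lock: guarding only Set would still leave
// unsynchronised reads racing with writes.
func (c *ObjectCache) Get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.objects[key]
	return v, ok
}
```

Note that the lock only helps if every goroutine touching the map goes through methods that take it; a reader that bypasses the lock (for example, code handing the map to `reflect.DeepEqual`) still races.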
Related issues/PRs
fixes #8705
Todos
- [ ] Tests
- [ ] Documentation
- [ ] Release note
Release Note
TBD
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one `docs-*` label.

- `docs-pr-required`: This change requires a change to the documentation that has not been completed yet.
- `docs-completed`: This change has all necessary documentation completed.
- `docs-not-required`: This change has no user-facing impact and requires no docs.

Every PR needs one `release-note-*` label.

- `release-note-required`: This PR has user-facing changes. Most PRs should have this label.
- `release-note-not-required`: This PR has no user-facing changes.

Other optional labels:

- `cherry-pick-candidate`: This PR should be cherry-picked to an earlier release. For bug fixes only.
- `needs-operator-pr`: This PR is related to install and requires a corresponding change to the operator.
Deploy Preview for calico-v3-25 canceled.

| Name | Link |
|---|---|
| Latest commit | a726a60bbd90c3ff9f00b9fcecc0423855d61fa3 |
| Latest deploy log | https://app.netlify.com/sites/calico-v3-25/deploys/66158e9e3486240008f42c37 |
/sem-approve
Thanks for the PR.
Could you show where the conflicting Write is happening on the resource?
If I understand the issue & panic correctly, the panic is generated by the `reflect` package performing a read, during a concurrent write, on the `WorkloadEndpointData` item (particularly the struct's `labels` map, I think?).
If the concurrent write is also occurring in the same `Set` method, your lock should work.
Do we know what component is writing to that `labels` map while `reflect` is reading it? And can we confidently say that synchronising calls to the `Set` method removes that concurrency?
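For context, this is the class of failure being described: a small stand-alone program (not Calico code, and not the real `WorkloadEndpointData` type) where one goroutine mutates a labels map while `reflect.DeepEqual` reads it. Run for long enough, or under the race detector, it produces the same `fatal error: concurrent map read and map write`.

```go
// Stand-alone illustration of the panic class discussed above; the types
// here are made up and are not Calico's WorkloadEndpointData.
package main

import "reflect"

type endpointData struct {
	Labels map[string]string
}

func main() {
	a := endpointData{Labels: map[string]string{"app": "web"}}
	b := endpointData{Labels: map[string]string{"app": "web"}}

	// Writer: mutates the map with no synchronisation.
	go func() {
		for {
			a.Labels["rev"] = "x"
		}
	}()

	// Reader: reflect.DeepEqual walks a.Labels, performing map reads that
	// race with the writer; the runtime eventually throws
	// "fatal error: concurrent map read and map write".
	for {
		_ = reflect.DeepEqual(a, b)
	}
}
```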
On a side note, could you submit the fix as a PR to the master branch rather than directly to the release branch, to ensure we don't regress in future releases? We can then backport the master patch to 3.26.
@zamog I think we'd be open to adding any necessary locking here, but we need to understand how the concurrent writes happen: adding locks without understanding why is not generally safe. If you can come up with a UT or a description of how the concurrent writes happen, then we can make a fix.
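One way such a UT could look: hammer `Set` and a reader from several goroutines and run the package with `go test -race`. This sketch reuses the hypothetical `ObjectCache` from the Description above, so the names are illustrative rather than the real calico-kube-controllers test code.

```go
// Hypothetical race test for the sketched ObjectCache above; without the
// mutex in Set/Get, `go test -race` reports a data race here.
package cache

import (
	"sync"
	"testing"
)

func TestSetIsSafeForConcurrentUse(t *testing.T) {
	c := NewObjectCache()
	var wg sync.WaitGroup

	for i := 0; i < 8; i++ {
		wg.Add(2)
		// Writer goroutine.
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Set("key", j)
			}
		}()
		// Reader goroutine.
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Get("key")
			}
		}()
	}
	wg.Wait()
}
```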