kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-14225: Fix deadlock caused by lazy val exemptSensor

Open hshi2022 opened this issue 2 years ago • 2 comments

There is a chance to cause deadlock when multiple threads access ClientRequestQuotaManager. In the version Scala 2.12, the lazy val initialization is under the object lock. The deadlock could happen in the following condition:

In thread a, when ClientRequestQuotaManager.exemptSensor is being initialized, it has acquired the object lock and enters the the actual initialization block. The initialization of 'exemptSensor' requires another lock private val lock = new ReentrantReadWriteLock() and it is waiting for this lock.

In thread b, at the same time, ClientQuotaManager.updateQuota() is called and it has already acquired ReentrantReadWriteLock lock by calling lock.writeLock().lock(). And then it executes info(). If this is the first time accessing Logging.logger, which is also a lazy val, it need to wait for the object lock.

The deadlock happens.

Since the lazy val initialization is under the object lock, we should avoid using lazy val if the initialization function holds another lock to prevent holding two locks at the same time which is prone for deadlock. Change to create exemptSensor during ClientRequestQuotaManager initialization with an expiration time of Long.MaxValue to prevent expiration if request quota is not enabled at that time.

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

hshi2022 avatar Sep 13 '22 18:09 hshi2022

rerun tests please

gitlw avatar Sep 14 '22 17:09 gitlw

@dajac @rajinisivaram could you please review this PR? Thanks! (And I am wondering whether you know how to rerun the checks/tests without submitting any changes. This PR currently fails a test, which I do not think is related to the PR's change.)

hshi2022 avatar Sep 15 '22 18:09 hshi2022

@hshi2022 Thanks for the patch. Did you catch this deadlock in practice or did you use some tool to find it?

hachikuji avatar Sep 26 '22 20:09 hachikuji

@hshi2022 Thanks for the patch. Did you catch this deadlock in practice or did you use some tool to find it?

We caught in practice.

hshi2022 avatar Sep 26 '22 21:09 hshi2022

@hachikuji Thanks for review. I have addressed your comments. And I am wondering whether you know how to rerun the checks/tests without submitting any changes. This PR currently fails a test, which I do not think is related to the PR's change.)

hshi2022 avatar Sep 28 '22 21:09 hshi2022

@hshi2022 Yeah, we are often fighting some flaky tests. It should not block merging if they are unrelated. Can you take a look at the response above https://github.com/apache/kafka/pull/12634#discussion_r982935129? Thanks!

hachikuji avatar Sep 30 '22 17:09 hachikuji

@hshi2022 Yeah, we are often fighting some flaky tests. It should not block merging if they are unrelated. Can you take a look at the response above #12634 (comment)? Thanks!

Replied under that comment.

hshi2022 avatar Sep 30 '22 20:09 hshi2022

@hachikuji Do you have any more questions/comments regarding this PR?

hshi2022 avatar Oct 05 '22 18:10 hshi2022

Kicked off one more build since it's been a little while. I will merge assuming nothing wrong with the build. Thanks for the patch!

hachikuji avatar Oct 10 '22 23:10 hachikuji