gubernator icon indicating copy to clipboard operation
gubernator copied to clipboard

fix mutex deadlocks in PeerClient

Open miparnisari opened this issue 1 year ago • 2 comments

Description

Closes https://github.com/mailgun/gubernator/issues/222

Testing

[20/02/24 10:26:18] ~/GitHub/gubernator (fix-peer-client) $ go test ./..
. -run=TestPeerClientShutdown -count=20 -timeout=20s -race   

?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck        [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli     [no test files]
ok      github.com/mailgun/gubernator/v2        13.354s
ok      github.com/mailgun/gubernator/v2/cluster        2.740s [no tests to run]
[20/02/24 10:26:42] ~/GitHub/gubernator (fix-peer-client) $ 

miparnisari avatar Feb 20 '24 03:02 miparnisari

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

thrawn01 avatar Feb 21 '24 03:02 thrawn01

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

The deadlock is proven by the flaky test. (If you run go test with -count=100 you will see it)

Sometimes I get "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel

And sometimes the test times out, it's because one of the lines of code never finishes acquiring the mutex. I forget which line it was

miparnisari avatar Feb 21 '24 03:02 miparnisari

@miparnisari Is this PR ready for merge? Not concerned about the go-bench error. But the PR is still in draft mode.

Baliedge avatar Feb 23 '24 14:02 Baliedge

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck. When I run it in isolation, it passes:

[26/02/24 12:36:35] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestLeakyBucketDivBug -race -count=1 -p=1
ok      github.com/mailgun/gubernator/v2        2.411s
ok      github.com/mailgun/gubernator/v2/cluster        2.014s [no tests to run]
?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli     [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck        [no test files]
[26/02/24 12:36:47] ~/GitHub/gubernator (fix-peer-client) $ 

I have the suspicion that TestHealthCheck is setting lastErr and TestLeakyBucketDivBug is regurgitating that.

miparnisari avatar Feb 26 '24 15:02 miparnisari

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck.

Since TestHealthCheck causes a server side shutdown and restart, I wonder if it's not ready in time for the next test.

Baliedge avatar Feb 26 '24 19:02 Baliedge

@Baliedge make test still fails locally 😭

image

miparnisari avatar Feb 29 '24 04:02 miparnisari

@miparnisari I noticed after merging. It does work on retry. If you have an idea how to make it better, let's try it.

Baliedge avatar Feb 29 '24 14:02 Baliedge