kiali
kiali copied to clipboard
unexpected ci error in TestClientExpiry
See: https://github.com/kiali/kiali/runs/5680434347?check_suite_focus=true
the changes of the PR should not cause this CI failed.
I've seen this failure before - we never found out why it happens. There might even already be a github issue on this.
https://github.com/kiali/kiali/pull/2968#pullrequestreview-443706509
Ops, I think I know what is tricky here:
the mutex
is a global lock:
https://github.com/kiali/kiali/blob/master/kubernetes/client_factory.go#L19
In TestClientExpiry
:
https://github.com/kiali/kiali/blob/master/kubernetes/client_factory_test.go#L14
After executing getClientFactory
, goroutines start to run watchClients
and they are rapidly using the global mutex
to lock and unlock.
https://github.com/kiali/kiali/blob/master/kubernetes/client_factory.go#L181.
But in test case, it is also using the mutex
global lock, some time, it will lead chaos, like lock failed or conflicts, which will make the delete expired client failed:
https://github.com/kiali/kiali/blob/master/kubernetes/client_factory.go#L184
That is why we see:
client_factory_test.go:57:
Error Trace: client_factory_test.go:57
Error: Not equal:
expected: 1
actual : 2
Test: TestClientExpiry
client_factory_test.go:59:
Error Trace: client_factory_test.go:59
Error: Should be false
Test: TestClientExpiry
Because the first test expired foo
client is not properly delete from clientEntries.
In an elegant way, we should use channel to tell goroutine to recycle this client, not for sleep and loop to read the cliententries.
But in test case, it is also using the mutex global lock, some time, it will lead chaos, like lock failed or conflicts, which will make the delete expired client failed
I've commented in the PR, perhaps the strategy for this work could be to improve the test, but moving a global lock to a channel because a test shouldn't be the first option considered.
(Global locks are perfectly correct resources in Golang and other languages to deal with concurrency scenarios).
Hi @lucasponce thanks for commenting, I have replied in the Pr.
This continues to plague CI and should be added to the backlog.
You'll hit this much more often locally when running the tests with the -race
flag: make test-race
.