skypilot
[WIP][Azure] Fix azure race condition
This PR fixes some race conditions mentioned in #1408. However, these race conditions should be pretty rare, so I am not sure they are the root cause. From my understanding, the errors are more likely caused by operating on the same cluster at the same time, but I cannot find related code in the test. I can no longer reproduce any of the errors in #1408 after this fix.
Closes #1408
Thanks for the quick fix @suquark! Did you try tests/run_smoke_tests.sh test_azure_start_stop? The error seems to happen pretty often for me with that test.
I ran test_azure_start_stop and test_cancel_azure at the same time and could not reproduce the error. Also, do you mean you can trigger the error with just a single instance of test_azure_start_stop?
Yeah, it seems that I can trigger it with one test. Let me try the latest fix to see if that helps.
Btw, we already have @synchronized on _get_filtered_nodes. Will the current fix cause a deadlock?
No, because it uses an RLock, which the same thread can acquire again without blocking.
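To illustrate the point about RLock, here is a minimal sketch, not the actual provider code. The `synchronized` decorator body, the `NodeProvider` class, and the `running_nodes` method are hypothetical stand-ins. Only the names `@synchronized` and `_get_filtered_nodes` come from this thread, and their real implementations may differ.

```python
import functools
import threading


def synchronized(method):
    """Hypothetical decorator: run the method while holding self.lock."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self.lock:
            return method(self, *args, **kwargs)
    return wrapper


class NodeProvider:
    """Hypothetical stand-in for a node provider with a shared lock."""

    def __init__(self):
        # RLock is re-entrant: the owning thread may acquire it repeatedly.
        self.lock = threading.RLock()
        self.nodes = {"node-1": "running", "node-2": "stopped"}

    @synchronized
    def _get_filtered_nodes(self, state):
        return [name for name, s in self.nodes.items() if s == state]

    @synchronized
    def running_nodes(self):
        # Nested acquisition: the lock is already held by this thread here,
        # and _get_filtered_nodes acquires it a second time.
        return self._get_filtered_nodes("running")


provider = NodeProvider()
result = provider.running_nodes()  # completes without deadlock
print(result)

# For contrast, a plain Lock cannot be re-acquired by its holder.
plain = threading.Lock()
with plain:
    reacquired = plain.acquire(blocking=False)
print(reacquired)
```

With a plain `threading.Lock` in place of the `RLock`, the nested call in `running_nodes` would block forever, which is the deadlock the question above is asking about.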
I just tried this fix, and it seems the problem still exists. It appears to happen when the test tries to sky start a cluster that was just stopped with sky stop.
Good to know. Let me keep digging.
This issue has not been observed recently. Closing this for now. Please feel free to re-open it if it occurs again.