skypilot
Cache head node IP address
The original code tried to reuse a cached head node IP address, but this does not work well when multiple processes access the same node. Reusing the cached value can also be dangerous: the head node may already be down, in which case the IP address is invalid, and this is not checked in `get_node_ips`.
This PR stores the head node IP address in the database, so multiple processes can share it once the head node IP is updated, and it performs a quick liveness check on the head node when reading the cached IP.
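A minimal sketch of the idea described above, not the code in this PR: the cached IP lives in a shared database, and it is only returned if a quick liveness probe succeeds. The `HeadIpCache` class, the `clusters` table, and the `is_reachable` helper are all hypothetical names chosen for illustration, and the TCP probe on port 22 is an assumption about what a "quick check" could look like.

```python
import socket
import sqlite3
from typing import Callable, Optional


def is_reachable(ip: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Hypothetical liveness probe: try to open a TCP connection."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


class HeadIpCache:
    """Hypothetical DB-backed cache of the head node IP per cluster."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            'CREATE TABLE IF NOT EXISTS clusters ('
            'name TEXT PRIMARY KEY, head_ip TEXT)')
        self._conn.commit()

    def set_head_ip(self, cluster: str, ip: Optional[str]) -> None:
        # Writing to the database makes the update visible to every
        # process that opens the same database file.
        self._conn.execute(
            'INSERT OR REPLACE INTO clusters (name, head_ip) VALUES (?, ?)',
            (cluster, ip))
        self._conn.commit()

    def get_head_ip(
        self,
        cluster: str,
        query_fresh_ip: Callable[[], str],
        check_alive: Callable[[str], bool] = is_reachable,
    ) -> str:
        row = self._conn.execute(
            'SELECT head_ip FROM clusters WHERE name = ?',
            (cluster,)).fetchone()
        cached = row[0] if row else None
        # Only trust the cached IP if the head node still responds.
        if cached is not None and check_alive(cached):
            return cached
        # Cache miss or dead node: fall back to the slow query and
        # store the result for other processes.
        fresh = query_fresh_ip()
        self.set_head_ip(cluster, fresh)
        return fresh
```

Here `query_fresh_ip` stands in for the slow path that asks the cloud provider for the current IP. Passing `check_alive` as a parameter keeps the probe swappable, for example with a cheaper or stricter check.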
Tests
- [x] Milestone "100 jobs"
- [x] Smoke tests
- [x] Unit test
I accidentally requested a review again, so I converted this PR to a draft.
With #911, we added a per-cluster status lock for any cluster status cache update. Maybe we can use that lock to get rid of the race condition here?
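To make the suggestion concrete, here is a rough sketch of what guarding the cached IP update with a per-cluster lock could look like. This is not the lock from #911: the `cluster_status_lock` helper and the lock directory are hypothetical, and it uses `fcntl.flock` from the standard library (Unix only) as a stand-in for whatever locking mechanism the codebase actually uses.

```python
import contextlib
import fcntl
import os
import tempfile

# Hypothetical location for the per-cluster lock files.
LOCK_DIR = os.path.join(tempfile.gettempdir(), 'cluster_status_locks')


@contextlib.contextmanager
def cluster_status_lock(cluster_name: str, lock_dir: str = LOCK_DIR):
    """Hold an exclusive, cross-process lock for one cluster."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f'{cluster_name}.lock')
    with open(path, 'w') as lock_file:
        # Blocks until no other process holds this cluster's lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A caller would wrap the read-check-update sequence for the head IP in `with cluster_status_lock(name):`, so two processes cannot interleave their updates for the same cluster while different clusters stay independent.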