skypilot Cache head node ip address

Open suquark opened this issue 2 years ago • 2 comments

The original code tried to reuse a cached head node IP address, but this does not work well when multiple processes access the same node. Reusing it could also be dangerous: the head node may already be down, in which case the cached IP address is invalid, and get_node_ips does not check for this.

This PR stores the head node IP address in the database, in the right place, so that multiple processes can share it once the head node IP is updated. It also performs a quick liveness check on the head node when reading the cached IP.
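For readers who want a concrete picture of the approach, here is a minimal sketch, not the code in this PR. It assumes a SQLite table named `clusters` and a TCP connection to the SSH port as the liveness probe; `init_db`, `set_head_ip`, `is_alive`, and `get_cached_head_ip` are hypothetical names chosen for illustration.

```python
import socket
import sqlite3
from typing import Optional


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per cluster, holding the cached head IP.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clusters ("
        "name TEXT PRIMARY KEY, head_ip TEXT)")
    conn.commit()


def set_head_ip(conn: sqlite3.Connection, cluster: str, ip: str) -> None:
    # Writing to the shared database lets other processes see the update.
    conn.execute(
        "INSERT OR REPLACE INTO clusters (name, head_ip) VALUES (?, ?)",
        (cluster, ip))
    conn.commit()


def is_alive(ip: str, port: int = 22, timeout: float = 1.0) -> bool:
    # Assumed probe: the node counts as alive if a TCP connection succeeds.
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def get_cached_head_ip(conn: sqlite3.Connection, cluster: str,
                       port: int = 22) -> Optional[str]:
    row = conn.execute(
        "SELECT head_ip FROM clusters WHERE name = ?", (cluster,)).fetchone()
    if row is None or row[0] is None:
        return None
    ip = row[0]
    if not is_alive(ip, port):
        # The cached address is stale: clear it so callers re-query the cloud.
        conn.execute(
            "UPDATE clusters SET head_ip = NULL WHERE name = ?", (cluster,))
        conn.commit()
        return None
    return ip
```

A caller that gets `None` back would fall through to the slower path of querying the provider for the current IP and then call `set_head_ip` with the result.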

Tests

[x] Milestone "100 jobs"
[x] Smoke tests
[x] Unit test

May 18 '22 21:05 suquark

Accidentally requested again... so I converted it to a draft.

May 25 '22 22:05 suquark

With #911, we added a per-cluster status lock for any cluster status cache update. Maybe we can use that lock to get rid of the race condition here?
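As a rough illustration of that idea, the sketch below guards a cache update with a per-cluster file lock. It is not the lock from #911: the lock directory, the use of `fcntl.flock` (POSIX only), and the names `cluster_status_lock` and `refresh_head_ip` are all assumptions made for the example, and a plain dict stands in for the real cache.

```python
import contextlib
import fcntl
import os
import tempfile
from typing import Callable, Dict, Iterator

# Hypothetical location for the per-cluster lock files.
LOCK_DIR = os.path.join(tempfile.gettempdir(), "cluster_status_locks")


@contextlib.contextmanager
def cluster_status_lock(cluster: str) -> Iterator[None]:
    # One lock file per cluster, so different clusters do not block each other.
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, f"{cluster}.lock")
    with open(path, "w") as lock_file:
        # Blocks until no other process (or open file) holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def refresh_head_ip(cache: Dict[str, str], cluster: str,
                    query_ip: Callable[[], str]) -> str:
    # Query and write back under the lock, so two processes cannot
    # interleave their updates for the same cluster.
    with cluster_status_lock(cluster):
        ip = query_ip()
        cache[cluster] = ip
        return ip
```

Holding the lock across both the query and the write is the point of the design: a slower process cannot overwrite a newer IP with one it fetched earlier.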

Jun 14 '22 04:06 Michaelvll
