Cache head node ip address

Open suquark opened this issue 2 years ago • 2 comments

The original code tried to reuse a cached head node IP address, but this does not work well when multiple processes access the same node. Reusing it can also be dangerous: the head node may already be down, in which case the cached IP address is invalid, and get_node_ips does not check for this.

This PR stores the head node IP address in the database, in the right place (so multiple processes can share it once the head node IP is updated), and performs a quick liveness check on the head node when reading the cached IP.
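To make the idea concrete, here is a minimal sketch, not the actual code in this PR. It assumes a SQLite table named `clusters`, a TCP probe against the SSH port as the "quick check", and a caller-supplied `query_fn` that asks the cloud provider for the current IP. All of these names are hypothetical.

```python
import socket
import sqlite3
from typing import Callable


def is_alive(ip: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Quick liveness probe: try to open a TCP connection to the SSH port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per cluster, holding the cached head IP.
    conn.execute('CREATE TABLE IF NOT EXISTS clusters '
                 '(name TEXT PRIMARY KEY, head_ip TEXT)')
    conn.commit()


def set_head_ip(conn: sqlite3.Connection, cluster: str, ip: str) -> None:
    # Writing to the database lets other processes see the update.
    conn.execute('INSERT OR REPLACE INTO clusters (name, head_ip) '
                 'VALUES (?, ?)', (cluster, ip))
    conn.commit()


def get_head_ip(conn: sqlite3.Connection,
                cluster: str,
                query_fn: Callable[[str], str],
                probe: Callable[[str], bool] = is_alive) -> str:
    """Return the cached head IP if it still responds, else refresh it."""
    row = conn.execute('SELECT head_ip FROM clusters WHERE name = ?',
                       (cluster,)).fetchone()
    cached = row[0] if row else None
    if cached is not None and probe(cached):
        return cached
    # Cache miss or stale entry: query the provider and store the result.
    fresh = query_fn(cluster)
    set_head_ip(conn, cluster, fresh)
    return fresh
```

Passing `probe` as a parameter keeps the liveness check swappable, for example for a cheaper check or a stub in unit tests.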

Tests

  • [x] Milestone "100 jobs"
  • [x] Smoke tests
  • [x] Unit test

suquark, May 18 '22 21:05

Accidentally requested again... so I converted it to a draft.

suquark, May 25 '22 22:05

With #911, we added a per-cluster status lock for any cluster status cache update. Maybe we can use that lock to get rid of the race condition here?
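A rough sketch of what that could look like, assuming a file-based exclusive lock per cluster. The lock path, the helper names, and the use of `fcntl.flock` (Unix only) are illustrative assumptions and not necessarily what #911 does. The `cache` dict stands in for the database.

```python
import contextlib
import fcntl
import os
import tempfile
from typing import Callable, Dict

# Hypothetical location for per-cluster lock files.
LOCK_DIR = tempfile.gettempdir()


@contextlib.contextmanager
def cluster_status_lock(cluster_name: str):
    """Hold an exclusive, cross-process lock for one cluster."""
    path = os.path.join(LOCK_DIR, f'.{cluster_name}.lock')
    with open(path, 'w') as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def refresh_head_ip(cache: Dict[str, str], cluster_name: str,
                    query_fn: Callable[[str], str]) -> str:
    """Update the cached head IP while holding the cluster's lock."""
    with cluster_status_lock(cluster_name):
        ip = query_fn(cluster_name)
        cache[cluster_name] = ip
        return ip
```

Because the lock is keyed by cluster name, updates to different clusters do not block each other.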

Michaelvll, Jun 14 '22 04:06