TPU: investigate whether on-demand TPUs get killed occasionally

Open • concretevitamin opened this issue 2 years ago • 1 comment

A TPU user mentioned that even an on-demand TPU gets killed at some point within every 2 days, and that no logs explaining the reason can be found. Quote: "Before deadline, I have to get up in the midnight to see if my job being killed, due to the failure of TPU. It would be good thing to look at for the spot feature."

We should first investigate this issue (e.g., using a low-cost sky tpunode --tpus tpu-v2-8) and then see how we can solve it.

concretevitamin, Jun 08 '22 00:06

To reproduce, I launched five TPU VMs and ran a BERT training job for over a week. So far all of them look normal. We probably need more input from users on this.

infwinston, Aug 16 '22 22:08