skypilot
TPU: investigate whether on-demand TPUs get killed occasionally
A TPU user reported that even on-demand TPUs get killed at some point within every 2 days, and that no logs explaining the reason can be found. Quote: "Before deadline, I have to get up in the midnight to see if my job being killed, due to the failure of TPU. It would be good thing to look at for the spot feature."
We should first investigate this issue (e.g., with a low-cost `sky tpunode --tpus tpu-v2-8`) and see how we can solve this problem.
To reproduce, I launched five TPU VMs and ran a BERT training job on them for over a week. So far all of them look normal, so I could not reproduce the kills. We probably need more input from users on this.
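To make a long-running reproduction easier to audit, one option is a small watcher that polls each TPU VM's state and logs every change with a timestamp. The following is only a sketch under stated assumptions: the TPU names and zone are hypothetical placeholders, it assumes the `gcloud` CLI is installed and authenticated, and it queries TPU VMs through `gcloud compute tpus tpu-vm describe` (a TPU node would need the corresponding non-VM command). It is not part of SkyPilot.

```python
import subprocess
import time
from datetime import datetime, timezone

# Hypothetical placeholders: replace with the TPU VMs under test.
TPU_NAMES = ["tpu-test-1", "tpu-test-2"]
ZONE = "us-central1-b"
POLL_SECONDS = 60


def get_state(name, zone):
    """Return the TPU VM state reported by gcloud, or 'UNKNOWN' on failure."""
    result = subprocess.run(
        ["gcloud", "compute", "tpus", "tpu-vm", "describe", name,
         "--zone", zone, "--format=value(state)"],
        capture_output=True, text=True)
    if result.returncode != 0:
        # The describe call can fail if the TPU was deleted or the API is down.
        return "UNKNOWN"
    return result.stdout.strip() or "UNKNOWN"


def detect_transitions(previous, current):
    """Return (name, old_state, new_state) for each TPU whose state changed."""
    return [(name, previous.get(name), state)
            for name, state in current.items()
            if previous.get(name) != state]


def watch(names, zone, poll_seconds, fetch=get_state):
    """Poll forever and print one line per observed state change."""
    previous = {}
    while True:
        current = {name: fetch(name, zone) for name in names}
        now = datetime.now(timezone.utc).isoformat()
        for name, old, new in detect_transitions(previous, current):
            print(f"{now} {name}: {old} -> {new}", flush=True)
        previous = current
        time.sleep(poll_seconds)
```

Calling `watch(TPU_NAMES, ZONE, POLL_SECONDS)` with its output redirected to a file would give a timestamped record of any move away from `READY` (for example to `PREEMPTED` or `TERMINATED`), which could then be compared against the times users report their jobs dying.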