hail
[batch] Batch charges for private instance creation that fails with exhausted resource errors.
What happened?
Due to limited GPU availability, it is common for GPU private jobs (especially preemptible ones) to fail multiple times with exhausted resource errors before obtaining a VM. When this happens, Batch still charges for the failed attempt. An example is batch 8166586, job 1, attempt ZMkGaS, instance ID batch-worker-default-job-private-u4fxc, which failed with ZONE_RESOURCE_POOL_EXHAUSTED.
Version
SaaS
Relevant log output
No response