
pytests_test_gpu(0) will fail if allocated a non-4 gpu server - add guard/skip?

Open lessw2020 opened this issue 3 years ago • 0 comments

While running the pytests for a recent PR, I was allocated a 3-GPU server rather than a 4-GPU one (presumably a bad GPU on a 4-GPU server, but it is unclear whether this is a new allocation option).

[image: 3_gpu]

This odd GPU count causes the current block of pytests_gpu(0) tests to fail, because the device mesh attempts to reshape into a [2,2] block, which requires exactly 4 devices and so isn't possible with 3 GPUs.
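To make the failure mode concrete, here is a minimal pure-Python sketch of the reshape constraint. `build_mesh` is a hypothetical helper written for illustration only; it is not the function the tests actually call, and the real error comes from the tensor reshape shown in the screenshots.

```python
def build_mesh(num_devices, shape):
    """Arrange device ranks 0..num_devices-1 into a 2-D grid (hypothetical helper)."""
    rows, cols = shape
    # A [rows, cols] mesh needs exactly rows * cols devices.
    if rows * cols != num_devices:
        raise ValueError(
            f"cannot reshape {num_devices} devices into mesh {list(shape)}"
        )
    ranks = list(range(num_devices))
    # Slice the flat rank list into `rows` rows of `cols` ranks each.
    return [ranks[r * cols:(r + 1) * cols] for r in range(rows)]


print(build_mesh(4, (2, 2)))  # [[0, 1], [2, 3]]

try:
    build_mesh(3, (2, 2))
except ValueError as err:
    print(err)  # cannot reshape 3 devices into mesh [2, 2]
```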

i.e. error:

[image: tau_reshape_error]

b/c of:

[image: gpu_fail_mesh]

This issue is to track potentially adding an automatic world-size check that skips the tests if the run is allocated an unexpected config (e.g. 3 GPUs), or else errors out with an informative message rather than producing a series of failing tests.
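One possible shape for such a guard is sketched below. It is only an illustration under stated assumptions: `require_mesh_devices` and its `strict` flag are hypothetical names, not existing test utilities, and the available device count is passed in by the caller (in a real test it would presumably come from something like `torch.cuda.device_count()`). The sketch raises `unittest.SkipTest`, which pytest should also report as a skip.

```python
import unittest
from math import prod


def require_mesh_devices(available, mesh_shape, strict=False):
    """Skip (or fail loudly) when the device count does not fit the mesh.

    Hypothetical helper: `available` is the number of visible GPUs and
    `mesh_shape` is the mesh the tests want to build, e.g. (2, 2).
    """
    required = prod(mesh_shape)
    if available == required:
        return
    msg = (
        f"tests need {required} GPUs for a {list(mesh_shape)} device mesh, "
        f"but {available} were allocated"
    )
    if strict:
        # One informative error instead of a series of reshape failures.
        raise RuntimeError(msg)
    raise unittest.SkipTest(msg)


# Matching config: the guard is a no-op.
require_mesh_devices(4, (2, 2))

# Unexpected config (the 3-GPU allocation from this issue): the test is skipped.
try:
    require_mesh_devices(3, (2, 2))
except unittest.SkipTest as skip:
    print(f"skipped: {skip}")
```

Calling the guard at the top of each GPU test (or from a shared setup step) would turn the mismatch into a single skip reason. The exact-match check (`==`) is an assumption; a `>=` check may be more appropriate if the tests can run on a subset of a larger machine.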

lessw2020 · Nov 15 '22 14:11