
Flaky chapter_computational-performance auto-parallelism.md

AnirudhDagar opened this issue 3 years ago · 1 comment

The auto-parallelism section sometimes fails on CI, randomly raising CUDA out-of-memory runtime errors. See the failing CI run for more details here.

Pitch: Maybe reduce the size of the tensors. cc @astonzhang

AnirudhDagar avatar Jun 25 '21 11:06 AnirudhDagar
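As a rough sanity check on the pitch above, a float32 tensor's footprint grows quadratically with the side length of a square matrix, so even a modest size reduction frees a lot of GPU memory. The sizes below are hypothetical, not the section's actual shapes; note only that a 4000 x 4000 float32 matrix is in the same ballpark as the 62.00 MiB allocation the traceback reports:

```python
def float32_matrix_mib(n: int) -> float:
    """Memory footprint of an n x n float32 tensor in MiB."""
    return n * n * 4 / 2**20

# Halving each side quarters the footprint.
print(float32_matrix_mib(4000))  # 61.03515625
print(float32_matrix_mib(2000))  # 15.2587890625
```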

Since the MXNet implementation does not fail using the same setting, can you dive deep to find out the root cause?

RuntimeError: CUDA out of memory. Tried to allocate 62.00 MiB (GPU 0; 15.75 GiB total capacity; 5.93 GiB already allocated; 31.12 MiB free; 5.93 GiB reserved in total by PyTorch)

astonzhang avatar Jun 25 '21 17:06 astonzhang
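For the root-cause dive, the numbers in the traceback are worth decoding: of the 15.75 GiB capacity, PyTorch's caching allocator had reserved only 5.93 GiB, so roughly 10 GiB was held outside PyTorch (CUDA context, fragmentation, or other processes on the shared CI GPU). One hypothesis, then, is that the failure depends on what else is resident on the device rather than on this section's tensors alone, which could explain why the MXNet run with the same sizes passes. A small GPU-free sketch that pulls the reported sizes out of such a message (the regex and field names are this sketch's own, not a PyTorch API):

```python
import re

msg = ("CUDA out of memory. Tried to allocate 62.00 MiB "
       "(GPU 0; 15.75 GiB total capacity; 5.93 GiB already allocated; "
       "31.12 MiB free; 5.93 GiB reserved in total by PyTorch)")

UNITS = {"MiB": 1 / 1024, "GiB": 1.0}  # normalize everything to GiB

def oom_fields(message: str) -> dict:
    """Extract the sizes (in GiB) that the allocator error reports."""
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|reserved)"
    return {label: float(value) * UNITS[unit]
            for value, unit, label in re.findall(pattern, message)}

fields = oom_fields(msg)
# "total capacity" minus ("reserved" + "free") approximates memory held
# outside PyTorch on this GPU.
print(fields)
```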