
MIOpen cache issue with SLURM and multiple jobs

arkhodamoradi opened this issue Nov 14 '22 · 6 comments

The environment is a computing cluster:

- Slurm 20.11.3
- MI50 GPUs
- PyTorch 1.12.0
- ROCm 5.2.0

Code (test.py):


```python
import torch

net = torch.nn.Conv2d(2, 28, 3).cuda()
inp = torch.randn(20, 2, 50, 50).cuda()
outputs = net(inp)
```


Run the code (test.py) in multiple jobs (using the sbatch command) to get this error:

```
/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/include/miopen/kern_db.hpp:147: Internal error while accessing SQLite database: database disk image is malformed
Traceback (most recent call last):
  File "/home/alirezak/playground/test.py", line 4, in <module>
    outputs = net(inp)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 268, in _forward_impl
    x = self.conv1(x)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: miopenStatusInternalError
```
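For context, this is roughly how the failing jobs are submitted (the script name test.sbatch and the allocation flags are assumptions, not taken verbatim from my setup); every job launched this way shares the default MIOpen cache under $HOME, which is what the concurrent jobs then corrupt:

```bash
#!/bin/bash
#SBATCH --gres=gpu:1   # request one GPU (assumed allocation flags)

# Submitting this script several times, e.g.
#   for i in $(seq 4); do sbatch test.sbatch; done
# runs the jobs concurrently against the same per-user MIOpen cache.
python test.py
```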

Solution: For each (sbatch) job, do the following:

```bash
experiment=<some unique name>
# export so the child Python process sees the new locations
export HOME=<some tmp folder>/$experiment
mkdir <some tmp folder>/$experiment
export TMPDIR=<some tmp folder>/$experiment/tmp
mkdir <some tmp folder>/$experiment/tmp
export MIOPEN_CUSTOM_CACHE_DIR=<some tmp folder>/$experiment/miopen
mkdir <some tmp folder>/$experiment/miopen
```

Run the experiment, then remove the per-job folder:

```bash
rm -rf <some tmp folder>/$experiment
```
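Putting the workaround together, here is a sketch of a complete job script; the scratch location /tmp/$USER and the use of $SLURM_JOB_ID as the unique name are assumptions, any per-job unique path works:

```bash
#!/bin/bash
#SBATCH --gres=gpu:1                      # assumed allocation flags

experiment="job_${SLURM_JOB_ID}"          # unique per sbatch job
base="/tmp/$USER/$experiment"             # assumed scratch location
mkdir -p "$base/tmp" "$base/miopen"

export HOME="$base"                       # MIOpen caches under $HOME by default
export TMPDIR="$base/tmp"
export MIOPEN_CUSTOM_CACHE_DIR="$base/miopen"

python test.py                            # run the experiment

rm -rf "$base"                            # clean up the per-job folder
```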

My guess: the MIOpen cache includes the gfx906_60.ukdb, gfx906_60.ukdb-shm, and gfx906_60.ukdb-wal files (an SQLite database plus its shared-memory and write-ahead-log sidecar files), which are shared by multiple jobs running as the same user. Is it possible to add some random number to these file names per job?
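Until something like that exists on the MIOpen side, a lighter-weight variant of the workaround above is to redirect only the MIOpen databases per job instead of overriding $HOME. MIOPEN_USER_DB_PATH is documented by MIOpen for the user database location, though whether it covers the .ukdb files in this ROCm version is an assumption on my part:

```bash
# Give each Slurm job its own MIOpen database/cache directories;
# $SLURM_JOB_ID is unique per job, so no two jobs share the SQLite files.
export MIOPEN_USER_DB_PATH="/tmp/$USER/miopen-db-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="/tmp/$USER/miopen-cache-${SLURM_JOB_ID}"
mkdir -p "$MIOPEN_USER_DB_PATH" "$MIOPEN_CUSTOM_CACHE_DIR"
```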

Thank you
