MIOpen
MIOpen cache issue with SLURM and multiple jobs
The environment is a computing cluster: Slurm 20.11.3, MI50 GPUs, PyTorch 1.12.0, ROCm 5.2.0.
Code (test.py):

import torch

net = torch.nn.Conv2d(2, 28, 3).cuda()
inp = torch.randn(20, 2, 50, 50).cuda()
outputs = net(inp)
Running the code (test.py) in multiple concurrent jobs (submitted with the sbatch command) produces this error:
/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/include/miopen/kern_db.hpp:147: Internal error while accessing SQLite database: database disk image is malformed
Traceback (most recent call last):
File "/home/alirezak/playground/test.py", line 4, in
Solution: for each (sbatch) job, set up a private cache before running the experiment (the variables must be exported so the Python process sees them; <some tmp folder> is a placeholder for a node-local scratch path):

experiment=<some unique name>
export HOME=<some tmp folder>/$experiment
mkdir <some tmp folder>/$experiment
export TMPDIR=<some tmp folder>/$experiment/tmp
mkdir $TMPDIR
export MIOPEN_CUSTOM_CACHE_DIR=<some tmp folder>/$experiment/miopen
mkdir $MIOPEN_CUSTOM_CACHE_DIR

run the experiment

rm -rf <some tmp folder>/$experiment
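The steps above can be condensed into a small sketch that could go at the top of each sbatch script. This is an assumption-laden example, not a tested recipe: it derives a unique name from Slurm's SLURM_JOB_ID (falling back to the shell PID for interactive runs), and also sets MIOPEN_USER_DB_PATH on the assumption that isolating the user perf-db directory per job helps for the same reason as MIOPEN_CUSTOM_CACHE_DIR:

```shell
#!/bin/bash
# Give each job a private MIOpen cache so concurrent jobs never
# open the same SQLite .ukdb files (which corrupts them).
experiment="miopen-${SLURM_JOB_ID:-$$}"   # SLURM_JOB_ID is set by Slurm; $$ is a fallback
base="${TMPDIR:-/tmp}/$experiment"

mkdir -p "$base/tmp" "$base/miopen"
export TMPDIR="$base/tmp"
export MIOPEN_CUSTOM_CACHE_DIR="$base/miopen"
export MIOPEN_USER_DB_PATH="$base/miopen"  # assumption: also isolates the user perf db

# python test.py          # run the actual experiment here
# rm -rf "$base"          # clean up the per-job cache afterwards
```

The key point is that every job writes its gfx906_60.ukdb (and the -shm/-wal companions) into a directory no other job can see.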
My guess: the MIOpen cache consists of the files gfx906_60.ukdb, gfx906_60.ukdb-shm, and gfx906_60.ukdb-wal, which are shared by multiple concurrent jobs. Would it be possible to append a unique (e.g., random) per-job suffix to these file names?
Thank you