mlsh
Termination due to `Bus error (signal 7)`
Hi,
I'm running the code with mpirun inside a Docker container. It worked at first, but recently I started getting this error message:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 135
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
As far as I'm aware, I didn't change anything. Does anyone know where this might be coming from? Thanks a lot! Best, Max
It doesn't seem to happen when I run fewer parallel processes (30 instead of 40 or 50, on a server that supports up to 56 threads).
Hi! My computer has two GPUs. When I run this code, the utilization of the first one is only 5%, and the second one is at zero. Do you know why my GPU utilization is so low? Do I need to modify the code according to each computer's configuration? Thanks!
Maybe this error stems from the nodes we used.
Same thing here, bro. As far as I can tell, it has something to do with Docker's default shared memory size. I'm not 100% sure yet, but for now I've increased the container's shared memory from 64 MB to several GB to test it.
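If shared memory is the culprit, it helps to check how much `/dev/shm` the container actually has (Docker's default is 64 MB) and how thinly that gets spread as the number of MPI processes grows, which would also fit the observation that 30 processes work while 40 or 50 crash. The sketch below is a rough diagnostic, not part of mlsh: it reads the tmpfs size with `os.statvfs` and assumes, purely for illustration, that every rank needs a similar share.

```python
import os

def shm_stats(path="/dev/shm"):
    """Return (total_bytes, free_bytes) for the filesystem mounted at `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return total, free

def per_process_budget(free_bytes, n_procs):
    """Rough estimate: split free shared memory evenly across n_procs ranks.

    Assumption: all ranks use about the same amount of shared memory.
    """
    if n_procs <= 0:
        raise ValueError("n_procs must be positive")
    return free_bytes // n_procs

if __name__ == "__main__":
    if not os.path.isdir("/dev/shm"):
        print("/dev/shm not found (not running in a Linux container?)")
    else:
        total, free = shm_stats()
        mib = 1024 * 1024
        print(f"/dev/shm total: {total / mib:.1f} MiB, free: {free / mib:.1f} MiB")
        for n in (30, 40, 50):
            print(f"{n} processes -> ~{per_process_budget(free, n) / mib:.2f} MiB each")
```

A total of about 64 MiB inside the container would suggest the default limit is still in effect.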
It works for me. When running the Docker image, I add --shm-size=2000g.
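For anyone launching the container from a script, here is a minimal sketch of building the `docker run` command with a larger shared-memory limit. The image name `mlsh-image` and the helper `docker_run_cmd` are hypothetical; `--shm-size` is the flag from the comment above, and it accepts unit suffixes such as `m` and `g`. A value of a few GB may be enough; 2000g is simply what worked in this report.

```python
import shlex
import subprocess

def docker_run_cmd(image, shm_size="2000g", extra_args=()):
    """Build a `docker run` argument list that raises the /dev/shm limit.

    `image` and `extra_args` are placeholders; adapt them to your setup.
    """
    return ["docker", "run", f"--shm-size={shm_size}", *extra_args, image]

if __name__ == "__main__":
    # "mlsh-image" is a hypothetical image name; substitute your own.
    cmd = docker_run_cmd("mlsh-image", extra_args=["-it"])
    print(shlex.join(cmd))
    # To actually start the container, uncomment:
    # subprocess.run(cmd, check=True)
```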