When 'submitit' meets 'mpirun', a very strange bug appears.
Hello, I recently tried to use submitit to manage my VASP jobs, but I encountered a strange issue.
Problem Description:
After running test.py, 'squeue' shows my job as "running", but no VASP process has actually started. The code produces no error messages, and I have ruled out problems with environment variables, yet I still cannot find the cause. However, if I change args from 'mpirun -np xx vasp_gam' to 'vasp_gam', a vasp_gam process starts successfully. Launching mpirun -np 48 vasp_gam directly from a Slurm batch script also works. So submitit, mpirun, Slurm, and vasp_gam each work on their own, but not in this combination. I would appreciate any suggestions. Thank you!
Here is the code from test.py:
import submitit
import time
from subprocess import Popen, PIPE

def startvasp(cwd):
    process = Popen(
        args=['mpirun', '-np', '48', 'vasp_gam'],
        shell=False,
        stdin=PIPE,
        stdout=PIPE,
        stderr=PIPE,
        cwd=cwd,
        universal_newlines=True,
        bufsize=0
    )
    stdout, stderr = process.communicate()
    return process.pid, process.poll(), stdout, stderr
# Submit startvasp to Slurm through submitit
executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(
    timeout_min=3600,
    slurm_partition="CLUSTER",
    nodes=1,
    tasks_per_node=1,
    cpus_per_task=48,
    slurm_job_name="test1",
    slurm_setup=[
        "source /home/xxx/intel/parallel_studio_xe_2020.4.912/psxevars.sh intel64",
        "source /home/xxx/intel/mpi2015/profile.d/mpi_intelmpi-5.0.2.044.sh",
        "ulimit -s unlimited",
        "export PATH=/home/xxx/Applications/vasp/vasp.6.3.2/bin:$PATH",
    ]
)
job = executor.submit(startvasp, '/home/xxx/test')
time.sleep(2)
print(job.get_info())
print(job.result())
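Because process.communicate() blocks until mpirun exits, a launcher that hangs looks exactly like a long-running job in squeue. As a diagnostic sketch only (not a fix), the helper below runs a command with stdin attached to /dev/null and a timeout, so a hang is reported instead of blocking silently. The name run_with_timeout, the timeout_s parameter, and its 60-second default are hypothetical; whether the open stdin pipe plays any role in this case is not established.

```python
import subprocess
import sys


def _as_text(data):
    """Normalize captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_with_timeout(cmd, cwd=None, timeout_s=60):
    """Run cmd and report a hang as a timeout instead of blocking forever.

    Hypothetical helper for diagnosis; timeout_s is an arbitrary choice.
    """
    try:
        done = subprocess.run(
            cmd,
            cwd=cwd,
            stdin=subprocess.DEVNULL,  # reads hit EOF immediately; nothing waits on a pipe
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # The child has been killed; return whatever output was captured so far.
        return {
            "timed_out": True,
            "returncode": None,
            "stdout": _as_text(exc.stdout),
            "stderr": _as_text(exc.stderr),
        }
    return {
        "timed_out": False,
        "returncode": done.returncode,
        "stdout": done.stdout,
        "stderr": done.stderr,
    }
```

Inside startvasp, calling run_with_timeout(['mpirun', '-np', '48', 'vasp_gam'], cwd=cwd) and returning the dictionary would show whether mpirun exits with an error, prints something to stderr, or never returns at all.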
This is what the working directory test looks like. Even though the job state is running, it contains only the VASP input files, which means no VASP process has started:
(base) [redhat@gpu test]$ squeue
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  131   CLUSTER test1 fwtop  R  0:43     1 hpc-1-806
(base) [redhat@gpu test]$ ls
INCAR KPOINTS POSCAR POTCAR vasp.script
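Since the same mpirun command works from a plain Slurm script but not under submitit, one thing that may be worth comparing is the launcher-related environment each route sees. The sketch below is only a comparison aid: the function names launcher_env and env_diff are hypothetical, and the prefixes SLURM_, PMI_, and I_MPI_ are an assumption about which variables matter for this MPI installation.

```python
import os

# Assumed prefixes; adjust to whatever the MPI installation actually reads.
PREFIXES = ("SLURM_", "PMI_", "I_MPI_")


def launcher_env(environ=None, prefixes=PREFIXES):
    """Return the environment variables whose names start with one of the prefixes."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}


def env_diff(a, b):
    """Map each differing key to a (value_in_a, value_in_b) pair; None means unset."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

For example, startvasp could return launcher_env() alongside its other results, and the working Slurm script could dump the same variables to a file. Passing the two dictionaries to env_diff then lists only the entries that differ between the two launches.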