When 'submitit' meets 'mpirun', a very strange bug occurs.

Open yinkaaiwu opened this issue 2 years ago • 0 comments

Hello, I recently tried to use submitit to manage my VASP jobs, but I encountered a strange issue.

Problem Description: After running test.py, 'squeue' reports my job as "running", but in reality no VASP processes have started. The code produces no error messages, and although I have ruled out problems with environment variables, I still cannot find the cause. However, I noticed that if I change args='mpirun -np xx vasp_gam' to 'vasp_gam', a vasp_gam process starts successfully. If I launch it from a Slurm script with mpirun -np 48 vasp_gam, it also runs successfully. This suggests that submitit, mpirun, Slurm, and vasp_gam each work on their own, but do not work together as expected. I hope to get a solution from you. Thank you!

Here is the code from test.py:

import submitit
import time
from subprocess import Popen, PIPE

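def startvasp(cwd):
    # Launch VASP through mpirun in the given working directory.
    process = Popen(
        args=['mpirun','-np','48','vasp_gam'],
        shell=False,
        stdin=PIPE,
        stdout=PIPE,
        stderr=PIPE,
        cwd=cwd,
        universal_newlines=True,
        bufsize=0
    )
    # communicate() blocks until mpirun exits and collects all of its output.
    stdout, stderr = process.communicate()
    return process.pid, process.poll(), stdout, stderr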


executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(
    timeout_min=3600,
    slurm_partition="CLUSTER",
    nodes=1,
    tasks_per_node=1,
    cpus_per_task=48,
    slurm_job_name="test1",
    slurm_setup=[
        "source /home/xxx/intel/parallel_studio_xe_2020.4.912/psxevars.sh intel64",
        "source /home/xxx/intel/mpi2015/profile.d/mpi_intelmpi-5.0.2.044.sh",
        "ulimit -s unlimited",
        "export PATH=/home/xxx/Applications/vasp/vasp.6.3.2/bin:$PATH",
    ]
)

job = executor.submit(startvasp, '/home/xxx/test')

time.sleep(2)
print(job.get_info())
print(job.result())
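
Because `process.communicate()` blocks until the child exits, a launcher that hangs silently would leave the job in the running state with no output at all. As a diagnostic sketch only (not a fix, and not part of submitit), the hypothetical helper `run_with_timeout` below wraps a command in `subprocess.run` with a timeout and returns whatever output was captured before the timeout. The 60-second default and the use of `stdin=DEVNULL` are assumptions made for illustration.

```python
import subprocess
import sys


def run_with_timeout(args, cwd=None, timeout=60):
    """Run a command; return (returncode, stdout, stderr, timed_out).

    Hypothetical diagnostic helper: if the command is still running after
    `timeout` seconds, it is killed and any captured output is returned.
    """
    def as_text(data):
        # TimeoutExpired may carry None, bytes, or str depending on platform.
        if data is None:
            return ""
        return data.decode(errors="replace") if isinstance(data, bytes) else data

    try:
        done = subprocess.run(
            args,
            cwd=cwd,
            stdin=subprocess.DEVNULL,  # assumption: the command needs no input
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
        return done.returncode, done.stdout, done.stderr, False
    except subprocess.TimeoutExpired as exc:
        # The child was killed; report the output seen so far, if any.
        return None, as_text(exc.stdout), as_text(exc.stderr), True


if __name__ == "__main__":
    # Harmless stand-in command; on the cluster this would be the mpirun call.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout=10))
```

Inside `startvasp`, a call such as `run_with_timeout(['mpirun', '-np', '48', 'vasp_gam'], cwd=cwd)` could stand in for the `Popen` call, so that `job.result()` returns after the timeout instead of blocking indefinitely.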

This is what the working directory test looks like. Even though the job state is running, it contains only the VASP input files, which means that no VASP processes have been started:

(base) [redhat@gpu test]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               131   CLUSTER    test1    fwtop  R       0:43      1 hpc-1-806
(base) [redhat@gpu test]$ ls
INCAR  KPOINTS  POSCAR  POTCAR  vasp.script
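
To double-check from inside the job that the environment matches the one seen by a hand-written Slurm script, a function like the hypothetical `snapshot_env` below could be submitted in place of `startvasp`. It is only a sketch: it records where each executable resolves on `PATH` and collects every environment variable whose name starts with `SLURM_`. The helper name and the choice of what to record are assumptions, not part of submitit.

```python
import os
import shutil


def snapshot_env(executables=("mpirun", "vasp_gam"), prefix="SLURM_"):
    """Return resolved executable paths and environment variables matching prefix."""
    # shutil.which returns None when a name cannot be found on PATH.
    resolved = {name: shutil.which(name) for name in executables}
    # Keep only variables whose names start with the given prefix.
    variables = {
        key: value
        for key, value in sorted(os.environ.items())
        if key.startswith(prefix)
    }
    return {"executables": resolved, "variables": variables}
```

Submitting it with `executor.submit(snapshot_env)` and printing `job.result()` would give a dictionary that can be compared against the same call made from the working Slurm script.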

yinkaaiwu, Dec 10 '23 16:12