Job ID on Remote HPC not matching local ID
I am unfortunately not succeeding in getting the remote job submission to work properly. I am running pyiron version 0.4.7 installed via conda and pyiron_base 0.5.32 installed via pip from the git repo at that tag.
I have followed the steps in the docs for sending jobs to an HPC cluster over ssh. For this, I have set DISABLE_DATABASE=TRUE in the configuration on the cluster, as the documentation suggests.
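For reference, the relevant part of the .pyiron configuration file on the cluster looks roughly like this (a sketch of the documented no-database setup; the paths are placeholders for my actual directories):

```
[DEFAULT]
DISABLE_DATABASE = True
PROJECT_CHECK_ENABLED = False
PROJECT_PATHS = ~/pyiron/projects
RESOURCE_PATHS = ~/pyiron/resources
```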
Right now I am testing whether I can get a simple minimization job to work, which I submit via:
```python
n_type1 = 11
n_type2 = 1
box_length = 22.2
potential = lj_potential

minim = project.create_job("LammpsWL", f'minimization{n_type1}_{n_type2}',
                           delete_existing_job=True)

unit_cell = box_length * np.eye(3)
positions = np.random.random((n_type1 + n_type2, 3)) * box_length
random_start = project.create.structure.atoms(elements=n_type1 * ['Ar'] + n_type2 * ['Kr'],
                                               positions=positions,
                                               cell=unit_cell, pbc=True)
minim.structure = random_start
minim.potential = potential
minim.calc_minimize(
    ionic_energy_tolerance=0.0,
    ionic_force_tolerance=1e-4,
    e_tol=None,
    f_tol=None,
    max_iter=100000,
    pressure=None,
    n_print=100,
    style="cg",
)
minim.server.queue = 'queue_one'
minim.server.cpus = 1
minim.server.run_mode = 'queue'
minim.run()
```
This successfully pushes the job to the cluster, into exactly the working directory that I expect, and it also runs flawlessly until the job status is changed to 'collect'. During the collection step, the following error appears in the output:
IndexError: list index out of range (pyiron_base/database/filetable.py line 121)
This seems to be caused by the job table expecting the job to have id 1 (if I check project.db._job_table on the HPC cluster, this is the only job id that exists there). However, the job shows up in the slurm queue as pi_5123 or something like that, probably because this is the id the job was given on my local machine, from which I submitted it. Hence, the entire communication between the machines breaks at this point.
Is there something in the setup that I have missed? Should I somehow set the id on my local machine to start at 0 again?
On a related note: is it possible to submit job series ("Flexible" pyiron jobs, which are connected by a step function) over ssh as well?
I fear this problem is on our end: it is already reported in https://github.com/pyiron/pyiron_base/issues/791. @Leimeroth has the most experience running pyiron in the remote setup - is there a workaround which can be used?
Collecting (parsing the output) currently does not work with the remote setup. As a workaround you can copy the output of the jobs back and collect them locally using pr.update_from_remote(try_collecting=True).

Regarding Flexible jobs I am not sure, but what works for ParallelMaster-derived jobs (for example ElasticTensor) is to set the queue for the reference job but not for the master job itself. Then all child jobs are submitted separately and can again be retrieved using update_from_remote.
Ok, thank you for your suggestions. I will try what you have suggested (maybe this should be reflected in the docs somewhere?). I will report back whether the flexible jobs work here; however, submitting only the master job, as you say, did not work so far.
It should look something like this:

```python
ref = pr.create_job("Vasp", "Ref")
ref.structure = structure
ref.server.cores = 24
ref.server.queue = "normal_N1"

job = ref.create_job("ElasticTensor", "ElasticTensor")
job.run()
```
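Once the child jobs have finished on the cluster, the results are then pulled back and parsed on the local workstation, roughly like this (a sketch; the project and job names are taken from the snippet above):

```python
# On the local workstation: copy the output of the remote jobs back,
# parse it locally, and reload the master job from the local database.
pr.update_from_remote(try_collecting=True)
pr.job_table()                      # check that the jobs are now 'finished'
elastic = pr.load("ElasticTensor")  # the master job from the snippet above
```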
Just for me to understand the current configuration: locally you use an SQL database and on the remote cluster you use pyiron without a database, correct?
Yes, exactly, that is my configuration: a local workstation with an SQL database and a remote cluster without a database. Performing calculations on the cluster and then "importing" them with pr.update_from_remote(try_collecting=True) works flawlessly. However, if you look at the jobs directly on the cluster, the error log shows that collecting on the cluster did not work. I assumed at the time that this is what caused my problems, as I did not know that I had to collect the jobs locally on the workstation via the call to update_from_remote.
There is currently a bug that prevents the job from being found correctly in the file table on the cluster. Therefore it cannot update its status to collect and throws an error. Once that bug is fixed, the workaround of collecting jobs locally won't be necessary anymore.
@ligerzero-ai Can you take a look at this? Since pysqa can now return the working directory (https://github.com/pyiron/pysqa/pull/143), it should be possible to fix this bug by matching jobs in pyiron based on their working directory.
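To illustrate the idea (this is only a sketch of the matching logic, not the actual pyiron_base code; the helper name and table columns are assumptions):

```python
import os

def find_job_by_working_directory(job_table, working_directory):
    """Sketch: look up a row of the file-based job table by the working
    directory reported by pysqa, instead of by the (non-matching) job id.

    job_table is assumed to be a pandas DataFrame with 'project' and 'job'
    columns, as exposed by project.db._job_table in the no-database setup.
    """
    working_directory = os.path.normpath(working_directory)
    for _, row in job_table.iterrows():
        # By pyiron convention the working directory lives below
        # <project>/<job_name>_hdf5/<job_name>.
        candidate = os.path.normpath(
            os.path.join(row["project"], row["job"] + "_hdf5", row["job"])
        )
        if working_directory == candidate:
            return row
    return None
```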
@ligerzero-ai and I recently worked on this. Can you check whether https://github.com/pyiron/pyiron_base/pull/1067 helps to fix your bug?