
Job ID on Remote HPC not matching local ID

pstaerk opened this issue on Mar 16 '23 • 9 comments

I am unfortunately not succeeding in getting the remote job submission to work properly. I am running pyiron version 0.4.7 installed via conda and pyiron_base 0.5.32 installed via pip from the git repo at that tag.

I have followed the steps in the docs for sending jobs to an HPC over SSH. For this, I have set DISABLE_DATABASE=TRUE as the documentation suggests.
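For reference, a minimal sketch of what such a remote-side ~/.pyiron configuration can look like (the paths here are placeholders, not my actual setup):

[DEFAULT]
DISABLE_DATABASE = True
PROJECT_CHECK_ENABLED = False
PROJECT_PATHS = ~/pyiron/projects
RESOURCE_PATHS = ~/pyiron/resources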

Right now I am testing whether I can get a simple minimization job to work, which I submit via

import numpy as np
from pyiron import Project

project = Project("remote_minimization")  # placeholder project name
potential = lj_potential  # Lennard-Jones potential, defined elsewhere

n_type1 = 11
n_type2 = 1
box_length = 22.2

# custom LAMMPS job class
minim = project.create_job("LammpsWL", f'minimization{n_type1}_{n_type2}',
                           delete_existing_job=True)

# cubic box filled with randomly placed Ar and Kr atoms
unit_cell = box_length * np.eye(3)
positions = np.random.random((n_type1 + n_type2, 3)) * box_length
random_start = project.create.structure.atoms(
    elements=n_type1 * ['Ar'] + n_type2 * ['Kr'],
    positions=positions,
    cell=unit_cell,
    pbc=True,
)

minim.structure = random_start
minim.potential = potential
minim.calc_minimize(
    ionic_energy_tolerance=0.0,
    ionic_force_tolerance=1e-4,
    e_tol=None,
    f_tol=None,
    max_iter=100000,
    pressure=None,
    n_print=100,
    style="cg",
)

# submit to the remote queue
minim.server.queue = 'queue_one'
minim.server.cores = 1
minim.server.run_mode = 'queue'
minim.run()

This successfully pushes the job to the cluster, into exactly the working directory that I expect, and it also runs flawlessly until the job status gets changed to 'collect'.

During the collection, the following error can be seen in the output:

IndexError: list index out of range (pyiron_base/database/filetable.py line 121)

This seems to be caused by the job table expecting the job to have id 1 (if I check project.db._job_table, this is the only job id that exists on the HPC cluster). However, the job id in the Slurm queue is pi_5123 or something like that, probably because this is the id that the job would have gotten on my local machine, from which I submitted the job. Hence, the entire communication between the machines breaks at this point.
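For reference, a minimal sketch of how the mismatch shows up (the project path is a placeholder, assuming the same project is opened on both machines):

# on the HPC cluster (file-based job table, no SQL database)
from pyiron import Project
pr_remote = Project("path/to/project")
print(pr_remote.db._job_table)   # the job appears here with id 1

# on the local workstation (SQL database)
pr_local = Project("path/to/project")
print(pr_local.job_table())      # the same job carries the local id, e.g. 5123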

Is there something in the setup that I have missed? Should I somehow set the id on my local machine to start at 0 again?

In that vein: is it possible to submit job series ("Flexible" pyiron jobs, which are connected by a step function) over SSH?

pstaerk avatar Mar 16 '23 14:03 pstaerk

I fear this problem is on our end: it is already reported in https://github.com/pyiron/pyiron_base/issues/791... @Leimeroth has the most experience running pyiron in the remote setup - is there a workaround that can be used?

niklassiemer avatar Mar 16 '23 15:03 niklassiemer

Collecting (parsing the output) currently does not work with the remote setup. As a workaround you can copy the output of the jobs back and collect them locally using pr.update_from_remote(try_collecting=True). Regarding Flexible jobs I am not sure, but what works for ParallelMaster-derived jobs (for example ElasticTensor) is to set the queue for the reference job but not for the master job itself. Then all child jobs are submitted separately and can again be retrieved using update_from_remote.
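A minimal sketch of the local collection step (the project path is a placeholder):

from pyiron import Project

pr = Project("path/to/project")  # the project the jobs were submitted from
# copy the finished output back from the cluster and parse it locally
pr.update_from_remote(try_collecting=True)
pr.job_table()  # check the updated job status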

Leimeroth avatar Mar 16 '23 15:03 Leimeroth

Ok, thank you for your suggestions so far. I will try what you have suggested (maybe this should be reflected in the docs somewhere?). I will update if the flexible jobs work here; however, submitting only the master job, as you say, did not work so far.

pstaerk avatar Mar 16 '23 15:03 pstaerk

It should look something like this

ref = pr.create_job("Vasp", "Ref")
ref.structure = structure
ref.server.cores = 24
ref.server.queue = "normal_N1"  # queue is set on the reference job only

# the master job itself is not sent to the queue; its child jobs are submitted separately
job = ref.create_job("ElasticTensor", "ElasticTensor")
job.run()

Leimeroth avatar Mar 17 '23 06:03 Leimeroth

Just for me to understand the current configuration, locally you use an SQL database and on the remote cluster you use pyiron without a database, correct?

jan-janssen avatar Mar 17 '23 14:03 jan-janssen

Yes, exactly, that is my configuration: a local workstation with an SQL database and a remote cluster without a database. Performing calculations on the cluster and then "importing" them with pr.update_from_remote(try_collecting=True) works flawlessly. However, if you look at the jobs on the cluster remotely, the error log shows that collecting on the cluster did not work. I assumed at the time that this was what caused my problems, as I did not know that I had to collect the jobs locally on the workstation by calling update_from_remote.

pstaerk avatar Apr 13 '23 08:04 pstaerk

I assumed at the time that this is what caused my problems, as I did not know that I had to collect the jobs locally on the workstation by the call to update from the remote machines.

There currently is a bug that prevents the job from being found correctly in the file table on the cluster. Therefore it cannot update its status to 'collect' and throws an error. Once the corresponding bug is fixed, the workaround of collecting jobs locally won't be necessary anymore.

Leimeroth avatar Apr 13 '23 09:04 Leimeroth

@ligerzero-ai Can you take a look at this? As pysqa can now return the working directory (https://github.com/pyiron/pysqa/pull/143), it should be possible to fix this bug by matching jobs in pyiron based on their working directory.
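Conceptually it could look roughly like this (only a sketch, not the actual implementation: the get_status_of_my_jobs() call exists in pysqa, but the working_directory column name is an assumption based on the linked PR):

from pysqa import QueueAdapter

qa = QueueAdapter(directory="~/.queues")   # placeholder queue configuration
status_df = qa.get_status_of_my_jobs()     # queue status as a pandas DataFrame

# instead of relying on the (mismatching) job id, look the job up by its
# working directory, which is unique per pyiron job
row = status_df[status_df["working_directory"] == job.working_directory]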

jan-janssen avatar Apr 13 '23 14:04 jan-janssen

@ligerzero-ai and I recently worked on this; can you check whether https://github.com/pyiron/pyiron_base/pull/1067 helps to fix your bug?

jan-janssen avatar Apr 21 '23 17:04 jan-janssen