pyiron_base icon indicating copy to clipboard operation
pyiron_base copied to clipboard

Default job name of restarted simulations

Open samwaseda opened this issue 1 year ago • 5 comments

I'm currently working on Damask where restarting is part of job routines. In the current setting, the default job name is job_restart if the job name of the old job is job. This causes a bit of problems when the job has to be let's say run 10 times, which will be something like job_restart_restart_restart_..._restart. I'd like to change it but I also wanted to hear your opinions

  • Possibility 0: Leave it like this
  • Possibility 1: Replace it by rerun_{index} (or restart, but rerun is shorter)
    • This makes it difficult to bundle all the simulations, because the first run won't contain rerun, so if we use a for loop it's gonna be like all_jobs = [pr.load("job")] + [pr.load(f"job_rerun_{counter}") for counter in range(5)] (which is still WAY easier than the current setting).
    • We could potentially check if job_restart exists and make a warning for the backward compatibility
  • Possibility 2: Create a new entry in the job table for the counter
    • It's probably a nightmare for the backward compatibility
    • We wouldn't have a problem with user accidentally using rerun or restart in the job name (because in the option above we would have to parse rerun or restart)

samwaseda avatar May 15 '24 13:05 samwaseda

In the job table we have the parent ID this should print to the job the current job was restarted from.

jan-janssen avatar May 15 '24 14:05 jan-janssen

In the job table we have the parent ID this should print to the job the current job was restarted from.

That's true, but it's not particularly helpful when you want to have a list of connected jobs by list comprehension or other methods. Especially the user usually wants to use the main job and have all the subsequent jobs, and not other way around. In that case it's difficult to figure out which ones are the child jobs. Anyway, it doesn't solve my original problem that the job names become irrationally long.

samwaseda avatar May 15 '24 14:05 samwaseda

* Possibility 1: Replace it by `rerun_{index}` (or `restart`, but `rerun` is shorter)
  
  * This makes it difficult to bundle all the simulations, because the first run won't contain `rerun`, so if we use a for loop it's gonna be like `all_jobs = [pr.load("job")] + [pr.load(f"job_rerun_{counter}") for counter in range(5)]` (which is still WAY easier than the current setting).

You could even add a convenience method on parent jobs like

@property 
def all_re_run_names(self) -> list[str]:
    all_names = [self.name]
    for name in self.project.job_table()["job_name"]:
        if name.startswith(self.name + "_rerun_"):
            all_names.append(name)

Or similar -- there is probably a more elegant formulation and I'm not even sure the syntax here is correct, but you get the idea.

liamhuber avatar May 15 '24 15:05 liamhuber

  • Possibility 0: Leave it like this

I guess this way you will run into the column character length limit pretty fast

liamhuber avatar May 15 '24 15:05 liamhuber

In the job table we have the parent ID this should print to the job the current job was restarted from.

That's true, but it's not particularly helpful when you want to have a list of connected jobs by list comprehension or other methods. Especially the user usually wants to use the main job and have all the subsequent jobs, and not other way around. In that case it's difficult to figure out which ones are the child jobs. Anyway, it doesn't solve my original problem that the job names become irrationally long.

I think that's probably only one SQL query (though I'd need to look it up), that we could bake into a method on a job, like job.get_all_children(). If this is easy, I think this would be my preferred solution as well. Then you could go for the second option for the naming.

Alternatively we could also make restart use job.child_project, so that they all live inside the folder of the original job. We had once discussed something similar for the repaired jobs from HandyMan.

pmrv avatar May 15 '24 16:05 pmrv

Another point I had wanted to make is, that I think restart is doing way too many different things:

  1. restart aborted jobs to fix inputs
  2. restart dft jobs for different post processing runs, e.g. band structure calculations
  3. continue MD runs

So I would suggest that you add a new method to the Damask job that does only the things that make sense there and then we see if we can generalize it in base or else where (if necessary at all).

pmrv avatar May 24 '24 14:05 pmrv

Current problem with backward compatibility:

job = pr.create.job.Sphinx("job")
...
job.run()

new_job = job.restart()
new_job.run()

@jan-janssen's solution: Child ID should be checked before the child runs.

samwaseda avatar Jun 03 '24 13:06 samwaseda

@pmrv's solution: Change the behaviour in Damask:

class DamaskJob:
   def rerun_loading(self, job_name=None):
        if job_name is None:
            if "restart" not in self.job_name:
                job_name = "_".join(self.job_name.split("_")[:-1]) + "_" + int(self.job_name.split("_")[-1]) + 1
            self:
                job_name = self.job_name + "restart_0"
        return self.restart(job_name=job_name)

samwaseda avatar Jun 03 '24 13:06 samwaseda

Tasked with writing a method to return all jobs that originate from a given job based on the job_table ChatGPT gave this

import pandas as pd

# Sample DataFrame
data = {'id': [1, 2, 3, 4, 5, 6],
        'parentid': [None, 1, 1, 2, 3, 4]}
df = pd.DataFrame(data)

def find_all_descendants(df, target_id):
    all_related_ids = set()
    
    def find_children(parent_ids):
        children = df.query('parentid in @parent_ids')['id']
        if children.empty:
            return set()
        return set(children).union(find_children(children))

    all_related_ids = find_children({target_id})
    
    return df.query('id in @all_related_ids')

# Usage
target_id = 1
result = find_all_descendants(df, target_id)
print(result)

Not sure if it is useful.

pmrv avatar Jun 03 '24 14:06 pmrv

I guess the conclusion of my original question is that we don't change the behaviour of restart. Is that correct?

samwaseda avatar Jun 03 '24 14:06 samwaseda