pyiron_base
RFC: Bulk delete of jobs
We discussed a few times already that deleting a lot of jobs takes too much time. Removing thousands of jobs can take a few hours.
Here's an idea of how to do it better:
1. get a list of jobs to be deleted
2. prune or enrich the list based on linked jobs (master, parent, etc.)
3. prune the list based on job status (e.g. running or submitted)
4. make a backup of the database entries
5. delete the database entries in bulk (doing it sequentially already takes only a few tens of seconds for ~1k jobs)
6. delete the working directory of each job, aborting once there's an error
7. if there has been an error, restore the database entries of the jobs that have been skipped
Optionally we could stop after 2. & 3. to be cautious.
Some points on why I picked this order (a rough code sketch of the flow follows below):
- Step 3 comes after 1. and 2. because jobs added in 2. might not be finished yet.
- Step 5 is done to avoid a potential race condition between multiple pyiron processes where a working directory is deleted already but the database is not updated yet.
- Steps 4. & 7. are there to restore the database entries of jobs that have not been deleted on disk because of errors.
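To make the ordering concrete, here is a minimal sketch of how such a bulk delete could look. All helper names (collect_linked_jobs, backup_rows, bulk_delete_rows, restore_rows) are hypothetical placeholders rather than existing pyiron functions; the point is only the control flow, in particular deleting the database rows before the directories and restoring the rows of anything that could not be removed on disk.

import shutil

def bulk_delete(pr, job_ids):
    # 1. + 2. collect the jobs to delete plus anything linked to them (master, parent, ...)
    jobs = collect_linked_jobs(pr, job_ids)  # hypothetical helper
    # 3. drop jobs that are still running or submitted
    jobs = [j for j in jobs if j.status not in ("running", "submitted")]
    # 4. back up the database rows so they can be restored on failure
    backup = backup_rows(pr, jobs)  # hypothetical helper
    # 5. delete the database entries in one bulk statement
    bulk_delete_rows(pr, jobs)  # hypothetical helper
    # 6. delete the working directories, aborting at the first error
    deleted = []
    try:
        for job in jobs:
            shutil.rmtree(job.working_directory)
            deleted.append(job)
    except OSError:
        # 7. restore the rows of the jobs that were skipped because of the error
        restore_rows(pr, backup, [j for j in jobs if j not in deleted])  # hypothetical helper
        raise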
I'll comment here in support of speeding this up. On my end, restarting a set of bulk calculations with pyiron takes an absurd amount of time - it's taking ~3 minutes per job with delete_existing_job=True vs. ~1 second to submit a job when there is no existing job. I am finding it faster to prune out the jobs that didn't previously finish before restarting the pyiron workflow. Having faster job removal can be key for "resubmit-able" workflows.
I have a pull request open for this: https://github.com/pyiron/pyiron_base/pull/1182 - maybe you can check if it solves your issue; then we should move forward and include it in the next release.
I'm guessing #1182 and #1215 won't help much with the use case of something like
for _ in ...:
    j = pr.create.job.MyJob(..., delete_existing_job=True)
    j.run()
since it will still touch every job sequentially.
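For that pattern to benefit from any bulk API, the cleanup would have to be hoisted out of the submission loop, roughly as sketched below. bulk_remove is a hypothetical stand-in for whatever call comes out of the RFC above; the job table columns (status, id) are just the ones already used elsewhere in this thread.

# Clear all unfinished jobs in one pass before resubmitting,
# instead of passing delete_existing_job=True inside the loop.
table = pr.job_table()
stale_ids = table[table.status != "finished"].id.tolist()
bulk_remove(pr, stale_ids)  # hypothetical bulk-delete call, see the sketch above
# ... then run the submission loop without delete_existing_job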
@mgt16-LANL Can you run your workflow (re-)submission under cProfile and send the resulting profile here?
import cProfile
prof = cProfile.Profile()
prof.enable()
... # your workflow
prof.disable()
prof.dump_stats('myprofile.pyprof')
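For reading such a dump back, the standard-library pstats module works, e.g.:

import pstats

stats = pstats.Stats('myprofile.pyprof')
# show the 20 most expensive calls sorted by cumulative time
stats.sort_stats('cumulative').print_stats(20)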
@jan-janssen - Unfortunately @pmrv is right. For context - I am running thousands of calculations in a single project without the POSTGRES database backend enabled. I am intentionally giving each run more jobs than will finish during the walltime and would like to simply restart the ones that don't finish. While debugging why it was taking 3 minutes for each job to start upon resubmission of a ~10,000 job submission vs. 1 second on initial submission, I found that the bulk of the time was most likely spent on removing each job, e.g. pr.remove_job(job_id) took more than 2 minutes. On a smaller project (~3000 calculations, where it definitely runs faster) I ran the following under cProfile:
import cProfile
from pyiron import Project
from tqdm import tqdm
prof = cProfile.Profile()
prof.enable()
pr = Project('others_async_adf')
table = pr.job_table()
adf_non_finish = table[(table.status != 'finished') & (table.job.str.contains('adf'))]
for i, row in tqdm(adf_non_finish.iloc[0:50].iterrows(), total=adf_non_finish.iloc[0:50].shape[0]):
    pr.remove_job(row['id'])
prof.disable()
prof.dump_stats('myprofile.pyprof')
For the removal of 50 calculations this was the tqdm output: 100%|██████████| 50/50 [09:21<00:00, 11.24s/it]
The .pyprof file is attached (zipped) as pyprof.zip.
My temporary workaround to make this go faster is to hardcode a separate script that clears out the directories I know didn't finish using pathlib, which is much faster than the pr.remove_job() route - but I think it would be much cleaner for re-submissions if pyiron removed jobs faster.
Edit/note: The especially slow deletions (e.g. 2 min) might have occurred while the filesystem was having issues - this is on a system where that can be hard to tell. Either way, 11 s per deletion is way too slow for a serial set of submissions; it adds up to hours of delay on thousands of jobs.
Edit2: Tracked down timing when I tried this on the larger project:
for i, row in tqdm(adf_non_finish.iterrows(), total=adf_non_finish.shape[0]):
    pr.remove_job(row['id'])
tqdm output: 0%| | 4/8698 [05:12<190:42:47, 78.97s/it]
pyiron.log output (initial submission):
2023-10-13 01:41:45,399 - pyiron_log - INFO - run job: Gd0_arch id: None, status: initialized
2023-10-13 01:41:45,731 - pyiron_log - INFO - run job: Gd0_arch id: None, status: created
2023-10-13 01:41:45,735 - pyiron_log - INFO - run job: Gd1_arch id: None, status: initialized
2023-10-13 01:41:46,060 - pyiron_log - INFO - run job: Gd1_arch id: None, status: created
pyiron.log output (re-submission):
2023-10-19 13:00:04,173 - pyiron_log - INFO - run job: Er116_adf_0 id: None, status: initialized
2023-10-19 13:05:26,037 - pyiron_log - INFO - run job: Er116_adf_0 id: None, status: created
2023-10-19 13:05:26,104 - pyiron_log - INFO - run job: Lu116_adf_0 id: None, status: initialized
2023-10-19 13:11:28,282 - pyiron_log - INFO - run job: Lu116_adf_0 id: None, status: created
Ah, you are using FileTable. According to the profile most time (~80%) is spent here, where it essentially updated the whole job table on a get_child_ids query (in turn probably for each job). I know you don't run a central database, but would switching to the sqlite backend be an option?
Thanks @pmrv, I am a bit hesitant to switch to the sqlite backend since I'll often have jobs going in multiple project submissions at the same time, so I'm a bit concerned about file locking. My workaround code was:
from tqdm import tqdm
from pathlib import Path
import os.path as osp
import shutil
p = Path('others_async_adf/')
# File names will not get compressed to .tar.bz2 when job is not complete.
paths = [x for x in p.rglob('input.ams')]
print('Total_unfinished',len(paths))
for path in tqdm(paths, total=len(paths)):
    # Get rid of the job's working directory
    rmdir = osp.join('.', path.parts[0], path.parts[1])
    shutil.rmtree(rmdir)
    # Get rid of the .h5 file of the job name
    unlink_path = Path(osp.join(path.parts[0], path.parts[2] + '.h5'))
    unlink_path.unlink()
This took on the order of 1 minute to clear out over ~9000 unfinished jobs - it basically just finds a file that only exists for unfinished jobs and removes both the job directory and the .h5 file. This might be a useful workaround for others.