CASMcode
casm-calc fails with duplicate database entries
The same error message was stored as the jobid of many entries in my JobDB: 'pbs_iff: cannot read reply from pbs_server\nNo Permission'. This caused db lookups to fail, e.g.
db.select_regex_id("jobid", 'pbs_iff: cannot read reply from pbs_server\nNo Permission')
because the select_job function in jobdb.py can't handle duplicates. I worked around this by adding a new method, select_duplicate_jobs, which returns every matching row instead of a single one, and passing the first returned job to delete_job:
def select_duplicate_jobs(self, jobid):
    """Return every row in the jobs table whose jobid matches, as a list."""
    if not isinstance(jobid, string_types):
        print("Error in prisms_jobs.JobDB.select_duplicate_jobs(). type(id):", type(jobid), "expected str.")
        sys.exit()
    self.curs.execute("SELECT * FROM jobs WHERE jobid=?", (jobid,))
    # fetchall() keeps all duplicates rather than just the first match
    dupes = self.curs.fetchall()  # pylint: disable=invalid-name
    if not dupes:
        raise JobDBError("Error in prisms_jobs.JobDB.select_duplicate_jobs(). jobid: '"
                         + jobid + "' not found in jobs database.")
    return [CompatibilityRow(r) for r in dupes]
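For anyone hitting the same thing, here is a rough sketch of how the duplicates could be found and removed in one pass directly with sqlite3, without going through prisms_jobs. It assumes only that the database has a `jobs` table with a `jobid` column, as in the query above. The two-column table in `demo()` and the helper names `find_duplicate_jobids` and `delete_jobid` are made up for illustration and are not part of prisms_jobs.

```python
import sqlite3


def find_duplicate_jobids(conn):
    """Return {jobid: count} for every jobid stored more than once."""
    rows = conn.execute(
        "SELECT jobid, COUNT(*) FROM jobs GROUP BY jobid HAVING COUNT(*) > 1"
    ).fetchall()
    return {jobid: count for jobid, count in rows}


def delete_jobid(conn, jobid):
    """Delete every row with this jobid and return how many were removed."""
    cur = conn.execute("DELETE FROM jobs WHERE jobid=?", (jobid,))
    conn.commit()
    return cur.rowcount


def demo():
    # Hypothetical, simplified schema: the real jobs table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (jobid TEXT, jobname TEXT)")
    bad = "pbs_iff: cannot read reply from pbs_server\nNo Permission"
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?)",
        [("5090221", "a"), (bad, "b"), (bad, "c"), (bad, "d")],
    )
    dupes = find_duplicate_jobids(conn)
    removed = sum(delete_jobid(conn, jobid) for jobid in dupes)
    remaining = [row[0] for row in conn.execute("SELECT jobid FROM jobs")]
    return dupes, removed, remaining
```

Back up the database file before trying anything like this against a real JobDB, since the delete removes every row sharing the duplicated jobid.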
I am also wondering if this issue could come up if the job ids on the cluster are reset/lost because the queue crashes. I noticed in the casm-calc output that it is often finding an existing JobID, but it seems to be running fine.
Update: actually, they have all failed. Maybe this is a separate issue, but casm-calc reported that a JobID was found, printed out the list of nodes, and then hung there. Deleting the job from the db and resubmitting was successful.
{'jobid': '5090221',
 'jobname': 'SCEL5_5_1_1_0_1_3.1213',
 'rundir': '/home/TaN/casm/irrep_phonon_modes/training_data/SCEL5_5_1_1_0_1_3/1213/calctype.default',
 'jobstatus': '?',
 'auto': 1,
 'taskstatus': 'Error: Not converging',
 'continuation_jobid': '-',
 'qsubstr': '#!/bin/sh\n#PBS -S /bin/sh\n#PBS -N SCEL5_5_1_1_0_1_3.1213\n#PBS -l walltime=10:00:00\n#PBS -l nodes=1:ppn=4\n#PBS -q batch\n#PBS -V\n#PBS -p 0\n\n#auto=True\n\necho "I ran on:"\ncat $PBS_NODEFILE\n\ncd $PBS_O_WORKDIR\npython -c "import casm.vaspwrapper; obj = casm.vaspwrapper.Relax.from_configuration_dir(\'/home/TaN/casm/irrep_phonon_modes/training_data/SCEL5_5_1_1_0_1_3/1213\', \'default\'); obj.run()"\n\n',
 'qstatstr': '-',
 'nodes': 1,
 'procs': 4,
 'walltime': 36000,
 'elapsedtime': None,
 'creationtime': 1663802985,
 'starttime': None,
 'completiontime': None,
 'modifytime': 1663804062}
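A possible guard against the original problem would be to check that whatever the submit command returned actually looks like a job ID before storing it. The sketch below is only an assumption about what a valid ID looks like: digits, optionally followed by a `.server` suffix, which matches the '5090221' in the record above but may not match every scheduler. The name `looks_like_jobid` is hypothetical and not part of prisms_jobs.

```python
import re

# Assumption: job IDs are a run of digits with an optional ".<server>" suffix.
_JOBID_RE = re.compile(r"\d+(\.[\w.-]+)?")


def looks_like_jobid(value):
    """Return True if value is a string shaped like a scheduler job ID."""
    return isinstance(value, str) and _JOBID_RE.fullmatch(value) is not None


def split_records(records):
    """Split job records (dicts with a 'jobid' key) into (valid, suspect)."""
    valid = [r for r in records if looks_like_jobid(r.get("jobid"))]
    suspect = [r for r in records if not looks_like_jobid(r.get("jobid"))]
    return valid, suspect
```

With a check like this, an error string such as the pbs_iff message would land in the suspect list instead of being stored as a jobid.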