casm-calc fails with duplicate database entries

Open xivh opened this issue 2 years ago • 0 comments

An error message was stored many times in my JobDB as a jobid: 'pbs_iff: cannot read reply from pbs_server\nNo Permission'. This caused db lookups such as

db.select_regex_id("jobid", 'pbs_iff: cannot read reply from pbs_server\nNo Permission')

to fail, because the select_job function in jobdb.py can't handle duplicates. I worked around this by writing a variant of select_job that returns every matching row, and passing the first returned job to delete_job:

def select_duplicate_jobs(self, jobid):
    """Return every row in the jobs table whose jobid matches, duplicates included."""
    if not isinstance(jobid, string_types):
        print("Error in prisms_jobs.JobDB.select_duplicate_jobs(). type(id):", type(jobid), "expected str.")
        sys.exit()
    self.curs.execute("SELECT * FROM jobs WHERE jobid=?", (jobid,))
    # fetchall() returns all matching rows rather than assuming there is only one
    dupes = self.curs.fetchall()
    if len(dupes) == 0:
        raise JobDBError("Error in prisms_jobs.JobDB.select_duplicate_jobs(). jobid: '"
                         + jobid + "' not found in jobs database.")
    return [CompatibilityRow(r) for r in dupes]
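
For anyone who wants to reproduce the situation outside of prisms_jobs, here is a minimal sketch of the same idea using only the standard sqlite3 module. The two-column table and the helper names `select_all_with_jobid` and `delete_all_with_jobid` are hypothetical stand-ins for illustration, not the real prisms_jobs schema or API.

```python
import sqlite3

# Hypothetical, simplified stand-in for the prisms_jobs "jobs" table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE jobs (jobid TEXT, jobname TEXT)")

# The error text that ended up stored in place of a real job ID.
BAD_ID = "pbs_iff: cannot read reply from pbs_server\nNo Permission"
conn.executemany(
    "INSERT INTO jobs (jobid, jobname) VALUES (?, ?)",
    [(BAD_ID, "job_a"), (BAD_ID, "job_b"), ("5090221", "job_c")],
)


def select_all_with_jobid(conn, jobid):
    """Return every row whose jobid matches, as a list of dicts."""
    rows = conn.execute("SELECT * FROM jobs WHERE jobid=?", (jobid,)).fetchall()
    return [dict(r) for r in rows]


def delete_all_with_jobid(conn, jobid):
    """Delete every row whose jobid matches and return how many were removed."""
    cur = conn.execute("DELETE FROM jobs WHERE jobid=?", (jobid,))
    conn.commit()
    return cur.rowcount


dupes = select_all_with_jobid(conn, BAD_ID)
removed = delete_all_with_jobid(conn, BAD_ID)
remaining = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(len(dupes), removed, remaining)
```

Matching on the exact string with a parameterized query avoids having to escape the error text as a regular expression.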

I am also wondering if this issue could come up if the job ids on the cluster are reset/lost because the queue crashes. I noticed in the casm-calc output that it is often finding an existing JobID, but it seems to be running fine.
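
One way to check whether reused job IDs are actually present would be to count rows per jobid. The sketch below assumes a simplified, hypothetical table layout and queries it directly with sqlite3; `find_repeated_jobids` is an illustrative helper, not part of prisms_jobs.

```python
import sqlite3

# Hypothetical, simplified jobs table; the real one has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (jobid TEXT, creationtime INTEGER)")
conn.executemany(
    "INSERT INTO jobs (jobid, creationtime) VALUES (?, ?)",
    [("100", 1), ("101", 2), ("100", 3)],  # "100" appears twice
)


def find_repeated_jobids(conn):
    """Map each jobid that occurs more than once to its row count."""
    rows = conn.execute(
        "SELECT jobid, COUNT(*) FROM jobs "
        "GROUP BY jobid HAVING COUNT(*) > 1 ORDER BY jobid"
    ).fetchall()
    return {jobid: count for jobid, count in rows}


repeated = find_repeated_jobids(conn)
print(repeated)
```

An empty result would suggest that the "existing JobID" messages come from something other than duplicate rows.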

Update: actually, they have all failed. Maybe this is a separate issue, but casm-calc reported that a JobID was found, printed out the list of nodes, and then hung there. Deleting the job from the db and resubmitting was successful.

{'jobid': '5090221',
 'jobname': 'SCEL5_5_1_1_0_1_3.1213',
 'rundir': '/home/TaN/casm/irrep_phonon_modes/training_data/SCEL5_5_1_1_0_1_3/1213/calctype.default',
 'jobstatus': '?',
 'auto': 1,
 'taskstatus': 'Error: Not converging',
 'continuation_jobid': '-',
 'qsubstr': '#!/bin/sh\n#PBS -S /bin/sh\n#PBS -N SCEL5_5_1_1_0_1_3.1213\n#PBS -l walltime=10:00:00\n#PBS -l nodes=1:ppn=4\n#PBS -q batch\n#PBS -V\n#PBS -p 0\n\n#auto=True\n\necho "I ran on:"\ncat $PBS_NODEFILE\n\ncd $PBS_O_WORKDIR\npython -c "import casm.vaspwrapper; obj = casm.vaspwrapper.Relax.from_configuration_dir(\'/home/TaN/casm/irrep_phonon_modes/training_data/SCEL5_5_1_1_0_1_3/1213\', \'default\'); obj.run()"\n\n',
 'qstatstr': '-',
 'nodes': 1,
 'procs': 4,
 'walltime': 36000,
 'elapsedtime': None,
 'creationtime': 1663802985,
 'starttime': None,
 'completiontime': None,
 'modifytime': 1663804062}

xivh avatar Sep 21 '22 23:09 xivh