aiida-core
LSF scheduler: parsing of the job list should change; it loses done jobs and blocks any further work
In the current LSF scheduler plugin, the job IDs get appended to the bjobs command (for whatever reason).
This has a big drawback:
If a job 'disappears' from the bjobs list, which jobs often do because they do not stay in it forever after they finish, then for every missing ID a 'Job
Therefore, if you stop the daemon for a while (say, 1 day), your 1000 finished calculations are stuck in some state, unretrievable by the daemon and lost until this is fixed. Also, any other job that you launched through AiiDA on this machine, or that is still running there, will meet the same fate: no states can be set anymore, because _parse_joblist_output always throws this error and stops, which makes the resource unusable.
For checking on 'older' jobs there is a 'bhist' command.
What should be done (this probably applies to every scheduler): the bjobs output should be checked for whether it still contains the previously known job IDs. Any job that is no longer listed should be marked as DONE (as in the direct scheduler case).
Also, maybe _parse_joblist_output should be changed so that it does not stop completely when an error occurs while it still has valid input to parse (more robustness).
(Alternatively, one could catch that error and handle it, if one wants to stay with the job-ID-based approach.)
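To make the proposal concrete, here is a minimal sketch of the tolerant behaviour, not the actual plugin code. It assumes a simplified, hypothetical job-list format of one `<job_id> <state>` pair per line; the real bjobs output that the plugin parses looks different. The function name `parse_joblist_tolerant` and the sample "not found" line are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple


def parse_joblist_tolerant(
    stdout: str, requested_ids: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Parse a simplified job list without aborting on bad lines.

    Returns a mapping of job ID to state, plus the lines that were skipped.
    Hypothetical input format (an assumption): "<job_id> <state>" per line.
    """
    states: Dict[str, str] = {}
    skipped: List[str] = []

    for raw_line in stdout.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        fields = line.split()
        # Keep going on lines that do not look like a job entry,
        # instead of raising and discarding the valid entries.
        if len(fields) < 2 or not fields[0].isdigit():
            skipped.append(line)
            continue
        states[fields[0]] = fields[1]

    # Any requested job that the scheduler no longer lists is treated as DONE.
    for job_id in requested_ids:
        states.setdefault(str(job_id), "DONE")

    return states, skipped


sample_output = "101 RUN\nJob <102> is not found\n103 PEND\n"
states, skipped = parse_joblist_tolerant(sample_output, ["101", "102", "103"])
```

With this approach one unparsable line no longer blocks state updates for the other jobs, and the skipped lines can still be logged as warnings.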