aiida-core

LSF scheduler: parsing of the job list should change; it loses done jobs and blocks any further work

Open broeder-j opened this issue 6 years ago • 3 comments

In the current LSF scheduler plugin, the job IDs get appended to the bjobs command (for whatever reason).

This has a big drawback: if a job 'disappears' from the bjobs list, which happens often because finished jobs do not stay listed forever, LSF raises a 'Job is not found' error for every missing ID, and AiiDA issues a scheduler error in _parse_joblist_output.

Therefore, if you stop the daemon for a while (say 1 day), your 1000 finished calculations are stuck in some state, unretrievable by the daemon and lost until this is fixed. ALSO, any other job launched or running on this machine (by you with AiiDA) meets the same fate: no states can be set anymore, because _parse_joblist_output always throws this error and stops, making this resource unusable.

For checking on 'older' jobs there is a 'bhist' command.

What should be done (this probably applies to every scheduler): the bjobs output should be checked to see whether it still contains the previously known job IDs. Any job that no longer appears should be marked as DONE (as in the direct scheduler case).
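A minimal sketch of that idea, not the actual plugin code: the function name `infer_job_states` and the plain-string states are made up for illustration, and the assumption is that the scheduler output has already been parsed into a mapping of job ID to state.

```python
from typing import Dict, Iterable


def infer_job_states(requested_ids: Iterable[str],
                     listed_states: Dict[str, str]) -> Dict[str, str]:
    """Return a state for every job ID we asked about.

    Jobs the scheduler still lists keep the state it reported. Jobs that
    no longer appear in the listing are assumed to have finished.
    (Hypothetical helper; state names are placeholders.)
    """
    states = {}
    for job_id in requested_ids:
        # A missing entry is treated as DONE instead of as an error.
        states[job_id] = listed_states.get(job_id, "DONE")
    return states


# Job 102 has already dropped out of the scheduler's list.
print(infer_job_states(["101", "102", "103"], {"101": "RUN", "103": "PEND"}))
```

With this approach a job that vanished from the list never blocks the state update of the jobs that are still listed.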

Also, maybe _parse_joblist_output should be changed so that it does not stop completely when an error occurs while it still has valid input to parse (more robustness).
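Roughly what a more tolerant parser could look like. This is only a sketch: `parse_joblist_tolerant` is a hypothetical name, and the line format (job ID followed by state, whitespace separated) is an assumption for illustration, not the format the plugin actually requests from bjobs.

```python
from typing import Dict, List, Tuple


def parse_joblist_tolerant(stdout: str) -> Tuple[Dict[str, str], List[str]]:
    """Parse what can be parsed and collect the rest instead of raising.

    Assumed line format (illustrative only): "<job_id> <state> ...".
    Returns (states by job ID, lines that could not be parsed).
    """
    states: Dict[str, str] = {}
    skipped: List[str] = []
    for line in stdout.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            states[fields[0]] = fields[1]
        else:
            # Keep the bad line for logging, but carry on with the others.
            skipped.append(line)
    return states, skipped


states, skipped = parse_joblist_tolerant("101 RUN\ngarbage\n102 PEND\n")
print(states, skipped)
```

The caller can log the skipped lines as warnings and still update the states of every job that was parsed correctly.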

(Alternatively, one could catch that error and handle it, if one wants to stick with the job-ID approach.)
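A sketch of that alternative, under the assumption that each missing job produces one error line of the form `Job <ID> is not found`. The exact wording should be checked against the LSF version in use, and `split_scheduler_stderr` is a hypothetical helper name.

```python
import re
from typing import List, Set, Tuple

# Assumed wording of the "missing job" message; verify against real bjobs output.
NOT_FOUND_PATTERN = re.compile(r"^Job <(\d+)> is not found$")


def split_scheduler_stderr(stderr: str) -> Tuple[Set[str], List[str]]:
    """Separate 'job not found' messages from genuine errors.

    Returns (IDs of jobs the scheduler no longer knows, other error lines).
    """
    missing_ids: Set[str] = set()
    other_errors: List[str] = []
    for line in stderr.splitlines():
        text = line.strip()
        if not text:
            continue
        match = NOT_FOUND_PATTERN.match(text)
        if match:
            missing_ids.add(match.group(1))
        else:
            other_errors.append(text)
    return missing_ids, other_errors


missing, errors = split_scheduler_stderr(
    "Job <101> is not found\nJob <102> is not found\n"
)
print(sorted(missing), errors)
```

Only a non-empty list of other errors would then need to be reported as a scheduler error; the missing IDs can be treated as finished jobs.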

broeder-j, Sep 27 '17 11:09