aiida-core

lsf scheduler parsing of job list should change, losing done jobs, blocking any further work

Open broeder-j opened this issue 7 years ago • 3 comments

In the current lsf scheduler plugin the job ids get appended to the bjobs command (for whatever reason).

This has a big drawback: if a job 'disappears' from the bjobs list, which happens often because jobs do not stay there forever after they finish, lsf raises a 'Job is not found' error for every missing id, and AiiDA issues a scheduler error in _parse_joblist_output.

Therefore, if you stop the daemon for a while (say 1 day), your 1000 finished calculations are stuck in some state, unretrievable for the daemon and lost until this is fixed. ALSO, any other job you have launched or have running on this machine through AiiDA meets the same fate: no states can be set anymore because _parse_joblist_output always throws this error and stops, which makes the resource unusable.

For checking on 'older' jobs there is a 'bhist' command.

What should be done (probably applies to every scheduler): the bjobs output should be checked for the previously submitted job ids. Any job id that no longer appears there should be marked as DONE (as in the direct scheduler case).
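A minimal sketch of that reconciliation step, not the plugin's actual code. The names `JobState` and `reconcile_job_states` are hypothetical, and the sketch assumes the parser has already produced a mapping from job id to state for every job that bjobs still lists:

```python
from enum import Enum


class JobState(Enum):
    """Hypothetical, simplified set of job states for this sketch."""
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


def reconcile_job_states(requested_ids, parsed_states):
    """Return a state for every requested job id.

    requested_ids: job ids the daemon asked the scheduler about.
    parsed_states: dict of job id -> JobState for jobs still listed.
    Ids that the scheduler no longer lists are assumed to be DONE.
    """
    return {
        job_id: parsed_states.get(job_id, JobState.DONE)
        for job_id in requested_ids
    }
```

With this, a job that has dropped out of the list no longer causes an error; it simply comes back as DONE and can be retrieved.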

Also, maybe _parse_joblist_output should be changed so that it does not stop completely when an error occurs while it still has some valid input to parse (more robustness).

(Alternatively, one could catch that error and handle it, if one wants to keep querying by job id.)
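A rough sketch of what a more tolerant parser could look like. Everything here is an assumption for illustration: the function name `parse_joblist_tolerant` is hypothetical, the stdout format is simplified to one `jobid|state` pair per line (the real plugin uses its own output format), and the error text is assumed to look like `Job <ID> is not found`:

```python
import re

# Assumed shape of the LSF message for a job id that is no longer listed.
NOT_FOUND = re.compile(r"Job <(\d+)> is not found")


def parse_joblist_tolerant(stdout, stderr):
    """Parse what can be parsed instead of failing on the first problem.

    Returns (states, missing, unparsed):
      states   - dict of job id -> raw state string from stdout
      missing  - job ids reported as not found on stderr
      unparsed - non-empty lines that matched neither pattern
    """
    states = {}
    missing = []
    unparsed = []

    for line in stderr.splitlines():
        match = NOT_FOUND.search(line)
        if match:
            missing.append(match.group(1))
        elif line.strip():
            unparsed.append(line)

    for line in stdout.splitlines():
        fields = line.strip().split("|")
        if len(fields) == 2 and fields[0].isdigit():
            states[fields[0]] = fields[1]
        elif line.strip():
            unparsed.append(line)

    return states, missing, unparsed
```

The caller can then treat the `missing` ids as finished and log the `unparsed` lines as warnings, so one vanished job does not block state updates for all the others.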

broeder-j avatar Sep 27 '17 11:09 broeder-j

I do not know if I will manage to solve it today; otherwise I will use the coding week. But I would like to have some feedback from the lsf designers.

broeder-j avatar Sep 27 '17 11:09 broeder-j

In the direct scheduler case the GetJob method is extended to do this. The other methods might not need to be changed. For my purposes I commented out the error throw and overrode GetJob as in the direct scheduler case... Though there should be a better solution...

broeder-j avatar Sep 27 '17 12:09 broeder-j

Looking back into this: the reason why it queries by passing a list of jobs and not by user is given in the implementation, see https://github.com/aiidateam/aiida_core/blob/d0de87cd3569b54e2d6fb8f0c12d5645bcabe96c/aiida/scheduler/plugins/lsf.py#L169

As we don't have access to LSF: do you have a command equivalent to this https://github.com/aiidateam/aiida_core/blob/d0de87cd3569b54e2d6fb8f0c12d5645bcabe96c/aiida/scheduler/plugins/lsf.py#L260 but to get all jobs of a given user, rather than by list of IDs? (probably @nmounet adapted this from another scheduler where this was not possible).

If this is possible, could you try to adapt the function (you can check other scheduler plugins that support it), set the _can_query_by_user to True, and test it a bit to see if it works? Thanks!
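For illustration only, a sketch of how the command construction could branch between the two query modes. It assumes that bjobs accepts `-u <user>` to list a user's jobs and job ids as positional arguments; this has not been tested against a real LSF installation, the function name `build_joblist_command` is hypothetical, and the output-format options the plugin passes are left out:

```python
def build_joblist_command(user=None, job_ids=None):
    """Build a bjobs argument list, querying either by user or by job ids.

    Assumption: `bjobs -u <user>` lists that user's jobs and
    `bjobs <id> [<id> ...]` lists specific jobs.
    """
    if user and job_ids:
        raise ValueError("query by user or by job ids, not both")

    command = ["bjobs"]
    if user:
        command += ["-u", user]
    elif job_ids:
        command += [str(job_id) for job_id in job_ids]
    return command
```

Querying by user would avoid the 'Job is not found' error entirely, since no specific ids are requested; jobs that are absent from the output can then be treated as finished.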

giovannipizzi avatar May 03 '18 07:05 giovannipizzi