Handling of `QUEUE_OPTION [..] QUEUE` in Scheduler

Open pinkwah opened this issue 1 year ago • 1 comments

We can make the following assumptions:

  1. Each HPC system has a way to choose a named queue.
  2. Each HPC system has a default queue. That is, qsub /usr/bin/true will execute on some queue even if none is specified.
  3. The user may enter their preferred queue, which the driver must attempt to use.
  4. The user may enter incorrect information.

The LocalDriver is an exception, but we may pretend it has a queue called local.

We may check whether the queue is valid and exit early if there is an issue with their chosen queue. To do this, we should extend Driver with the following method:

    async def use_queue(self, queue_name: str) -> None:
        """
        Submit jobs to `queue_name` queue.

        Raises:
            ValueError if there is an issue with the queue
        """

For LocalDriver this function does nothing. LSFDriver may run bqueues to verify that the user's queue exists, and raise ValueError (or an appropriate exception) if it doesn't.
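
A minimal sketch of what this could look like, using stand-in classes rather than ERT's actual Driver hierarchy. The `bqueues_cmd` parameter is hypothetical (it only makes the command replaceable), and the sketch assumes that `bqueues <queue_name>` exits with a non-zero code when the queue is unknown:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class Driver(ABC):
    """Stand-in base class; not ERT's real Driver."""

    @abstractmethod
    async def use_queue(self, queue_name: str) -> None:
        """Submit jobs to `queue_name`; raise ValueError if it is unusable."""


class LocalDriver(Driver):
    async def use_queue(self, queue_name: str) -> None:
        # There is no real queue to validate, so this is a no-op.
        return None


class LSFDriver(Driver):
    def __init__(self, bqueues_cmd: str = "bqueues") -> None:
        # Hypothetical parameter: path or name of the bqueues executable.
        self._bqueues_cmd = bqueues_cmd
        self._queue_name: Optional[str] = None

    async def use_queue(self, queue_name: str) -> None:
        try:
            process = await asyncio.create_subprocess_exec(
                self._bqueues_cmd,
                queue_name,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
        except FileNotFoundError as err:
            raise ValueError(f"Could not run {self._bqueues_cmd!r}") from err

        _, stderr = await process.communicate()
        # Assumption: a non-zero exit code means the queue was rejected.
        if process.returncode != 0:
            message = stderr.decode(errors="replace").strip()
            raise ValueError(f"Queue {queue_name!r} was rejected: {message}")

        # Only remember the queue once it has passed the check.
        self._queue_name = queue_name
```

Calling `await driver.use_queue("mr7")` once, before any job is submitted, would then surface a bad queue name as a single ValueError.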

This makes it possible to check that the queue seems okay long before we submit any jobs. Maybe we can have a @classmethod function check_queue which is run when the GUI starts up, so that we can show an error message to the user.

pinkwah avatar Feb 05 '24 15:02 pinkwah

To emphasize the importance of (at least some) pre-validation: I've noticed error messages of the following type in the logs:

Exception in scheduler task job-8_task: Command .... failed after 10 retries with exit code 255, output: "<empty>", and error: "mr7: No such queue. Job not submitted

The logs easily get cluttered by this, since a wrong queue name naturally makes the submission fail for every realization.

xjules avatar Jun 25 '24 11:06 xjules

Referenced in this one: #8116

xjules avatar Aug 19 '24 12:08 xjules

@sondreso do you think it is fine to close this one, as there is no reasonable and easy way to check it?

xjules avatar Aug 26 '24 12:08 xjules

Why is the way using bqueues as outlined in the issue not an option? 🤔

sondreso avatar Aug 29 '24 12:08 sondreso

> Why is the way using bqueues as outlined in the issue not an option? 🤔

bqueues is flaky and can itself fail, i.e. it is not a reliable source of working queues. Additionally, this would prolong the validation step substantially. @berland was there anything else we discussed?

xjules avatar Sep 02 '24 10:09 xjules

Fixing this issue has merely been deprioritized; it is not impossible to do. The PoC used qsub directly, which can give the same kind of information as bqueues (although it cannot list the allowed queue names).

The upside is not clear: it would reduce the log output on screen for those running the GUI, provided we are willing to wait for the result of the checks.
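
A sketch of how the waiting could be bounded, assuming a check coroutine that raises ValueError for a bad queue (as in the use_queue proposal). The function name `precheck_queue` and the five-second default are hypothetical. A check that times out is treated as inconclusive, so a slow or flaky queue system does not block start-up:

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def precheck_queue(
    check: Callable[[str], Awaitable[None]],
    queue_name: str,
    timeout: float = 5.0,
) -> Optional[str]:
    """Return an error message if the queue is known to be bad, else None."""
    try:
        await asyncio.wait_for(check(queue_name), timeout=timeout)
    except asyncio.TimeoutError:
        # Inconclusive: the check did not answer in time, so do not block.
        return None
    except ValueError as err:
        return str(err)
    return None
```

The trade-off is that a timed-out check lets a bad queue name through, so errors at submit time still need to be reported clearly.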

berland avatar Sep 02 '24 10:09 berland

> Fixing this issue has merely been downprioritized, it is not impossible to do.

This was my impression as well, and then I don't think the issue should be closed.

The intent of this issue, to give early and precise feedback to the user in case of problems with the queue system, is something we should strive for. (This issue is perhaps focused on the technical side of the problem, whereas the issue that was closed as a duplicate of this one is more focused on the user experience: https://github.com/equinor/ert/issues/8116.) While we might not be able to do this in the suggester for performance reasons, there is still a lot of room for accurate error messages when something goes wrong while submitting jobs.

Also, if we close issues for technical reasons or because of implementation difficulty, we should document why in the issue. That makes it a lot easier to re-assess the issue in the future if the assumptions change.

sondreso avatar Sep 02 '24 13:09 sondreso