hpc-rocket icon indicating copy to clipboard operation
hpc-rocket copied to clipboard

Work in progress: Support PBS scheduler

Open ignatiusm opened this issue 1 year ago • 8 comments

Related to issue #37.

Thanks for developing HPC Rocket! I'd like to help with adding features (specifically around adding PBS support). Before I go too far, I thought I'd submit this draft Pull Request to start a discussion.

This PR so far includes:

  • Slight refactoring of config to generalise from sbatch to job
  • Refactoring of tests to work with the new config arrangement
  • Adding config to allow specifying a scheduler
  • Adding pbscontroller and pbsbatchjob
  • Tweaks to UI to partially handle different logs from PBS
  • Some partial edits to docs with new config settings

As noted in the title, this is very much 'work in progress'. Please excuse the partial implementation (particularly for config parsing, and conditions in hpcrocket/core/application.py). I wanted to get something started and then create the PR so we had a place to discuss things.

Please let me know if you are interested in contributions from the other side of the world, and if so, feel free to make any suggestions for how best to proceed. :)

ignatiusm avatar Jun 07 '24 01:06 ignatiusm

Thanks a lot for the contribution! Work towards supporting PBS is very welcome as I don't have access to a cluster using it and haven't the time to set up a test environment myself.

I'll review your PR within the next week and will get back to you as soon as I can!

SvenMarcus avatar Jun 07 '24 10:06 SvenMarcus

Yay! I'm pleased this PR is welcome. I realise that I need to get the tests passing again, so I'll do some more work on that.

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

ignatiusm avatar Jun 09 '24 21:06 ignatiusm

Yay! I'm pleased this PR is welcome. I realise that I need to get the tests passing again, so I'll do some more work on that.

I have a pretty good idea of what needs to be done for that, but it'll require some work to build that flexibility and still satisfy mypy. I think I can work on getting the existing tests working today, but we still need some for the PbsController and PbsBatchJob. Could you implement some test cases for those classes? We don't need a real PBS system for that, if you have a look in the test folder, there is a slurmoutput sub directory containing examples of common slurm output. It would be great if you could provide something similar for PBS.

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

That would be amazing! I've had some trouble with this guide, but maybe that's because I'm working with an Apple Silicon processor. I think we can also get around implementing the local executor if we simply enable the SSH daemon in the container.

SvenMarcus avatar Jun 12 '24 09:06 SvenMarcus

I extended your PR to make the tests pass. Please rebase onto this branch before making more changes: https://github.com/SvenMarcus/hpc-rocket/tree/Garvan-Data-Science-Platform-add-pbs-controller

There are lots of changed files, but the changes mostly deal with introducing generalized Scheduler, BatchJob, JobStatus and TaskStatus classes and protocols (now found in the new core/schedulers package). After refactoring a bit, PbsBatchJob and SlurmBatchJob turned out to be exactly the same, so I deleted those and we can simply use the generalized BatchJob class now.

I have also slightly changed your proposed format of the config file by renaming file to script. It is now also a yaml object instead of an array.

job:
    script: jobfile.sh
    scheduler: slurm

That's because hpc-rocket already supported this (undocumented) format to avoid adding an extra copy instruction for the slurm script:

job:
   script: remote_jobfile.sh
   from: local_jobfile_path.sh

SvenMarcus avatar Jun 12 '24 10:06 SvenMarcus

Hi @SvenMarcus ,

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

That would be amazing! I've had some trouble with this guide, but maybe that's because I'm working with an Apple Silicon processor. I think we can also get around implementing the local executor if we simply enable the SSH daemon in the container.

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

ignatiusm avatar Jun 18 '24 21:06 ignatiusm

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

Thanks! I'll have a look at it as soon as I find some time. That might take a while, though. But I think we can still move ahead with the PR if we add some tests with dummy PBS output.

SvenMarcus avatar Jun 24 '24 09:06 SvenMarcus

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

Thanks! I'll have a look at it as soon as I find some time. That might take a while, though. But I think we can still move ahead with the PR if we add some tests with dummy PBS output.

Apologies for the super slow progress with this. I've been caught up in other things too alas. I've pushed some dummy pbs output. I'll work on tests early next week.

ignatiusm avatar Jun 27 '24 06:06 ignatiusm

Apologies for the super slow progress with this. I've been caught up in other things too alas. I've pushed some dummy pbs output. I'll work on tests early next week.

No need to rush, take all the time you need! I really appreciate that you're taking the time to contribute to hpc-rocket! 🙂

Let me know if you run into any trouble getting things working!

SvenMarcus avatar Jul 02 '24 15:07 SvenMarcus