
Azure batch scheduler implementation

Open kurman opened this issue 2 years ago • 3 comments

  1. Modify the CLI commands to initialize the scheduler with options that can be defined in the config. Most schedulers can operate using scheduler options alone, but some multi-tenant setups need to be pre-configured. For example, Azure Batch allows multiple Batch accounts.
  2. Initial implementation of the Azure Batch scheduler, with support for scheduling, stopping, listing, and describing jobs. Azure Batch has good HPC support, so training jobs should work out of the box.
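To illustrate the first point, here is a minimal sketch of resolving scheduler options from a config file and letting `--scheduler_args` override them. It is not the code from this PR. It assumes an INI-style config with one section per scheduler (the layout `.torchxconfig` uses); the option names `batch_account_name` and `batch_account_url`, and the helpers `load_scheduler_opts`, `parse_scheduler_args`, and `resolve_opts`, are hypothetical.

```python
import configparser
from typing import Dict

# Hypothetical config contents; the azure_batch option names are assumptions.
EXAMPLE_CONFIG = """
[azure_batch]
batch_account_name = team-a-batch
batch_account_url = https://team-a-batch.westus2.batch.azure.com
image_repo = ghcr.io/pytorch/torchx
"""


def load_scheduler_opts(config_text: str, scheduler: str) -> Dict[str, str]:
    """Read the options stored under the section named after the scheduler."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section(scheduler):
        return {}
    return dict(parser.items(scheduler))


def parse_scheduler_args(arg: str) -> Dict[str, str]:
    """Parse a comma-delimited 'key=value,key=value' string."""
    opts: Dict[str, str] = {}
    for pair in filter(None, arg.split(",")):
        key, _, value = pair.partition("=")
        opts[key.strip()] = value.strip()
    return opts


def resolve_opts(config_text: str, scheduler: str, cli_args: str) -> Dict[str, str]:
    """Start from the config defaults, then apply command-line overrides."""
    opts = load_scheduler_opts(config_text, scheduler)
    opts.update(parse_scheduler_args(cli_args))
    return opts


if __name__ == "__main__":
    print(resolve_opts(EXAMPLE_CONFIG, "azure_batch", "image_repo=example.io/other"))
```

With this layout, a multi-tenant deployment could pin the Batch account in the config once, while individual runs still override per-job options such as `image_repo` on the command line.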

Major things that need to be addressed:

  • Support for patching images using Docker (currently it can only run an existing image)
  • Creating autoscaling pools that map to resource requirements
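One possible approach to the second item is to pick the smallest VM size that satisfies a role's CPU, GPU, and memory request, and then create or reuse a pool of that size. The sketch below is an assumption about how such a mapping could look, not part of this PR. `Requirement`, `VmSize`, `CATALOG`, and `pick_vm_size` are hypothetical names, and the catalog entries are illustrative and should be checked against the Azure VM size documentation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Requirement:
    """Per-replica resource request (mirrors the cpu/gpu/memory idea of a role)."""
    cpu: int
    gpu: int
    mem_mb: int


@dataclass(frozen=True)
class VmSize:
    name: str
    cpu: int
    gpu: int
    mem_mb: int


# Illustrative catalog; verify the figures before relying on them.
CATALOG = (
    VmSize("Standard_D2s_v3", cpu=2, gpu=0, mem_mb=8 * 1024),
    VmSize("Standard_D4s_v3", cpu=4, gpu=0, mem_mb=16 * 1024),
    VmSize("Standard_NC6s_v3", cpu=6, gpu=1, mem_mb=112 * 1024),
    VmSize("Standard_NC24s_v3", cpu=24, gpu=4, mem_mb=448 * 1024),
)


def pick_vm_size(req: Requirement, catalog: Sequence[VmSize] = CATALOG) -> Optional[VmSize]:
    """Return the smallest VM size that covers the request, or None if none fits."""
    candidates = [
        vm for vm in catalog
        if vm.cpu >= req.cpu and vm.gpu >= req.gpu and vm.mem_mb >= req.mem_mb
    ]
    # Prefer fewer GPUs first, then fewer CPUs, then less memory.
    return min(candidates, key=lambda vm: (vm.gpu, vm.cpu, vm.mem_mb), default=None)


if __name__ == "__main__":
    print(pick_vm_size(Requirement(cpu=4, gpu=0, mem_mb=8 * 1024)))
```

Ordering candidates by GPU count first keeps CPU-only jobs off GPU nodes, which are usually the most expensive resource in a pool.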

Testing

  • Unit tests
  • python -m torchx.cli.main run --scheduler azure_batch --scheduler_args image_repo=ghcr.io/pytorch/torchx utils.echo --image alpine:latest --msg hello

kurman avatar Feb 09 '23 03:02 kurman

Codecov Report

Merging #688 (1227cf6) into main (ef01789) will decrease coverage by 0.50%. The diff coverage is 73.10%.

@@            Coverage Diff             @@
##             main     #688      +/-   ##
==========================================
- Coverage   92.47%   91.98%   -0.50%     
==========================================
  Files          82       83       +1     
  Lines        5664     5801     +137     
==========================================
+ Hits         5238     5336      +98     
- Misses        426      465      +39     
| Impacted Files | Coverage Δ |
|---|---|
| torchx/schedulers/__init__.py | 95.23% <ø> (ø) |
| torchx/schedulers/azure_batch_scheduler.py | 67.50% <67.50%> (ø) |
| torchx/cli/argparse_util.py | 100.00% <100.00%> (ø) |
| torchx/cli/cmd_cancel.py | 100.00% <100.00%> (ø) |
| torchx/cli/cmd_describe.py | 100.00% <100.00%> (ø) |
| torchx/cli/cmd_list.py | 100.00% <100.00%> (ø) |
| torchx/cli/cmd_log.py | 95.83% <100.00%> (+0.08%) :arrow_up: |
| torchx/cli/cmd_run.py | 88.88% <100.00%> (+0.09%) :arrow_up: |
| torchx/cli/cmd_runopts.py | 90.90% <100.00%> (+0.90%) :arrow_up: |
| torchx/cli/cmd_status.py | 96.77% <100.00%> (+0.22%) :arrow_up: |


codecov[bot] avatar Feb 09 '23 03:02 codecov[bot]

@kurman any plans to actually commit this?

kiukchung avatar Jan 02 '24 19:01 kiukchung