torchx
Azure Batch scheduler implementation
- Modify commands to initialize the scheduler with options that can be defined in config. Most schedulers can operate using scheduler options alone, but in some multi-tenant setups the options need to be pre-configured. For example, Azure Batch allows multiple Batch accounts.
- Initial implementation of the Azure Batch scheduler. Provides support for scheduling, stopping, listing, and describing jobs. Azure Batch has good HPC support, so training jobs should work out of the box.
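To illustrate the first point, here is a minimal sketch of how config-defined options might be combined with options passed on the command line. It is not the code in this PR: the helper names, the INI layout, and the `batch_account_name` option are assumptions made for illustration. Only `image_repo` and the `azure_batch` scheduler name come from the description above.

```python
import configparser
from typing import Dict


def load_scheduler_opts(config_text: str, scheduler: str) -> Dict[str, str]:
    """Read pre-configured options for one scheduler from INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if parser.has_section(scheduler):
        return dict(parser.items(scheduler))
    return {}


def parse_scheduler_args(arg: str) -> Dict[str, str]:
    """Parse a comma-separated 'key=value' string such as 'image_repo=foo'."""
    opts: Dict[str, str] = {}
    for pair in arg.split(","):
        if not pair:
            continue  # tolerate an empty string or a trailing comma
        key, _, value = pair.partition("=")
        opts[key.strip()] = value.strip()
    return opts


def resolve_opts(config_text: str, scheduler: str, cli_args: str) -> Dict[str, str]:
    """Start from config defaults, then let command-line options override them."""
    opts = load_scheduler_opts(config_text, scheduler)
    opts.update(parse_scheduler_args(cli_args))
    return opts


# Hypothetical config: 'batch_account_name' is an assumed option name.
EXAMPLE_CONFIG = """
[azure_batch]
batch_account_name = my-team-account
image_repo = ghcr.io/example/default
"""

if __name__ == "__main__":
    print(resolve_opts(EXAMPLE_CONFIG, "azure_batch", "image_repo=ghcr.io/pytorch/torchx"))
```

With this ordering, a tenant-specific setting such as the Batch account can live in config while per-run options are still overridable from the command line.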
Major things that still need to be addressed:
- Support for patching using Docker (currently it can only run an existing image)
- Creating autoscaling pools that map to resource requirements
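As a rough idea of what the second item could involve, the sketch below picks the smallest VM size that satisfies a role's CPU, GPU, and memory request. This is not implemented in this PR. The `VmSize` type, the `pick_vm_size` helper, and the candidate table are hypothetical, and the listed sizes and their specs are illustrative and should be checked against the Azure documentation before use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VmSize:
    name: str
    cpu: int
    gpu: int
    mem_mb: int


# Illustrative candidates only; verify the specs against Azure's VM size listings.
CANDIDATE_SIZES = (
    VmSize("Standard_D2s_v3", cpu=2, gpu=0, mem_mb=8 * 1024),
    VmSize("Standard_D4s_v3", cpu=4, gpu=0, mem_mb=16 * 1024),
    VmSize("Standard_D8s_v3", cpu=8, gpu=0, mem_mb=32 * 1024),
    VmSize("Standard_NC6s_v3", cpu=6, gpu=1, mem_mb=112 * 1024),
)


def pick_vm_size(
    cpu: int,
    gpu: int,
    mem_mb: int,
    sizes: Sequence[VmSize] = CANDIDATE_SIZES,
) -> Optional[VmSize]:
    """Return the smallest size that fits the request, or None if none fits."""
    fitting = [
        s for s in sizes
        if s.cpu >= cpu and s.gpu >= gpu and s.mem_mb >= mem_mb
    ]
    if not fitting:
        return None
    # Prefer fewer GPUs first, then fewer CPUs, then less memory.
    return min(fitting, key=lambda s: (s.gpu, s.cpu, s.mem_mb))


if __name__ == "__main__":
    print(pick_vm_size(cpu=3, gpu=0, mem_mb=4096))
```

A pool could then be created per selected size, with the autoscale settings derived from the number of replicas requested.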
Testing
- Unit tests
- `python -m torchx.cli.main run --scheduler azure_batch --scheduler_args image_repo=ghcr.io/pytorch/torchx utils.echo --image alpine:latest --msg hello`
[Screenshots: 2023-02-08 at 7 15 27 PM, 7 15 43 PM, 7 15 51 PM]
Codecov Report
Merging #688 (1227cf6) into main (ef01789) will decrease coverage by 0.50%. The diff coverage is 73.10%.
@@ Coverage Diff @@
## main #688 +/- ##
==========================================
- Coverage 92.47% 91.98% -0.50%
==========================================
Files 82 83 +1
Lines 5664 5801 +137
==========================================
+ Hits 5238 5336 +98
- Misses 426 465 +39
Impacted Files | Coverage Δ | |
---|---|---|
torchx/schedulers/__init__.py | 95.23% <ø> (ø) | |
torchx/schedulers/azure_batch_scheduler.py | 67.50% <67.50%> (ø) | |
torchx/cli/argparse_util.py | 100.00% <100.00%> (ø) | |
torchx/cli/cmd_cancel.py | 100.00% <100.00%> (ø) | |
torchx/cli/cmd_describe.py | 100.00% <100.00%> (ø) | |
torchx/cli/cmd_list.py | 100.00% <100.00%> (ø) | |
torchx/cli/cmd_log.py | 95.83% <100.00%> (+0.08%) | :arrow_up: |
torchx/cli/cmd_run.py | 88.88% <100.00%> (+0.09%) | :arrow_up: |
torchx/cli/cmd_runopts.py | 90.90% <100.00%> (+0.90%) | :arrow_up: |
torchx/cli/cmd_status.py | 96.77% <100.00%> (+0.22%) | :arrow_up: |
@kurman any plans to actually commit this?