xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
## Fixes / Features
- Enable Cloud DNS on Pathways clusters.
- Add user job conditionally based on whether headless mode is defined or not.
- Print out proxy address...
## Fixes / Features
- remove nccl plugin installation in workload creation on A3
- enable cluster and workload creation on A3+

## Testing / Documentation
Manual tests passed for...
## Fixes / Features
- hot fix for reservation
- enable create workload for h150
- remove nccl plugin installation in workload creation

## Testing / Documentation
Testing details.
-...
## Fixes / Features
- Add podFailurePolicy so SIGTERMed Pathways workers do not count against the default backoffLimit (4).

## Testing / Documentation
TODO
- [ y ] Tests pass...
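The podFailurePolicy change above can be illustrated with a minimal sketch. The source does not show the rule xpk actually ships, so this assumes that a SIGTERMed worker exits with the conventional code 143 (128 + 15) and that the policy uses an `Ignore` rule matched on exit codes. The helper names `build_pod_failure_policy` and `counts_against_backoff_limit` are hypothetical, and the second function is only a simplified model of how Kubernetes evaluates such a rule.

```python
# Conventional exit code for a process terminated by SIGTERM (signal 15).
SIGTERM_EXIT_CODE = 128 + 15


def build_pod_failure_policy(ignored_exit_codes):
    """Return a Job podFailurePolicy fragment that ignores the given exit codes.

    Hypothetical helper: the real xpk rule may match on different criteria.
    """
    return {
        "rules": [
            {
                "action": "Ignore",
                "onExitCodes": {
                    "operator": "In",
                    "values": sorted(ignored_exit_codes),
                },
            }
        ]
    }


def counts_against_backoff_limit(exit_code, policy):
    """Simplified model: a failure counts unless an Ignore rule lists its exit code."""
    for rule in policy["rules"]:
        codes = rule.get("onExitCodes", {})
        if (
            rule["action"] == "Ignore"
            and codes.get("operator") == "In"
            and exit_code in codes.get("values", [])
        ):
            return False
    return True


policy = build_pod_failure_policy({SIGTERM_EXIT_CODE})
```

Under these assumptions, `counts_against_backoff_limit(143, policy)` is false while an ordinary crash such as exit code 1 still counts toward the limit.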
Improvements:
1. Use enum for accelerator type
2. remove usage of device type based on --tpu-type / --device-type check everywhere. Do this in one place.
3. Remove h100 device specific...
## Fixes / Features
- Remove default values for proxy-server and server images.
- Ensure user provides both proxy-server-image and server-image when --use-pathways is set and vice-versa.
- Validate that...
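The "both images with --use-pathways, and vice-versa" rule can be sketched with `argparse`. This is not xpk's actual parser: the function name `parse_pathways_args` and the error messages are hypothetical, and the flag spellings `--proxy-server-image` and `--server-image` are assumed from the names in the note above.

```python
import argparse


def parse_pathways_args(argv):
    """Parse the three Pathways-related flags and enforce that they go together."""
    parser = argparse.ArgumentParser(prog="pathways-flags-sketch")
    parser.add_argument("--use-pathways", action="store_true")
    # No default images: the user must supply them explicitly.
    parser.add_argument("--proxy-server-image", default=None)
    parser.add_argument("--server-image", default=None)
    args = parser.parse_args(argv)

    images = [args.proxy_server_image, args.server_image]
    if args.use_pathways and not all(images):
        parser.error(
            "--use-pathways requires both --proxy-server-image and --server-image"
        )
    if not args.use_pathways and any(images):
        parser.error(
            "--proxy-server-image and --server-image require --use-pathways"
        )
    return args
```

`parser.error` prints a usage message and exits with status 2, so an inconsistent combination fails before any cluster work starts.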
## Fixes
- Add gpu_multi_process_run.sh

## Testing / Documentation
Testing details.
- [ y ] Tests pass
- [ y ] Appropriate changes to documentation are included in the PR
## Fixes / Features
- Supports tpu-topology flag for specifying custom topologies for TPUs

## Testing / Documentation
Added a check to make sure the format of topologies fits...
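A format check of the kind mentioned above might look like the following sketch. The note is truncated before it states the expected format, so this assumes the common `AxB` or `AxBxC` notation (two or three positive integers joined by `x`). The name `parse_topology` and the accepted number of dimensions are assumptions, not xpk's implementation.

```python
import re

# Assumed format: 2 or 3 positive integers separated by "x", e.g. "4x4" or "2x2x4".
_TOPOLOGY_RE = re.compile(r"[1-9]\d*(x[1-9]\d*){1,2}")


def parse_topology(value):
    """Validate a topology string and return its dimensions as a tuple of ints."""
    if not _TOPOLOGY_RE.fullmatch(value):
        raise ValueError(
            f"invalid topology {value!r}; expected a form such as '4x4' or '2x2x4'"
        )
    return tuple(int(dim) for dim in value.split("x"))
```

Returning the parsed dimensions rather than a boolean lets a caller reuse them, for example to compute the chip count as their product.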
N queues, 1 per slice size, 1 cluster. (This is complicated!)
xpk currently supports one cluster queue / local queue. If an xpk cluster administrator wants to split capacity between different use cases, they would currently have to create separate xpk...