
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.

Results 33 xpk issues

## Fixes / Features
- Enable Cloud DNS on Pathways clusters.
- Add user job conditionally based on whether headless mode is defined or not.
- Print out proxy address...

## Fixes / Features
- remove nccl plugin installation in workload creation on A3
- enable cluster and workload creation on A3+

## Testing / Documentation
Manual tests passed for...

## Fixes / Features
- hot fix for reservation
- enable create workload for h150
- remove nccl plugin installation in workload creation

## Testing / Documentation
Testing details.
-...

## Fixes / Features
- Add podFailurePolicy so SIGTERMed Pathways workers do not count against the default backoffLimit (4).

## Testing / Documentation
TODO
- [ y ] Tests pass...

Improvements:
1. Use enum for accelerator type
2. remove usage of device type based on --tpu-type / --device-type check everywhere. Do this in one place.
3. Remove h100 device specific...

## Fixes / Features
- Remove default values for proxy-server and server images.
- Ensure user provides both proxy-server-image and server-image when --use-pathways is set and vice-versa.
- Validate that...
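The both-or-neither rule described in this entry can be sketched with `argparse`. This is only an illustration under stated assumptions, not xpk's actual code: the helper names `build_parser` and `validate_pathways_args` are hypothetical, and "vice-versa" is read here as "image flags may only be passed together with `--use-pathways`".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the three flags named in the entry are modeled."""
    parser = argparse.ArgumentParser(prog="xpk-sketch")
    parser.add_argument("--use-pathways", action="store_true")
    # No default values: the user must supply both images explicitly.
    parser.add_argument("--proxy-server-image", default=None)
    parser.add_argument("--server-image", default=None)
    return parser


def validate_pathways_args(args: argparse.Namespace) -> None:
    """Raise ValueError unless the Pathways flags are used consistently."""
    images = (args.proxy_server_image, args.server_image)
    if args.use_pathways and not all(images):
        raise ValueError(
            "--use-pathways requires both --proxy-server-image and --server-image"
        )
    # Assumed meaning of "vice-versa": images without --use-pathways are rejected.
    if not args.use_pathways and any(images):
        raise ValueError(
            "--proxy-server-image / --server-image require --use-pathways"
        )
```

Calling `validate_pathways_args(build_parser().parse_args())` right after parsing would surface the error before any cluster or workload request is made.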

## Fixes
- Add gpu_multi_process_run.sh

## Testing / Documentation
Testing details.
- [ y ] Tests pass
- [ y ] Appropriate changes to documentation are included in the PR

## Fixes / Features
- Supports tpu-topology flag for specifying custom topologies for TPUs

## Testing / Documentation
Added a check to make sure the format of topologies fits...
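A format check of the kind this entry mentions might look like the sketch below. The exact format xpk enforces is cut off in the preview, so the pattern here is an assumption: two or three positive integers joined by `x` (for example `4x4` or `2x2x4`). The function name `parse_tpu_topology` is hypothetical.

```python
import re

# Assumed format: 2 or 3 positive integers separated by "x", e.g. "4x4" or "2x2x4".
_TOPOLOGY_RE = re.compile(r"[1-9]\d*(x[1-9]\d*){1,2}")


def parse_tpu_topology(value: str) -> tuple[int, ...]:
    """Return the topology dimensions as integers, or raise ValueError."""
    if not _TOPOLOGY_RE.fullmatch(value):
        raise ValueError(
            f"invalid topology {value!r}: expected a form like '4x4' or '2x2x4'"
        )
    return tuple(int(dim) for dim in value.split("x"))
```

Returning the dimensions as a tuple rather than a boolean lets the caller derive the chip count from the same parsed value instead of re-splitting the string.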

N queues, 1 per slice size, 1 cluster. (This is complicated!)

xpk currently supports one cluster queue / local queue. If an xpk cluster administrator wants to split capacity between different use cases, they would currently have to create separate xpk...