nodeSelector set by default requires tpu provisioner

Open samos123 opened this issue 1 year ago • 0 comments

These are the nodeSelectors that got added:

Node-Selectors:              cloud.google.com/gke-accelerator-count=4
                             cloud.google.com/gke-spot=true
                             cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice
                             cloud.google.com/gke-tpu-topology=16x16
                             provisioner-nodepool-id=stoelinga-8733bd

This was my launch job:

export BASTION_TIER=1
axlearn gcp gke start --instance_type=tpu-v5litepod-256 --num_replicas=1 \
        --cluster=v5e-256-bodaborg-us-west4 --bundler_spec=allow_dirty=True \
        --bundler_type=artifactregistry --bundler_spec=image=tpu \
        --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
        -- python3 -c "'import jax; print(jax.devices())'"

Expectation: The job should not get the selector provisioner-nodepool-id=stoelinga-8733bd by default, since that selector assumes the TPU provisioner is always in use. That may not be the case for external users, whose clusters would then have no node matching it.
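
To make the expected behavior concrete, here is a minimal Python sketch of an opt-in selector. It is not axlearn's actual code: the function `build_node_selector`, its `use_tpu_provisioner` parameter, and the `nodepool_id` argument are hypothetical names used only for illustration. Only the selector keys and values come from the output above.

```python
from typing import Dict

# Selector key copied from the Node-Selectors output in this issue.
PROVISIONER_KEY = "provisioner-nodepool-id"


def build_node_selector(
    base: Dict[str, str],
    *,
    use_tpu_provisioner: bool,
    nodepool_id: str = "",
) -> Dict[str, str]:
    """Return a copy of `base`, adding the provisioner selector only on opt-in.

    Hypothetical helper: it illustrates the requested behavior and does not
    mirror any real axlearn function.
    """
    selector = dict(base)  # Copy so the caller's dict is not mutated.
    if use_tpu_provisioner and nodepool_id:
        selector[PROVISIONER_KEY] = nodepool_id
    else:
        # Without the provisioner, no node carries this label, so drop it.
        selector.pop(PROVISIONER_KEY, None)
    return selector


# Selectors from the issue, minus the provisioner-specific one.
base_selector = {
    "cloud.google.com/gke-accelerator-count": "4",
    "cloud.google.com/gke-spot": "true",
    "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
    "cloud.google.com/gke-tpu-topology": "16x16",
}

without_provisioner = build_node_selector(base_selector, use_tpu_provisioner=False)
with_provisioner = build_node_selector(
    base_selector, use_tpu_provisioner=True, nodepool_id="stoelinga-8733bd"
)
```

With this shape, clusters that pre-create their TPU node pools would get only the four cloud.google.com selectors, and the provisioner-specific one would appear only when explicitly requested.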

samos123, Aug 02 '24 23:08