Sam Stoelinga
Upgrade the Dockerfile to use Ubuntu 22.04 instead of 20.04. The current image has an old version of curl, which is missing the `--fail-with-body` flag, and also has several high...
Some pods receive a lot more requests than other pods. For example, in the picture, pod/llama-3-8b-instruct-vllm-7d884cfcd9-7ngdr is running 143 requests and has 582 pending requests. However, pod/llama-3-8b-instruct-vllm-7d884cfcd9-fdrm5 often...
``` 2024/07/05 07:03:53 sending error response: 400: unable to parse model: unmarshal json: unexpected end of JSON ``` The above message was seen and I suspect it's due to health...
This helped increase v5e-512 performance from 50% MFU to 58% MFU. The gains could be more significant when going beyond 2 slices. The majority of Google TPU benchmarks ran on GKE...
This fixes #621
Currently axlearn either adds a nodeSelector for spot=true or it adds a nodeSelector for reservation: ``` if tier == "0" and cfg.reservation is not None: logging.info("Found tier=%s in env. Using...
These are the nodeSelectors that got added: ``` Node-Selectors: cloud.google.com/gke-accelerator-count=4 cloud.google.com/gke-spot=true cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice cloud.google.com/gke-tpu-topology=16x16 provisioner-nodepool-id=stoelinga-8733bd ``` This was my launch job: ``` export BASTION_TIER=1 axlearn gcp gke start --instance_type=tpu-v5litepod-256 --num_replicas=1 \...
Testing here until GHA is enabled on axlearn: https://github.com/samos123/axlearn/pull/1 The GHA workflows are working, except that clip_test.py seems to be broken, though I suspect it's broken on CircleCI too. I would prefer...