Sam Stoelinga

Results 133 issues of Sam Stoelinga

Upgrade the Dockerfile to use Ubuntu 22.04 instead of 20.04. The current image has an old version of curl, which is missing the `--fail-with-body` flag, and also has several high...

Some pods receive a lot more requests than other pods. ![image](https://github.com/substratusai/lingo/assets/388784/4c2a75df-7ee1-4d5f-b02c-2c763e0de2ee) For example, in the picture, pod/llama-3-8b-instruct-vllm-7d884cfcd9-7ngdr is running 143 requests and has 582 pending requests. However, pod/llama-3-8b-instruct-vllm-7d884cfcd9-fdrm5 often...

```
2024/07/05 07:03:53 sending error response: 400: unable to parse model: unmarshal json: unexpected end of JSON
```
The above message was seen and I suspect it's due to health...

This helped increase v5e-512 performance from 50% MFU to 58% MFU. The gains could be more significant when going beyond 2 slices. The majority of Google TPU benchmarks ran on GKE...

Currently axlearn either adds a nodeSelector for spot=true or it adds a nodeSelector for reservation: ``` if tier == "0" and cfg.reservation is not None: logging.info("Found tier=%s in env. Using...
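
The either/or behavior described above can be sketched as follows. This is a minimal illustration, not axlearn's actual code: `build_node_selector` is a hypothetical helper, the reservation label key is an assumption, and the fallback to spot when the condition fails is inferred from the "either ... or" wording rather than from the truncated snippet.

```python
from typing import Dict, Optional

# The spot label key appears in the nodeSelectors quoted elsewhere in this listing.
SPOT_LABEL = "cloud.google.com/gke-spot"
# Assumption: the reservation label key is chosen here for illustration only.
RESERVATION_LABEL = "cloud.google.com/reservation-name"


def build_node_selector(tier: Optional[str], reservation: Optional[str]) -> Dict[str, str]:
    """Hypothetical helper: pick exactly one of the two nodeSelectors.

    Tier "0" with a configured reservation selects the reservation;
    every other combination falls back to spot capacity (assumed).
    """
    if tier == "0" and reservation is not None:
        return {RESERVATION_LABEL: reservation}
    return {SPOT_LABEL: "true"}


if __name__ == "__main__":
    print(build_node_selector("0", "my-reservation"))
    print(build_node_selector("1", "my-reservation"))
```

Under these assumptions a job can never carry both selectors at once, and any tier other than "0" lands on spot nodes even when a reservation is configured.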

These are the nodeSelectors that got added:
```
Node-Selectors:  cloud.google.com/gke-accelerator-count=4
                 cloud.google.com/gke-spot=true
                 cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice
                 cloud.google.com/gke-tpu-topology=16x16
                 provisioner-nodepool-id=stoelinga-8733bd
```
This was my launch job: ``` export BASTION_TIER=1 axlearn gcp gke start --instance_type=tpu-v5litepod-256 --num_replicas=1 \...

Testing here until GHA is enabled on axlearn: https://github.com/samos123/axlearn/pull/1 The GHA workflows are working, except that clip_test.py seems to be broken, though I suspect it's broken on CircleCI too. I would prefer...