
Intermittent OOM for accelerator testing on tf backend

Open mattdangerw opened this issue 2 years ago • 0 comments

We are occasionally seeing our accelerator testing fail with logs like this:

Step #5 - "create-job": keras_nlp/models/xlm_roberta/xlm_roberta_backbone_test.py::XLMRobertaBackboneTest::test_all_presets SKIPPED [ 66%]
Step #5 - "create-job": keras_nlp/models/xlm_roberta/xlm_roberta_backbone_test.py::XLMRobertaBackboneTest::test_backbone_basics PASSED [ 66%]
Step #5 - "create-job": keras_nlp/models/xlm_roberta/xlm_roberta_backbone_test.py::XLMRobertaBackboneTest::test_saved_model PASSED [ 66%]
Step #5 - "create-job": keras_nlp/models/xlm_roberta/xlm_roberta_backbone_test.py::XLMRobertaBackboneTest::test_session <- ../usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/test_util.py PASSED [ 66%]
Step #5 - "create-job": + sleep 5
Step #5 - "create-job": + gcloud artifacts docker images delete us-west1-docker.pkg.dev/keras-team-test/keras-nlp-test/keras-nlp-image-tensorflow:2abfa211-75b6-48ed-a3ac-a32710d27979
Step #5 - "create-job": Digests:
Step #5 - "create-job": - us-west1-docker.pkg.dev/keras-team-test/keras-nlp-test/keras-nlp-image-tensorflow@sha256:f070f9cbebc195fca9459b840ac6c2344d17d377dfa23e5c9e08673dffedab05
Step #5 - "create-job": 
Step #5 - "create-job": Tags:
Step #5 - "create-job": - us-west1-docker.pkg.dev/keras-team-test/keras-nlp-test/keras-nlp-image-tensorflow:2abfa211-75b6-48ed-a3ac-a32710d27979
Step #5 - "create-job": 
Step #5 - "create-job": This operation will delete the above resources.
Step #5 - "create-job": 
Step #5 - "create-job": Do you want to continue (Y/n)?  
Step #5 - "create-job": Delete request issued.
Step #5 - "create-job": Waiting for operation [projects/keras-team-test/locations/us-west1/operations/f90c6b09-a56b-4528-ad36-652d2a7a3777] to complete...
Step #5 - "create-job": .......done.
Step #5 - "create-job": ++ kubectl get pod/tensorflow-keras-nlp-unit-tests-t4-x1-wgkll-t5xdm -o 'jsonpath={.status.containerStatuses[0].state.terminated.exitCode}'
Step #5 - "create-job": + exit 137

This looks like an OOM: the test pod terminated with exit code 137. We should increase our image size or otherwise fix it.
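For context on why `exit 137` points at an OOM: by shell convention, an exit status above 128 usually means the process was killed by signal number `status - 128`, and 137 - 128 = 9 is `SIGKILL`, which is what the kernel OOM killer sends. A minimal sketch of that decoding, assuming a POSIX system (the helper name `describe_exit_code` is hypothetical, not part of our test scripts):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name where one exists.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

A `SIGKILL` alone does not prove an OOM, since other things can kill a container. If the pod is still around, its `state.terminated.reason` field (next to the `exitCode` queried in the log) should read `OOMKilled` when memory was the cause.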

mattdangerw · Nov 06 '23 23:11