ROCm-docker

rocm/tensorflow is too large for GitLab CI

Open Bengt opened this issue 4 years ago • 2 comments

Since upgrading to rocm/tensorflow:rocm4.0-tf2.4-dev, my pipeline jobs on GitLab.com fail:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937693433 https://gitlab.com/pfasdr/code/decoder/-/jobs/937693435

The relevant error message is:

ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device

As the documentation states, the shared runners on GitLab.com use Google Cloud N1 machine types:

https://docs.gitlab.com/ee/user/gitlab_com/#linux-shared-runners

These have only 3.75 GB of memory and cannot download the Docker image, which is currently 5.39 GB:

https://cloud.google.com/compute/docs/machine-types#n1_machine_types

When I run the jobs on my local machine via a GitLab runner registered as a group runner, they execute as expected:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937751331 https://gitlab.com/pfasdr/code/decoder/-/jobs/937746578

Obviously, running a GitLab runner on one's own machine is cumbersome. To re-enable running in the cloud on GitLab CI, the image should be slimmed down further, to somewhat under 3.75 GB.

Bengt, Dec 30 '20 11:12

As a workaround, I used the rocm/dev-ubuntu-20.04 docker image, installed rccl via apt and then tensorflow-rocm via pip. Here are some successful jobs executing this approach:
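For anyone wanting to try the same workaround, a job definition could look roughly like the sketch below. This is not the exact configuration from the linked jobs. The job name is made up, installing python3-pip is an assumption in case the base image does not ship pip, and the final script line is only a placeholder for the real test commands.

```yaml
# Minimal .gitlab-ci.yml sketch of the workaround described above.
# The job name is hypothetical; adapt it to your pipeline.
test-tensorflow-rocm:
  image: rocm/dev-ubuntu-20.04
  before_script:
    - apt-get update
    # rccl comes from the workaround; python3-pip is an assumption
    # in case the base image does not already provide pip.
    - apt-get install -y rccl python3-pip
    - python3 -m pip install tensorflow-rocm
  script:
    # Placeholder: confirm the package is installed, then run your own tests.
    - python3 -m pip show tensorflow-rocm
```

Depending on the ROCm version in the base image, it may be necessary to pin the image tag and the tensorflow-rocm version so that they match.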

https://gitlab.com/pfasdr/code/decoder/-/jobs/937928162 https://gitlab.com/pfasdr/code/decoder/-/jobs/937928161

Bengt, Dec 30 '20 13:12

I created base images for use in TensorFlow ROCm projects:

https://gitlab.com/pfasdr/mesa/pfasdr_mesa_baseimage/container_registry/1598549

Bengt, Dec 30 '20 15:12