Support Ubuntu 20.04
Ubuntu 20.04 LTS was released on April 23, 2020. It would be nice to support this latest LTS version.
Here's what I needed to do to get version 0.11 working on Ubuntu 20.04:
sudo apt install libncurses5 libtinfo5
So maybe just adding that to the installation instructions for now would be a good start. Updating the code to support the newer libs would be another option.
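For anyone who wants to check this before launching the toolchain, here is a minimal Python sketch that reports which of the two libraries the dynamic loader can't open. The sonames `libncurses.so.5` and `libtinfo.so.5` are my assumption about what the `libncurses5` and `libtinfo5` packages provide, and `missing_libraries` is just an illustrative helper name.

```python
import ctypes


def _can_load(soname):
    """Return True if the dynamic loader can open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


def missing_libraries(sonames, can_load=_can_load):
    """Return the sonames that cannot be loaded, in input order."""
    return [name for name in sonames if not can_load(name)]


# Assumed sonames for the libncurses5 / libtinfo5 packages.
REQUIRED = ["libncurses.so.5", "libtinfo.so.5"]

if __name__ == "__main__":
    missing = missing_libraries(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required libraries load.")
```

If anything is reported missing, the `apt install` line above is the fix that worked for me.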
It seems the Python support also doesn't work on 20.04 because it's looking for libpython3.6m.so.1.0. 20.04 comes with Python 3.8.2 and there's no easy way to get Python 3.6.
Can you tell me what specifically you did to encounter this problem, so that I can make sure that the ubuntu20.04 builds don't have this problem?
Tried running swift-jupyter as described here.
When starting the kernel, I saw errors like:
[I 09:42:54.199 NotebookApp] Kernel started: 1a8e1196-b812-4582-9bf8-e42fe72ef654, name: swift
Traceback (most recent call last):
File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 35, in <module>
import _lldb
ModuleNotFoundError: No module named '_lldb'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/garymm/src/swift-jupyter/swift_kernel.py", line 19, in <module>
import lldb
File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 38, in <module>
from . import _lldb
ImportError: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
[I 09:42:57.200 NotebookApp] KernelRestarter: restarting kernel (1/5), new random ports
Traceback (most recent call last):
File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 35, in <module>
import _lldb
ModuleNotFoundError: No module named '_lldb'
I think the Python 3.6 vs 3.8 issue was a symptom of me trying to use a release that was built on Ubuntu 18.04 on 20.04.
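A quick way to spot this kind of mismatch is to compare the Python version baked into the library name from the ImportError with the interpreter that is actually running. The sketch below assumes the soname follows the `libpythonX.Y...` pattern seen in the traceback; both helper names are made up for illustration.

```python
import re
import sys


def python_version_from_soname(soname):
    """Extract (major, minor) from a name like 'libpython3.6m.so.1.0'."""
    match = re.search(r"libpython(\d+)\.(\d+)", soname)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_running_interpreter(soname, running=None):
    """True if the soname's Python version equals the interpreter's."""
    if running is None:
        running = sys.version_info[:2]
    return python_version_from_soname(soname) == tuple(running)


if __name__ == "__main__":
    needed = "libpython3.6m.so.1.0"  # taken from the ImportError above
    print("needs:", python_version_from_soname(needed))
    print("running:", tuple(sys.version_info[:2]))
    print("match:", matches_running_interpreter(needed))
```

On a stock 20.04 install this would report a mismatch, which is consistent with the 18.04 build expecting Python 3.6.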
I built the toolchain from source and got a build to succeed on 20.04 with CUDA 11.0 and cuDNN 8.0.2. The only real bug I had to fix is described here: https://groups.google.com/a/tensorflow.org/g/swift/c/RUlBncvPRfE
I made some progress: https://github.com/tensorflow/swift/pull/535
I'm still waiting on https://gitlab.com/nvidia/container-images/cuda/-/issues/83 before I can add cuda toolchains for ubuntu 20.04.
@marcrasi toolchains have been updated!
I tried to make a CUDA build for Ubuntu 20.04, but there is still a small blocker: the version of TF that we use (2.3) supports CUDA 11.0 but not CUDA 11.1, and NVIDIA publishes Docker images for Ubuntu 20.04 with CUDA 11.1 but not CUDA 11.0.
I'm not sure if TF 2.4 supports CUDA 11.1, but I'll try again once we upgrade to TF 2.4 (which we're trying to do soon).
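To spell out the blocker, here is a tiny Python sketch of the check: a CUDA build is only possible for a version that is both supported by our TF and available as a base image. The two sets simply restate what I wrote above and are not checked against any official compatibility matrix; `usable_cuda_versions` is a hypothetical helper.

```python
def usable_cuda_versions(supported_by_tf, available_images):
    """Return CUDA versions present in both collections, sorted."""
    return sorted(set(supported_by_tf) & set(available_images))


# Values as described in this comment (assumptions, not verified here).
TF_2_3_SUPPORTED = {"11.0"}
UBUNTU_20_04_IMAGES = {"11.1"}

if __name__ == "__main__":
    usable = usable_cuda_versions(TF_2_3_SUPPORTED, UBUNTU_20_04_IMAGES)
    print(usable if usable else "no usable CUDA version")
```

With those inputs the intersection is empty, which is exactly why the build is stuck until either side changes.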
@marcrasi It's my understanding that 2.4 is the first release that officially supports CUDA 11.0 (https://github.com/tensorflow/tensorflow/releases/tag/v2.4.0), so I'm not sure how you got 11.0 working in the first place (a pull from master?). CUDA 11.1 is the release that supports the new Ampere consumer cards (11.0 only covers the A100 series), so it would be nice to have that in particular (https://github.com/tensorflow/tensorflow/issues/44750). 11.2 is already out as well!
Also, @texasmichelle, you might run this and look at the logs being spit out:
export GPU_TYPE="a100"
export ZONE="us-central1-a"
gcloud compute instances create s4tf-ubuntu-${GPU_TYPE} \
--zone=${ZONE} \
--image-project=deeplearning-platform-release \
--image-family=swift-latest-gpu-ubuntu-1804 \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-${GPU_TYPE},count=1" \
--metadata="install-nvidia-driver=True" \
--machine-type=a2-highgpu-1g \
--boot-disk-size=256GB
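If you'd rather script this, here is a sketch that assembles the same command as an argument list in Python. `build_create_command` is a hypothetical helper; it only builds the list from the flags shown above and does not call gcloud itself.

```python
import shlex


def build_create_command(gpu_type="a100", zone="us-central1-a"):
    """Assemble the gcloud invocation from this comment as an argv list."""
    return [
        "gcloud", "compute", "instances", "create",
        f"s4tf-ubuntu-{gpu_type}",
        f"--zone={zone}",
        "--image-project=deeplearning-platform-release",
        "--image-family=swift-latest-gpu-ubuntu-1804",
        "--maintenance-policy=TERMINATE",
        f"--accelerator=type=nvidia-tesla-{gpu_type},count=1",
        "--metadata=install-nvidia-driver=True",
        "--machine-type=a2-highgpu-1g",
        "--boot-disk-size=256GB",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line instead of running it.
    print(" ".join(shlex.quote(arg) for arg in build_create_command()))
```

To actually create the instance you could pass the list to `subprocess.run(cmd, check=True)`, assuming gcloud is installed and your project has quota.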
@brettkoonce Can you share what you're seeing? I'm getting a warning about disk size, but otherwise that command seems to be working. Are you running in a project that has quota?
Or are you pointing this out as an example of a toolchain running with cuda 11 support?
@texasmichelle I was seeing some weird errors when running swift-models (e.g. lenet-mnist), but in retrospect I think what's going on is that you packaged the CUDA 10.2 version with your deep learning build. After pulling the CUDA 11 build (e.g. swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz) everything works fine. It might be worth considering moving to 11.0 going forward. Still seeing https://github.com/tensorflow/swift-models/issues/704, fwiw.
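Since the CUDA version is encoded in the archive name, a small Python sketch like the one below can tell you which build you pulled. The pattern is inferred from the single file name in this comment, so treat it as an assumption; other archives (for example CPU-only ones) may be named differently, and `parse_toolchain_name` is a made-up helper.

```python
import re

# Pattern inferred from one example file name; not an official scheme.
_VERSION = r"\d+(?:\.\d+)*"
_TOOLCHAIN_RE = re.compile(
    rf"swift-tensorflow-RELEASE-(?P<release>{_VERSION})"
    rf"-cuda(?P<cuda>{_VERSION})"
    r"-cudnn(?P<cudnn>\d+)"
    rf"-ubuntu(?P<ubuntu>{_VERSION})\.tar\.gz$"
)


def parse_toolchain_name(filename):
    """Return a dict of version fields, or None if the name doesn't fit."""
    match = _TOOLCHAIN_RE.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    name = "swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz"
    print(parse_toolchain_name(name))
```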
Ah, I see what you mean. I also tried using --image-family=swift-latest-cu110-ubuntu-1804, which seems fine on the tensorflow-0.12 branch of swift-models. However, I can see that the 0.12 release hasn't made it into the images yet. There's currently a code freeze for the holidays, but I'll see if I can get a more precise date on the next release. I submitted the change a few weeks ago, so I believe the code is ready otherwise.
@brettkoonce You can expect to see DLVMs with v0.12 right after the freeze, e.g. by Jan. 8.
I also verified that cuda 11.0 is included in the existing toolchain and will remain going forward.
1 week ago =>
Ubuntu20.04 x86_64 cudnn images have been pushed! Having an issue with arm64 and ppc64le builds though. Will close this once those are released.
So could we get an Ubuntu build precompiled with CUDA (preferably version 11.1, for Ampere support :D [nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04]), or do we still need to wait for 11.1 support in the master TensorFlow repo?