
gRPC is incompatible with TF2 if TF2 was built from source

Open treeeeke opened this issue 4 years ago • 6 comments

Hello, I have tried to use the seed_rl framework on a local machine:

- Ubuntu 18.04.5 LTS
- AMD Ryzen 9 3900X (12-core)
- GeForce RTX 3080/PCIe/SSE2 (Gigabyte)

I installed NVIDIA driver 455, CUDA 11.1 and cuDNN 8.0.4 according to the instructions from the official NVIDIA website.

As far as I can tell, there are no TensorFlow 2 builds compatible with CUDA 11.1 at the moment. So I built TF2 from source using one of the latest tf-nightly commits (tf-nightly==2.5.0.dev20201025, tf.git_version == v1.12.1-44562-g33335ad96). After building TF2, tf.test.is_gpu_available() returns True, simple TF2 convnet examples work, and the tests from the cuDNN samples pass as well.
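A minimal sanity check for a from-source GPU build looks like this (standard TF APIs only; the exact outputs depend on the wheel that is actually installed):

```python
# Confirm the interpreter picks up the from-source build and sees the GPU.
import tensorflow as tf

print(tf.__version__)                           # e.g. 2.5.0-dev...
print(tf.version.GIT_VERSION)                   # should match the commit the wheel was built from
print(tf.config.list_physical_devices("GPU"))   # should list the RTX 3080
print(tf.test.is_gpu_available())               # deprecated in TF2, but still works
```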

I cannot start training in Docker with my TF2 built from source, because I cannot install my wheel while building the Docker image – it raises the following error: “wheel is not supported on this platform”. So I start training locally (local training without Docker works perfectly fine with CPU only and tf==2.2); however, when using seed_rl with the GPU and TF2 built from source, it raises an error:

tensorflow.python.framework.errors_impl.NotFoundError: /home/nono/PycharmProjects/seed_rl/grpc/python/../grpc_cc.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb
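As an aside on the earlier Docker-build failure: “wheel is not supported on this platform” usually means the wheel's filename tags do not match what pip in the container accepts. A hedged sketch for checking this (requires the `packaging` package, version 20.9 or newer; the wheel filename below is only an illustration, not the actual file from this issue):

```python
# Compare the wheel's platform tags with the tags pip accepts in this environment.
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename

wheel = "tf_nightly-2.5.0.dev20201025-cp36-cp36m-linux_x86_64.whl"  # illustrative name
_, _, _, wheel_tags = parse_wheel_filename(wheel)

supported = set(sys_tags())
print("wheel tags:   ", [str(t) for t in wheel_tags])
print("accepted here:", any(t in supported for t in wheel_tags))
```

The undefined-symbol error above is a separate issue, picked up next.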

Because of that error, I built grpc using the builder from the seed_rl repository (with tf-nightly==2.5.0.dev20201025 as the TensorFlow version); I just adapted the script a little according to https://github.com/google-research/seed_rl/issues/14, and it built successfully and passed the tests. However, I get the same error with this grpc build and my TF2 built from source (from the same commit as tf-nightly==2.5.0.dev20201025). I want to mention that if I use tf-nightly==2.5.0.dev20201025 as my local TF2 (instead of the one built from source), grpc works fine (however, it is not compatible with CUDA 11.1 and therefore with my GPU).
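One way to narrow down an undefined-symbol failure like this is to check whether the TensorFlow install that loads grpc_cc.so was built from the expected commit and actually exports the missing symbol. This is only a sketch: it assumes `nm` is available and that the framework library is named libtensorflow_framework.so.2 (true for TF2 pip installs; a from-source install may differ):

```python
# Print build provenance of the installed TF and look for the missing symbol.
import subprocess
import tensorflow as tf

print("TF version:   ", tf.version.VERSION)
print("TF git commit:", tf.version.GIT_VERSION)
print("Compile flags:", tf.sysconfig.get_compile_flags())

symbol = "_ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb"
lib = tf.sysconfig.get_lib() + "/libtensorflow_framework.so.2"

out = subprocess.run(
    ["bash", "-c", f"nm -D --defined-only {lib} | grep -c {symbol}"],
    capture_output=True, text=True,
)
print("symbol exported:", out.stdout.strip() not in ("", "0"))
```

If the symbol is not exported by the installed libraries, grpc_cc.so was compiled against a TensorFlow build with a different ABI, which would match the from-source vs. pip mismatch described above.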

In conclusion: my problem is that the RTX 3080 is only compatible with CUDA 11.1, a TF2 compatible with CUDA 11.1 can currently only be built from source, yet TF2 built from source doesn't work with any grpc I built (even one built in a container with tf-nightly built from the same commit as my TF2 build).


I also tried CUDA 10.1 and CUDA 11.0. With CUDA 10.1, seed_rl starts very slowly (this can be partially fixed with CUDA_CACHE_MAXSIZE=2147483648) and gives very strange results: it looks like something is wrong with the computations, because models trained on the GPU achieve different (usually very bad) results compared to those trained on the CPU, and sometimes the GPU results simply look wrong. With CUDA 11.0, TensorFlow gives a lot of warnings and then crashes with an “out of memory” message.
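For the suspicious GPU results under CUDA 10.1, a quick numerical sanity check (not from the issue, just a generic comparison of the same op on CPU and GPU) can at least tell whether basic kernels already disagree:

```python
# Run the same matmul on CPU and GPU and compare; a large difference would
# point at broken GPU computation rather than at seed_rl itself.
import numpy as np
import tensorflow as tf

x = tf.random.normal([512, 512], seed=0)
w = tf.random.normal([512, 512], seed=1)

with tf.device("/CPU:0"):
    cpu_out = tf.matmul(x, w)
with tf.device("/GPU:0"):
    gpu_out = tf.matmul(x, w)

print("max abs diff:", np.max(np.abs(cpu_out.numpy() - gpu_out.numpy())))
```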

grpc_dockerfile.txt

treeeeke avatar Oct 25 '20 19:10 treeeeke

" just adapted script a little according to #14 and it was built successfully and passed tests. " <- it passed the grpc tests? based on building with tf-nightly==2.5.0.dev20201025?

lespeholt avatar Oct 28 '20 06:10 lespeholt

How did you make it build on tf-nightly?

Try changing the base image of Dockerfile.grpc to nightly-custom-op-gpu-ubuntu16.

lespeholt avatar Oct 28 '20 06:10 lespeholt

Hello. I have already used "nightly-custom-op-gpu-ubuntu16", as you can see in the file "grpc_dockerfile.txt" attached to my issue.

treeeeke avatar Nov 04 '20 15:11 treeeeke

" just adapted script a little according to #14 and it was built successfully and passed tests. " <- it passed the grpc tests? based on building with tf-nightly==2.5.0.dev20201025?

Yes, it passed the grpc tests with the grpc_compile image based on nightly-custom-op-ubuntu16, and I installed tf-nightly==2.5.0.dev20201025 with "RUN pip3 install tf-nightly==2.5.0.dev20201025". I think you can reproduce it using the grpc_dockerfile I attached to the issue.

treeeeke avatar Nov 04 '20 15:11 treeeeke

Still not working.

mooihock avatar Nov 14 '20 15:11 mooihock

It seems to work for me:

1. Use your docker file for Dockerfile.grpc
2. ./grpc/build.sh (to update grpc_cc.so and service_pb2.py)
3. Add RUN pip3 install tf-nightly==2.5.0.dev20201025 in e.g. Dockerfile.atari
4. ./run_local.sh atari r2d2 4
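After these steps, one quick way to confirm the rebuilt extension links against the installed TensorFlow (the path below is taken from the error message earlier in this thread and may differ on other machines):

```python
# Load the rebuilt shared object directly; if the ABI still mismatches, this
# raises the same NotFoundError / undefined-symbol error as before.
import tensorflow as tf

lib = tf.load_op_library("/home/nono/PycharmProjects/seed_rl/grpc/grpc_cc.so")
print("loaded:", lib)
```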

lespeholt avatar Nov 17 '20 11:11 lespeholt

@treeeeke "With CUDA 11.0 tensorflow gives a lot of warnings and then crushing with “out of memory” message."

I'm having the same problem, did you find a solution?

YHL04 avatar Nov 10 '22 15:11 YHL04

@mooihock @treeeeke Is it possible that nightly-custom-op-gpu-ubuntu16 is not compatible with Ubuntu 18?

YHL04 avatar Nov 13 '22 17:11 YHL04