
Is there a way to build the custom op on my own Ubuntu 16.0 machine with tensorflow-gpu==1.15 and cuda-8.0?

Open BruceDai003 opened this issue 5 years ago • 2 comments

I know the project description says that one has to download the provided docker image in order to build the custom op successfully. I first tried 'make' on my own machine with tf 1.15, and it only works for the ZeroOut CPU op. What I really want is to build a GPU op on my own machine with tf 1.15. I also tried 'make' inside your provided docker image to build the TimeTwo GPU op, but it failed, complaining about an unknown -fPIC flag:

root@5f0d055f7746:/custom-op# make time_two_op
nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -DNDEBUG --expt-relaxed-constexpr
nvcc fatal   : Unknown option 'fPIC'
Makefile:35: recipe for target 'tensorflow_time_two/python/ops/_time_two_ops.cu.o' failed
make: *** [tensorflow_time_two/python/ops/_time_two_ops.cu.o] Error 1

So I removed the -fPIC flag. (Note that there are two -fPIC flags in the nvcc command; I had to remove both to make this error disappear.) I copied the command, removed the -fPIC flags, and ran it, which gave another error:

root@5f0d055f7746:/custom-op# nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -DNDEBUG --expt-relaxed-constexpr
In file included from tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc:21:0:
/usr/local/lib/python3.6/dist-packages/tensorflow/include/tensorflow/core/util/gpu_kernel_helper.h:22:53: fatal error: third_party/gpus/cuda/include/cuda_fp16.h: No such file or directory
compilation terminated.
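
On the -fPIC error above: nvcc rejects a bare -fPIC because it is a host-compiler option, and accepts it only when it is forwarded with -Xcompiler. Deleting the second -fPIC also changes the meaning of the command, since what follows -Xcompiler is then -DNDEBUG. A minimal sketch of an alternative, assuming bash: drop only the bare -fPIC and keep the `-Xcompiler -fPIC` pair. The function name `nvcc_safe_flags` is hypothetical and not part of this repo's Makefile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the given flags, dropping any bare -fPIC while
# keeping -fPIC when it directly follows -Xcompiler (nvcc forwards that one
# to the host compiler).
nvcc_safe_flags() {
  local prev="" flag
  local out=()
  for flag in "$@"; do
    if [ "$flag" = "-fPIC" ] && [ "$prev" != "-Xcompiler" ]; then
      prev="$flag"
      continue
    fi
    out+=("$flag")
    prev="$flag"
  done
  printf '%s\n' "${out[*]}"
}

# Example with a shortened version of the flags from the failing command.
nvcc_safe_flags -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -O2 -x cu -Xcompiler -fPIC -DNDEBUG
# prints: -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -x cu -Xcompiler -fPIC -DNDEBUG
```

The filtered flags could then be passed to nvcc in place of the original list; whether the rest of the build succeeds still depends on the header issue described next.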

It seems that gpu_kernel_helper.h includes the cuda_fp16.h header. I couldn't find this file in the docker image, but found copies on my host machine, so I copied one into the directory third_party/gpus/cuda/include (which I had to create, since there was no gpus/cuda/include directory under third_party).

root@5f0d055f7746:/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include# ll
total 112
drwxr-sr-x 2 root staff     25 Aug  7 06:42 ./
drwxr-sr-x 3 root staff     21 Aug  7 06:42 ../
-rw-r--r-- 1 root staff 114479 Aug  7 06:42 cuda_fp16.h

Now running the nvcc command again:

root@5f0d055f7746:/custom-op# nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -DNDEBUG --expt-relaxed-constexpr
In file included from /usr/local/lib/python3.6/dist-packages/tensorflow/include/tensorflow/core/util/gpu_kernel_helper.h:25:0,
                 from tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc:21:
/usr/local/lib/python3.6/dist-packages/tensorflow/include/tensorflow/core/util/gpu_device_functions.h:34:53: fatal error: third_party/gpus/cuda/include/cuComplex.h: No such file or directory
compilation terminated.

This file is in the same directory on my host machine. In case other headers were needed as well, I copied the whole contents of that directory into the required directory in the docker image.

Now I get at least 100 errors (too many to paste in full, so here are some of them for reference):

root@5f0d055f7746:/custom-op# nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -DNDEBUG --expt-relaxed-constexpr
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(129): error: invalid redeclaration of type name "__half"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(96): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(140): error: invalid redeclaration of type name "__half2"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(100): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(155): error: cannot overload functions distinguished by return type alone

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(183): error: cannot overload functions distinguished by return type alone

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(198): error: cannot overload functions distinguished by return type alone

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(213): error: cannot overload functions distinguished by return type alone

...
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(166): error: class "__half" has no member "__x"
...
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(360): error: cannot overload functions distinguished by return type alone
...
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(552): error: no suitable user-defined conversion from "__half2" to "__half2" exists
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2053): error: invalid redeclaration of type name "half"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(103): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2054): error: invalid redeclaration of type name "half2"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(104): here
/usr/local/lib/python3.6/dist-packages/tensorflow/include/unsupported/Eigen/CXX11/../../../Eigen/src/Core/arch/Default/TypeCasting.h(25): error: no suitable user-defined conversion from "__half" to "Eigen::half" exists
/usr/local/lib/python3.6/dist-packages/tensorflow/include/unsupported/Eigen/CXX11/../../../Eigen/src/Core/arch/GPU/PacketMath.h(1005): error: no suitable user-defined conversion from "__half2" to "half2" exists

Error limit reached.
100 errors detected in the compilation of "/tmp/tmpxft_000000d7_00000000-6_time_two_kernels.cu.cpp1.ii".
Compilation terminated.

The errors all seem to relate to the files I just copied, but I don't know what's wrong or how to solve it.
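
One likely cause, stated as an assumption rather than a confirmed diagnosis: the headers copied from the host belong to a different CUDA version than the toolkit inside the image, so nvcc sees two incompatible definitions of `__half` (the log names both /usr/local/cuda/.../cuda_fp16.h and the copied third_party/gpus/cuda/include/cuda_fp16.h). A minimal sketch of a workaround, assuming bash and that the image's toolkit lives under /usr/local/cuda as the log suggests: make the expected third_party/gpus/cuda/include path a symlink to the image's own headers instead of copying files. The function name `link_cuda_headers` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make <tf_include>/third_party/gpus/cuda/include resolve
# to the CUDA headers of the toolkit that nvcc itself uses.
#   $1: TensorFlow include directory
#   $2: CUDA toolkit root as an absolute path (assumed to contain include/)
link_cuda_headers() {
  local tf_include="$1" cuda_home="$2"
  local target="$tf_include/third_party/gpus/cuda"
  if [ ! -d "$cuda_home/include" ]; then
    echo "no include directory under $cuda_home" >&2
    return 1
  fi
  mkdir -p "$target"
  # Remove a previously copied directory or a stale link before linking.
  rm -rf "$target/include"
  ln -s "$cuda_home/include" "$target/include"
}

# Usage inside the container (paths taken from the log above; not run here):
# link_cuda_headers /usr/local/lib/python3.6/dist-packages/tensorflow/include /usr/local/cuda
```

Because the link points at the same headers nvcc already picks up, each header exists in only one version on the include path.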

Then I turned to bazel and finally built the TimeTwo GPU op successfully inside the tensorflow/tensorflow:custom-op-gpu-ubuntu16 docker image. But I want to adapt the bazel setup so that I can build a simple GPU custom op on Ubuntu 16.0 with tf==1.15. I am not very familiar with bazel and am still learning it. I don't understand why building a simple GPU op is so complicated. :(

BruceDai003, Aug 07 '20 06:08

In your Makefile, the command to build the TimeTwo GPU op is very simple: a single nvcc command. But building with bazel needs several other folders, e.g. 'gpu', 'tf', 'third_party', which makes it quite difficult for me to learn how to build with bazel on a different machine. Is there a simple way to use nvcc to build the GPU op?
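
For context, a plain nvcc build is usually two steps: nvcc compiles the .cu.cc kernel into an object file, then g++ links that object with the remaining C++ sources into one shared library. A minimal sketch, assuming bash; the helper `print_gpu_op_build` and the file names in the example call are hypothetical, and the helper only prints the commands rather than running them. Whether these commands succeed outside the docker image is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the two build steps for a GPU op.
#   $1: CUDA kernel source (.cu.cc)   $2: remaining C++ sources (space separated)
#   $3: output shared library         $4: compile flags   $5: link flags
print_gpu_op_build() {
  local cu_src="$1" cc_srcs="$2" out_so="$3" cflags="$4" lflags="$5"
  local cu_obj="${cu_src%.cc}.o"
  # Step 1: nvcc compiles the kernel; host-only flags go through -Xcompiler.
  printf '%s\n' "nvcc -std=c++11 -c -o $cu_obj $cu_src $cflags -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC --expt-relaxed-constexpr"
  # Step 2: g++ links the kernel object with the op sources into one library.
  printf '%s\n' "g++ -std=c++11 -shared -o $out_so $cc_srcs $cu_obj $cflags -fPIC -lcudart $lflags"
}

# Flags would normally come from the installed TensorFlow, for example:
#   python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))'
#   python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))'
# The paths and file names below are placeholders.
print_gpu_op_build my_kernels.cu.cc "my_kernels.cc my_ops.cc" _my_ops.so \
  "-I/path/to/tf/include" "-L/path/to/tf -ltensorflow_framework"
```

If libcudart is not on the default linker path, the g++ step would also need a -L option pointing at the CUDA library directory.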

BruceDai003, Aug 07 '20 07:08

I have the same problem: the cuda_fp16.h file is missing (I installed TensorFlow via pip). I also checked the TensorFlow master branch, and there is no such file in the third_party directory there either.

leimao, Sep 30 '20 00:09