
Missing symbols for custom op registration

Open s1ddok opened this issue 4 years ago • 3 comments

This issue is a continuation of a Twitter discussion with @dan-zheng.

What I'm trying to do is introduce a custom op and use it with S4TF. For simplicity, I'm following the official TF guide on custom op creation and compiling a sample op (zero_out) from here.

I tried to do it with different variations:

1. xcrun --toolchain swift-tensorflow-RELEASE-0.7 clang++ -shared  ops/zero_out_ops.cc kernels/zero_out_kernels.cc -fPIC -O2 -undefined dynamic_lookup -o zero_out.so -I/Library/Python/2.7/site-packages/tensorflow_core/include -std=c++11 -ltensorflow -L/Library/Developer/Toolchains/swift-tensorflow-RELEASE-0.7.xctoolchain/usr/lib/swift/macosx

2. xcrun --toolchain swift-tensorflow-RELEASE-0.7 clang++ -shared  ops/zero_out_ops.cc kernels/zero_out_kernels.cc -fPIC -O2 -undefined dynamic_lookup -o zero_out.so -I/Library/Python/2.7/site-packages/tensorflow_core/include -std=c++11

3. g++ -shared  ops/zero_out_ops.cc kernels/zero_out_kernels.cc -fPIC -O2 -undefined dynamic_lookup -o zero_out.so -I/Library/Python/2.7/site-packages/tensorflow_core/include -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0

In all cases it compiles successfully, and the zero_out.so artifact loads successfully via Python TF 2.1.0-rc1 (the version whose headers I use, since it was used to compile the 0.7 toolchain).

Next, I try to load it from the Swift side with this simple code:

import CTensorFlow
import TensorFlow
import Foundation

let tensor = Tensor<Float>(shape: [1, 1], scalars: [0])
let path = "/Users/avolodin/Downloads/custom-op-master/tensorflow_zero_out/cc/zero_out.so"
let status: OpaquePointer = TF_NewStatus()

// A Swift String passed to a `const char *` parameter is bridged to a
// NUL-terminated C string, so no manual Data conversion is needed.
let libraryHandle = TF_LoadLibrary(path, status)

// Report the status code and message from the load attempt.
print(TF_GetCode(status))
print(String(cString: TF_Message(status)!))
TF_DeleteStatus(status)

and the message being printed is the following:

dlopen(/Users/avolodin/Downloads/custom-op-master/tensorflow_zero_out/cc/zero_out.so, 6): Symbol not found: __ZTIN10tensorflow8OpKernelE
  Referenced from: /Users/avolodin/Downloads/custom-op-master/tensorflow_zero_out/cc/zero_out.so
  Expected in: flat namespace
 in /Users/avolodin/Downloads/custom-op-master/tensorflow_zero_out/cc/zero_out.so
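
For reference, `__ZTIN10tensorflow8OpKernelE` is the Mach-O spelling of the Itanium-mangled name `_ZTIN10tensorflow8OpKernelE`, i.e. the typeinfo for `tensorflow::OpKernel`. A minimal sketch for checking whether a given library exports that symbol, using only `dlopen`/`dlsym`, might look like this. The helper name `exportsSymbol` is hypothetical, and the `libtensorflow.dylib` file name and location are assumptions based on the `-ltensorflow -L...` flags above.

```swift
import Foundation

/// Returns true when `symbol` can be resolved in the shared library at `path`
/// (or in one of its dependencies). Returns false if the library cannot be
/// opened or the symbol is not found.
func exportsSymbol(_ symbol: String, inLibraryAt path: String) -> Bool {
    guard let handle = dlopen(path, RTLD_LAZY | RTLD_LOCAL) else {
        if let message = dlerror() {
            print("dlopen failed: \(String(cString: message))")
        }
        return false
    }
    defer { dlclose(handle) }
    return dlsym(handle, symbol) != nil
}

// Assumed library name and location; adjust to the toolchain actually installed.
let libraryPath = "/Library/Developer/Toolchains/swift-tensorflow-RELEASE-0.7.xctoolchain/usr/lib/swift/macosx/libtensorflow.dylib"
// dlsym expects the name without the extra leading underscore shown in the dlopen error.
let typeInfoSymbol = "_ZTIN10tensorflow8OpKernelE"
print(exportsSymbol(typeInfoSymbol, inLibraryAt: libraryPath))
```

If this prints `false` for the toolchain's TensorFlow library, the custom op's reference to the symbol has nothing to bind to at load time, which would match the error above.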

After some investigation, I think this is happening because Swift's build-script-impl specifies --define framework_shared_object=false, which, according to the official TF docs, removes custom op symbols:

Note that :framework and :lib have incomplete transitive dependencies (they declare but do not define some symbols) if framework_shared_object=True (meaning there is an explicit framework shared object). Missing symbols are included in //tensorflow:libtensorflow_framework.so. This split supports custom op registration; see comments on //tensorflow:libtensorflow_framework.so. It does mean that TensorFlow cc_test and cc_binary rules will not build. Using tf_cc_test and tf_cc_binary (from //tensorflow/tensorflow.bzl) will include the necessary symbols in binary build targets.

s1ddok avatar Feb 25 '20 09:02 s1ddok

Thanks for filing this issue and doing some thorough investigation!

As you noted, libtensorflow_framework.so seems necessary for your custom op registration. I'm not sure we've tried custom TensorFlow op registration in open source, so this is cool! I'm not very familiar with libtensorflow_framework.so.

Previously (during Swift for TensorFlow 0.2 - 0.5?), we did build libtensorflow_framework.so. However, we removed it: I think because it wasn't/isn't needed for tensorflow/swift-apis and had no known users, but maybe also because it caused some linker errors. @pschuh: do you remember why we removed it?

We could try building a toolchain with libtensorflow_framework.so (via --define framework_shared_object=true) to see if that resolves your issue.

dan-zheng avatar Feb 25 '20 09:02 dan-zheng

I got it working!

It was fairly easy (aside from waiting 3 hours for my MBP to build the whole toolchain from scratch). I did two things:

  1. Made the makeOp function public so that users can create their own ops
  2. Changed the Bazel flag to framework_shared_object=true

And it just worked!
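
With `makeOp` public, a user-side wrapper for the sample op could look roughly like the sketch below. It assumes the patched toolchain described above, that zero_out.so has already been loaded with `TF_LoadLibrary`, and that the op is registered under the name `ZeroOut` with one `int32` input and one `int32` output, as in the TF custom-op sample. The `makeOp`, `addInput` and `execute` calls follow the pattern of the generated raw ops in tensorflow/swift-apis; their exact signatures are an assumption here, and `zeroOut` is a hypothetical name.

```swift
import TensorFlow

// Hypothetical wrapper around the sample `ZeroOut` op.
// Assumes a toolchain where `makeOp` is public and the op library is already loaded.
func zeroOut(_ input: Tensor<Int32>) -> Tensor<Int32> {
    // Op name as registered by the sample; the op produces a single output.
    let op = makeOp("ZeroOut", 1)
    op.addInput(input)
    return op.execute(Int(1))
}

let zeroed = zeroOut(Tensor<Int32>([1, 2, 3, 4]))
print(zeroed)
```

Keeping the wrapper as a plain function mirrors how the built-in raw ops are exposed, so the custom op can be called like any other tensor operation.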

s1ddok avatar Feb 25 '20 21:02 s1ddok

I've opened a PR to make the function public here, but patching build-script-impl turned out to be difficult, since tensorflow and tensorflow-0.7 have now diverged too much in terms of build-script-impl. Maybe you can guide me here?

s1ddok avatar Feb 25 '20 21:02 s1ddok