Strategies for getting Tensorflow-Java on Apple Silicon?

Open kgoderis opened this issue 4 years ago • 71 comments

Like some others, I need to get TensorFlow Java running on an M1-based machine, especially now that Apple has released a TensorFlow distribution for M1.

[I know there is https://github.com/tensorflow/java/issues/252 but I want to revive the discussion after Apple's recent efforts]

Before even attempting this, I was wondering whether any of the following strategies make sense or, better still, actually work:

  1. Compile from source, targeting the arm64 arch, using arm64 tools (e.g. Bazel), and run java.jar using an arm64 JVM like Zulu 8.58.0.13-CA-macos-aarch64

[This one fails based on the current HEAD (java.lang.NoSuchMethodError: 'java.lang.Iterable com.sun.tools.javac.code.Scope$WriteableScope.getSymbolsByName(com.sun.tools.javac.util.Name, com.sun.tools.javac.util.Filter)'). It does not even get to the TF native build phase.]

  2. Compile from source using x86 tools (e.g. in an "arch -x86_64 zsh" shell), taking into account specific guidelines, e.g. removing the usage of specific instruction sets. Then run java.jar using an x86 JVM, i.e. under Rosetta

  3. Any other angle to look at the problem?

[For that matter, how to leverage other ML frameworks on M1, e.g. deeplearning4j?]

kgoderis avatar Nov 08 '21 12:11 kgoderis

Note to self: it seems that the above error is due to me building against a 1.8 JDK instead of something more recent.

kgoderis avatar Nov 08 '21 14:11 kgoderis

Don't try to compile TF-Java under Rosetta: you'll pull in a TF binary containing AVX instructions, which will cause a SIGILL and take down the JVM.
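To tell whether a given JVM is a native arm64 build or an x86 build (which on an M1 Mac means it runs under Rosetta), one can look at the os.arch system property. The following is only a minimal sketch: the class name JvmArch and the returned labels are made up for illustration, and it assumes the JVM reports "aarch64" for arm64 builds and "x86_64" or "amd64" for Intel builds, as OpenJDK-based distributions do.

```java
// Sketch: report which CPU architecture the current JVM was built for.
// JvmArch and its labels are illustrative names, not part of TF-Java.
final class JvmArch {

    /** Maps an os.arch value to a coarse label: "arm64", "x86_64" or "other". */
    static String classify(String osArch) {
        switch (osArch) {
            case "aarch64":
                return "arm64";
            case "x86_64":
            case "amd64":
                return "x86_64";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        String osArch = System.getProperty("os.arch");
        String label = classify(osArch);
        System.out.println("os.arch = " + osArch + " (" + label + ")");
        if (label.equals("x86_64")) {
            // On Apple Silicon an x86_64 JVM can only be running under Rosetta.
            System.out.println("On an M1 Mac this JVM would be running under Rosetta.");
        }
    }
}
```

Running this with the same java binary that Maven and Bazel pick up shows whether the build toolchain is native or translated.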

I've not tried to compile it on an M1 since we bumped to TF 2.6.0 and made some build changes, but I can take a look at doing that. Theoretically you should be able to run mvn package and have it build everything, but I think you'll need to be in a venv which has a version of numpy installed, and to be running Bazel natively rather than via Rosetta. After that it comes down to a bunch of weird configuration things in Bazel which we might not be patching appropriately.

As for other ML frameworks, I've personally got XGBoost and ONNX Runtime working in Java on an M1 Mac and contributed any fixes back upstream. We had ONNX Runtime working a month or two after the M1 came out. Anything that's in pure Java will work just fine on an M1, but I've not looked at dl4j or djl which both have large native libraries inside.

Craigacp avatar Nov 08 '21 15:11 Craigacp

@Craigacp So building under Rosetta with a TF build config file without the AVX instructions is not an option then? I was not aware that it pulls a TF binary; I was under the impression that it pulls the TF repo and compiles TF as part of the TF-J build process.

kgoderis avatar Nov 08 '21 16:11 kgoderis

@Craigacp Any pointers on how to get ONNX going? Because this is what I get on the home page, LOL: [image]

[Edit: I presume you did some cross-compiling to get it to work. Going through the docs right now...]

kgoderis avatar Nov 08 '21 16:11 kgoderis

> @Craigacp So building under Rosetta with a TF build config file without the AVX instructions is not an option then? I was not aware that it pulls a TF binary; I was under the impression that it pulls the TF repo and compiles TF as part of the TF-J build process.

Java is slow under Rosetta as it messes with the JIT. You could compile TF without AVX support under Rosetta, but it would probably be fairly slow, and at that point I'm not sure what the utility of it is.

> @Craigacp Any pointers on how to get ONNX going? Because this is what I get on the home page, LOL: [image]
>
> [Edit: I presume you did some cross-compiling to get it to work. Going through the docs right now...]

I've not tried cross-compiling. Check out the ONNX Runtime repo on an M1 Mac and then compile it as normal for Java:

./build.sh --update --build --config Release --parallel --build_java --test

Craigacp avatar Nov 08 '21 16:11 Craigacp

> Java is slow under Rosetta as it messes with the JIT. You could compile TF without AVX support under Rosetta, but it would probably be fairly slow, and at that point I'm not sure what the utility of it is.

Well, my main dev machine is now an M1, obviously. So the utility lies in developing TF models locally and then training them on a TPU/x86 cloud-based machine. I just want to avoid any pain in my development process.

In order to compile it under Rosetta, I presume I need an x86 JVM installed on top of other x86 tools like Bazel, right?

kgoderis avatar Nov 08 '21 16:11 kgoderis

Yes, you'll need a full x86 development stack, including Python, probably including compilers as well, and then you might need to change how it finds the compilers to make sure it picks the x86 ones.

Craigacp avatar Nov 08 '21 17:11 Craigacp

Some people seem to be able to get arm64 binaries for TF 2.6.0; see, for example, https://github.com/tensorflow/tensorflow/issues/52160#issuecomment-933987607. If that is true, running the build for TF Java with a command like this should work:

BUILD_FLAGS="--cpu=darwin_arm64 --host_cpu=darwin_arm64" mvn clean install

saudet avatar Nov 08 '21 23:11 saudet

@saudet That did not work unfortunately.

I am able to compile the TensorFlow repo (https://github.com/tensorflow/tensorflow/issues/52160#issuecomment-968173580), but then TF-J fails with:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project tensorflow-core-generator: Compilation failure
[ERROR] /Users/kgoderis/Development/tensorflow-java/tensorflow-core/tensorflow-core-generator/src/main/java/org/tensorflow/proto/framework/OpListOrBuilder.java:[23,7] error: An unhandled exception was thrown by the Error Prone static analysis plugin.
[ERROR]      Please report this at https://github.com/google/error-prone/issues/new and include the following:
[ERROR]   
[ERROR]      error-prone version: 2.6.0
[ERROR]      BugPattern: JavaLangClash
[ERROR]      Stack Trace:
[ERROR]      java.lang.NoSuchMethodError: 'java.lang.Iterable com.sun.tools.javac.code.Scope$WriteableScope.getSymbolsByName(com.sun.tools.javac.util.Name, com.sun.tools.javac.util.Filter)'
[ERROR]   	at com.google.errorprone.bugpatterns.JavaLangClash.check(JavaLangClash.java:66)
[ERROR]   	at com.google.errorprone.bugpatterns.JavaLangClash.matchClass(JavaLangClash.java:53)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.processMatchers(ErrorProneScanner.java:450)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitClass(ErrorProneScanner.java:548)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitClass(ErrorProneScanner.java:151)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCClassDecl.accept(JCTree.java:860)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreePathScanner.scan(TreePathScanner.java:86)
[ERROR]   	at com.google.errorprone.scanner.Scanner.scan(Scanner.java:74)
[ERROR]   	at com.google.errorprone.scanner.Scanner.scan(Scanner.java:48)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreeScanner.scan(TreeScanner.java:111)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreeScanner.scanAndReduce(TreeScanner.java:119)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreeScanner.visitCompilationUnit(TreeScanner.java:152)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitCompilationUnit(ErrorProneScanner.java:561)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitCompilationUnit(ErrorProneScanner.java:151)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCCompilationUnit.accept(JCTree.java:614)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreePathScanner.scan(TreePathScanner.java:60)
[ERROR]   	at com.google.errorprone.scanner.Scanner.scan(Scanner.java:58)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScannerTransformer.apply(ErrorProneScannerTransformer.java:43)
[ERROR]   	at com.google.errorprone.ErrorProneAnalyzer.finished(ErrorProneAnalyzer.java:152)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.api.MultiTaskListener.finished(MultiTaskListener.java:132)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.flow(JavaCompiler.java:1394)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.flow(JavaCompiler.java:1341)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.compile(JavaCompiler.java:933)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.Main.compile(Main.java:317)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.Main.compile(Main.java:176)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.Main.compile(Main.java:64)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.Main.main(Main.java:50)
[ERROR] 

On the other hand, diving into ./tensorflow-core/tensorflow-core-api, where the TF core should be built, I was able to start the compilation with

sudo bazel build --config opt --cpu=darwin_arm64 --host_cpu=darwin_arm64 --incompatible_restrict_string_escapes=false --experimental_repo_remote_exec --define=ABSOLUTE_JAVABASE=/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home --host_javabase=@bazel_tools//tools/jdk:absolute_javabase //:all

but it soon exited with errors like these:

external/org_tensorflow/tensorflow/core/platform/default/port.cc:360:14: error: no matching constructor for initialization of 'tensorflow::port::MemoryInfo'
  MemoryInfo mem_info = {INT64_MAX, INT64_MAX};
             ^          ~~~~~~~~~~~~~~~~~~~~~~
external/org_tensorflow/tensorflow/core/platform/mem.h:62:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 2 were provided
struct MemoryInfo {
       ^
external/org_tensorflow/tensorflow/core/platform/mem.h:62:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 2 were provided
external/org_tensorflow/tensorflow/core/platform/mem.h:62:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 2 were provided
external/org_tensorflow/tensorflow/core/platform/default/port.cc:373:23: error: no matching constructor for initialization of 'tensorflow::port::MemoryBandwidthInfo'
  MemoryBandwidthInfo membw_info = {INT64_MAX};
                      ^            ~~~~~~~~~~~
external/org_tensorflow/tensorflow/core/platform/mem.h:67:8: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'long long' to 'const tensorflow::port::MemoryBandwidthInfo' for 1st argument
struct MemoryBandwidthInfo {
       ^
external/org_tensorflow/tensorflow/core/platform/mem.h:67:8: note: candidate constructor (the implicit move constructor) not viable: no known conversion from 'long long' to 'tensorflow::port::MemoryBandwidthInfo' for 1st argument
external/org_tensorflow/tensorflow/core/platform/mem.h:67:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 1 was provided

kgoderis avatar Nov 14 '21 11:11 kgoderis

Update: Bumping Google's Error Prone to <errorprone.version>2.10.0</errorprone.version> fixes this error.

kgoderis avatar Nov 14 '21 11:11 kgoderis

Update: Changing .bazelrc in tensorflow-core/tensorflow-core-api to

build --remote_cache=https://storage.googleapis.com/tensorflow-sigs-jvm
build --remote_upload_local_results=false
build --action_env PYTHON_BIN_PATH="/Users/kgoderis/miniforge3/bin/python3"
build --action_env PYTHON_LIB_PATH="/Users/kgoderis/miniforge3/lib/python3.9/site-packages"
build --python_path="/Users/kgoderis/miniforge3/bin/python3"
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac,-v1only
build --incompatible_restrict_string_escapes=false 
build --experimental_repo_remote_exec
build --define=ABSOLUTE_JAVABASE=/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home
build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase 

and, in addition, bumping .bazelversion to 4.2.1 and changing build.sh to make Bazel run under sudo

gets the compilation of that Maven module going. The config still contains references to the setup on my dev machine, but we are advancing ;-)
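One possible way to avoid hard-coding the JDK path in .bazelrc is to generate the two JDK-related lines from the environment. This is only a sketch under stated assumptions: the class BazelrcJavabase is hypothetical, it reuses the two flags exactly as they appear in the config above, and it assumes JAVA_HOME (or, failing that, the running JVM's java.home) points at the JDK you want Bazel to use. Note that on a Java 8 runtime java.home typically points at the jre subdirectory rather than the JDK root.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: print the JDK-specific .bazelrc lines for the local machine.
// BazelrcJavabase is a hypothetical helper, not part of the TF-Java build.
final class BazelrcJavabase {

    /** Builds the two JDK-related .bazelrc lines for the given JDK home. */
    static List<String> javabaseLines(String jdkHome) {
        return Arrays.asList(
                "build --define=ABSOLUTE_JAVABASE=" + jdkHome,
                "build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase");
    }

    public static void main(String[] args) {
        // Prefer JAVA_HOME when set; otherwise fall back to the running JVM.
        String jdkHome = System.getenv("JAVA_HOME");
        if (jdkHome == null || jdkHome.isEmpty()) {
            jdkHome = System.getProperty("java.home");
        }
        for (String line : javabaseLines(jdkHome)) {
            System.out.println(line);
        }
    }
}
```

The output could be appended to a local, uncommitted copy of .bazelrc so that the checked-in file stays free of per-machine paths.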

kgoderis avatar Nov 14 '21 11:11 kgoderis

Some of you will be happy. I got the whole thing compiled; however, I had to skip the tests, as the build was failing on that part, and there were some warnings about TARGET_OS_IPHONE. Apart from that, it kinda looks good:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for TensorFlow Java Parent 0.4.0-SNAPSHOT:
[INFO] 
[INFO] TensorFlow Java Parent ............................. SUCCESS [  0.476 s]
[INFO] TensorFlow Core Parent ............................. SUCCESS [  0.009 s]
[INFO] TensorFlow Core Generators ......................... SUCCESS [  0.277 s]
[INFO] TensorFlow Core API Library ........................ SUCCESS [ 48.684 s]
[INFO] TensorFlow Core API Library Platform ............... SUCCESS [  0.020 s]
[INFO] TensorFlow Framework Library ....................... SUCCESS [  0.069 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  49.599 s
[INFO] Finished at: 2021-11-14T13:39:12+01:00
[INFO] ------------------------------------------------------------------------
(base) kgoderis@Karels-M1-MacBook-Pro target % pwd
/Users/kgoderis/Development/tensorflow-java/tensorflow-core/tensorflow-core-api/target
(base) kgoderis@Karels-M1-MacBook-Pro target % ls -la 
total 180432
drwxr-xr-x  12 root      staff       384 Nov 14 13:19 .
drwxr-xr-x  17 kgoderis  staff       544 Nov 14 13:16 ..
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 classes
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 generated-sources
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 generated-test-sources
drwxr-xr-x   3 root      staff        96 Nov 14 13:17 maven-archiver
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 maven-status
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 native
drwxr-xr-x  82 root      staff      2624 Nov 14 13:19 surefire-reports
-rw-r--r--   1 root      staff  78551487 Nov 14 13:32 tensorflow-core-api-0.4.0-SNAPSHOT-macosx-arm64.jar
-rw-r--r--   1 root      staff   8245523 Nov 14 13:32 tensorflow-core-api-0.4.0-SNAPSHOT.jar
drwxr-xr-x   5 root      staff       160 Nov 14 13:17 test-classes
(base) kgoderis@Karels-M1-MacBook-Pro tensorflow % pwd
/Users/kgoderis/Development/tensorflow-java/tensorflow-core/tensorflow-core-api/bazel-bin/external/org_tensorflow/tensorflow
(base) kgoderis@Karels-M1-MacBook-Pro tensorflow % file libtensorflow_framework.2.6.0.dylib
libtensorflow_framework.2.6.0.dylib: Mach-O 64-bit dynamically linked shared library arm64
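The same architecture check that the file command performs above can be approximated from Java by reading the Mach-O header, which may be handy in a test that guards against accidentally packaging an x86 library. This is a rough sketch, not a full Mach-O parser: the class MachOArch is a made-up name, it only handles thin 64-bit little-endian Mach-O files (universal binaries are reported as "unknown"), and it relies on the standard header constants MH_MAGIC_64, CPU_TYPE_ARM64 and CPU_TYPE_X86_64.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: detect the CPU type of a thin 64-bit Mach-O file (e.g. a .dylib).
final class MachOArch {
    private static final int MH_MAGIC_64 = 0xFEEDFACF;     // 64-bit Mach-O magic
    private static final int CPU_TYPE_X86_64 = 0x01000007; // x86 | 64-bit ABI flag
    private static final int CPU_TYPE_ARM64 = 0x0100000C;  // ARM | 64-bit ABI flag

    /** Returns "arm64", "x86_64" or "unknown" given at least the first 8 bytes. */
    static String detect(byte[] header) {
        if (header.length < 8) {
            return "unknown";
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != MH_MAGIC_64) {
            return "unknown"; // not a thin 64-bit Mach-O (could be universal)
        }
        int cpuType = buf.getInt(4);
        if (cpuType == CPU_TYPE_ARM64) {
            return "arm64";
        }
        if (cpuType == CPU_TYPE_X86_64) {
            return "x86_64";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MachOArch.java <path-to-dylib>");
            return;
        }
        byte[] header = new byte[8];
        try (DataInputStream in =
                new DataInputStream(Files.newInputStream(Paths.get(args[0])))) {
            in.readFully(header); // only the first 8 bytes are needed
        }
        System.out.println(args[0] + ": " + detect(header));
    }
}
```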

kgoderis avatar Nov 14 '21 12:11 kgoderis

The build fails with Java 8 (arm64)

ERROR: /private/var/tmp/_bazel_root/6712ec151cb8fc337cc5082ff0f496e3/external/bazel_tools/tools/jdk/BUILD:346:14: Action external/bazel_tools/tools/jdk/platformclasspath.jar failed: (Exit 1): java failed: error executing command 
  (cd /private/var/tmp/_bazel_root/6712ec151cb8fc337cc5082ff0f496e3/execroot/tensorflow_core_api && \
  exec env - \
  /Library/Java/JavaVirtualMachines/zulu-8-arm64.jdk/Contents/Home/bin/java -XX:+IgnoreUnrecognizedVMOptions '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.platform=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' -cp bazel-out/darwin_arm64-opt/bin/external/bazel_tools/tools/jdk/platformclasspath_classes:/Library/Java/JavaVirtualMachines/zulu-8-arm64.jdk/Contents/Home/lib/tools.jar DumpPlatformClassPath bazel-out/darwin_arm64-opt/bin/external/bazel_tools/tools/jdk/platformclasspath.jar external/local_jdk)
Execution platform: @local_execution_config_platform//:platform
Exception in thread "main" java.lang.AssertionError: 
Could not find java.lang.Object on bootclasspath; something has gone terribly wrong.
Please file a bug: https://github.com/bazelbuild/bazel/issues
	at DumpPlatformClassPath.writeEntries(DumpPlatformClassPath.java:136)
	at DumpPlatformClassPath.writeClassPathJars(DumpPlatformClassPath.java:174)
	at DumpPlatformClassPath.dumpJDK8BootClassPath(DumpPlatformClassPath.java:77)
	at DumpPlatformClassPath.main(DumpPlatformClassPath.java:65)
ERROR: /private/var/tmp/_bazel_root/6712ec151cb8fc337cc5082ff0f496e3/external/com_google_protobuf/BUILD:290:15 Building external/com_google_protobuf/libany_proto-speed.jar (1 source jar) failed: (Exit 1): java failed: error executing command 

[Update] It seems one has to be explicit about the toolchain in .bazelrc. I added

build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase
build --javabase=@bazel_tools//tools/jdk:absolute_javabase
build --host_java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8
build --java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8

but unfortunately it fails because in pom.xml we use --add-exports flags for the JVM, which are not supported by the 1.8 JDK. I am not sure why we need these flags in the first place (I know what the flag is supposed to do), and therefore what a workaround could be. Anyone?
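For context, --add-exports belongs to the module system introduced in Java 9, so a 1.8 JDK does not know the option. A build script that has to cope with both could add the flags only on module-aware JDKs. The sketch below illustrates that idea and is not how the TF-Java pom.xml actually works: the class AddExportsFlags is hypothetical, and the package names are simply the ones visible in the Bazel command line above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: emit --add-exports flags only when the JDK understands them (9+).
final class AddExportsFlags {

    /** Parses java.specification.version, e.g. "1.8" gives 8 and "11" gives 11. */
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    /** Returns one --add-exports flag per jdk.compiler package, or none on Java 8. */
    static List<String> addExports(int feature, String... packages) {
        List<String> flags = new ArrayList<>();
        if (feature >= 9) {
            for (String pkg : packages) {
                flags.add("--add-exports=jdk.compiler/" + pkg + "=ALL-UNNAMED");
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        int feature = featureVersion(System.getProperty("java.specification.version"));
        List<String> flags = addExports(feature,
                "com.sun.tools.javac.api",
                "com.sun.tools.javac.platform",
                "com.sun.tools.javac.util");
        System.out.println("Java " + feature + " flags: " + flags);
    }
}
```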

kgoderis avatar Nov 15 '21 08:11 kgoderis

If it builds with 11 why do you need to build it with 8? It should produce Java 8 compatible jar files even when compiled on 11.

Craigacp avatar Nov 15 '21 19:11 Craigacp

> If it builds with 11 why do you need to build it with 8? It should produce Java 8 compatible jar files even when compiled on 11.

Because I want to integrate this in a project which uses Spark NLP, and that only runs on a Java 8 VM. As far as I understand, Java 11 compiled jars do not run on older JVMs.

kgoderis avatar Nov 15 '21 19:11 kgoderis

TF-Java is compiled on 11 but targets 8, and so will produce class files which are compatible with Java 8.
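If you want to convince yourself of this, one option is to extract a .class file from the built jar and read the version stored in its header: major version 52 corresponds to Java 8 and 55 to Java 11. The following is a minimal sketch; ClassFileVersion is an illustrative name, and it only inspects the first eight bytes of the file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: read the class-file major version to see which Java release it targets.
final class ClassFileVersion {

    /** Returns the major version stored in bytes 6..7 of a class file. */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8
                || (classBytes[0] & 0xFF) != 0xCA || (classBytes[1] & 0xFF) != 0xFE
                || (classBytes[2] & 0xFF) != 0xBA || (classBytes[3] & 0xFF) != 0xBE) {
            throw new IllegalArgumentException("not a class file (bad magic number)");
        }
        // Bytes 4..5 hold the minor version, bytes 6..7 the major version (big-endian).
        return ((classBytes[6] & 0xFF) << 8) | (classBytes[7] & 0xFF);
    }

    /** A Java 8 JVM loads class files up to major version 52. */
    static boolean loadableOnJava8(int major) {
        return major <= 52;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassFileVersion.java <path-to-class-file>");
            return;
        }
        int major = majorVersion(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("major version " + major
                + ", loadable on Java 8: " + loadableOnJava8(major));
    }
}
```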

Craigacp avatar Nov 15 '21 19:11 Craigacp

> TF-Java is compiled on 11 but targets 8, and so will produce class files which are compatible with Java 8.

Ah... I was not aware of this. That means we are good to go. Will you pick up what we did and get the jars onto Sonatype?

kgoderis avatar Nov 15 '21 19:11 kgoderis

I'm trying to replicate what you have on my M1 Mac so I can figure out what the test failures are, but I'm getting issues compiling protobuf.

We can't easily deploy to Maven Central, as our builds are done through GitHub Actions and they don't have any Apple Silicon runners.

Craigacp avatar Nov 15 '21 19:11 Craigacp

@Craigacp I think I solved that by altering .bazelrc, cf. https://github.com/tensorflow/java/issues/394#issuecomment-968668060

or by adding this:

build --define=ABSOLUTE_JAVABASE=/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home
build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase

I'm not sure, in fact; I did many things and documented only half of them.

kgoderis avatar Nov 15 '21 22:11 kgoderis

I had to make some modifications to the pom files to get it to build the appropriate jars, and having to run the build as the superuser is worrying to me. I also had to add my JVM as a bazel build flag too. However I didn't need to do anything else to my .bazelrc other than build --incompatible_restrict_string_escapes=false. My build does now pass all the tests with those modifications.

We're upgrading to TF 2.7.0 at the moment, I'll rerun the build after that has merged in as it might fix some of the issues.

Craigacp avatar Nov 15 '21 22:11 Craigacp

If I remember correctly, the tests failed on a dimension mismatch in the input matrix of an NN layer. Mind you, I was trying to compile against Java 8 at the time, because of my misunderstanding.

kgoderis avatar Nov 15 '21 22:11 kgoderis

Could you try and build a clean checkout of this branch - https://github.com/Craigacp/tensorflow-java/tree/apple-silicon ? It'll require sudo, and I don't want that in an actual build, but it would be a useful check if someone else can build it.

Craigacp avatar Nov 15 '21 22:11 Craigacp

@Craigacp Trying to do so. However, I need Google Error Prone bumped to 2.10.0. And what about Bazel: 3.7.2 or 4.2.1?

kgoderis avatar Nov 15 '21 22:11 kgoderis

In fact, I remember I went for Bazel 4.2.1 because there are no pre-compiled Bazel 3.7.2 binaries for macOS arm64, and I wanted to avoid compiling Bazel from source.

kgoderis avatar Nov 15 '21 22:11 kgoderis

Error Prone should work on Java 11. That build is hard-coded to expect an Azul Zulu 11 installed in the system. The Bazel version is set to 4.2.1, and the whole thing should build with mvn clean package without other modifications, the same way the x86 builds do.

Craigacp avatar Nov 15 '21 23:11 Craigacp

Are you still trying to use Bazel 4.2.1? Because that's what the error is complaining about. We require 3.7.2 because tensorflow does.

rnett avatar Nov 16 '21 00:11 rnett

I set the bazelversion to 4.2.1. Maybe it's not cleaned the build properly?

Craigacp avatar Nov 16 '21 00:11 Craigacp

It does not work

  • Zulu 11 + Error prone 2.6.0 -> error
  • still needs sudo
  • Bazel 4.2.1 needed as no arm binaries for 3.7.2 are available

kgoderis avatar Nov 16 '21 00:11 kgoderis

> Are you still trying to use Bazel 4.2.1? Because that's what the error is complaining about. We require 3.7.2 because tensorflow does.

Yes, but I got the whole thing compiled with 4.2.1 last weekend

kgoderis avatar Nov 16 '21 00:11 kgoderis

> I had to make some modifications to the pom files to get it to build the appropriate jars,

Ah, yes, we'll need to update the profiles in the pom.xml files, a bit like pull request https://github.com/bytedeco/javacpp-presets/pull/1092, for this to work. Are you saying you've already done this? Or should I do it?

saudet avatar Nov 16 '21 00:11 saudet