
The Java Tensorflow library does not seem to be using GPU

Open tmichniewski opened this issue 3 years ago • 61 comments

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NO
  • TensorFlow installed from (source or binary): from https://oss.sonatype.org/
  • TensorFlow version (use command below): 2.3.1
  • Python version: 3.7.7
  • Bazel version (if compiling from source): NO
  • GCC/Compiler version (if compiling from source): NO
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: Tesla K80, compute capability 3.7 (but we also tested this on a Tesla V100 with compute capability 7.0)

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Here is the result of the capture script: tf_env.txt

Describe the current behavior We tested the new TensorFlow Java API (not the legacy one), the brand new version released in October 2020. We tested it on several machines, including Azure Databricks NC6_v3 and Azure Virtual Machines (the capture log is from the virtual machine). I noticed that when no GPU is available the library falls back to CPU, and this is fine. However, we also measured the time for some example processing (a few vector operations), and we see no significant difference between the processing time on GPU and on CPU. It looks as if the GPU is not being used, even when it is present (we tried two graphics cards: Tesla K80 with compute capability 3.7 and Tesla V100 with compute capability 7.0). In both cases we see no difference in processing time.

Describe the expected behavior The expected behaviour is significantly shorter execution times when the program is executed on a machine with a GPU present.

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.

We used the following Java program:

HelloTensorFlow_java.txt pom_xml.txt

The source was compiled to class file and it was run via the following command: java -classpath protobuf-java-3.8.0.jar:ndarray-0.2.0.jar:javacpp-1.5.4.jar:javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.2.0.jar:tensorflow-core-api-0.2.0-linux-x86_64-gpu.jar:tensorflow-core-platform-gpu-0.2.0.jar:. HelloTensorFlow

The listed libraries were downloaded from https://oss.sonatype.org/.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The enclosed program produces the following log: log.txt

From the log you can see that the GPU was present and recognized. However, the execution time did not differ between the runs with and without the GPU.

tmichniewski avatar Nov 03 '20 14:11 tmichniewski

Hi @tmichniewski ,

Can you assign explicitly an operation to a GPU device and see if that works?

Now, there is an ongoing effort to make device selection easier in TF Java; right now you unfortunately need to use the lower-level API for building an operation.

For example, instead of simply calling tf.math.add(x, y), you need to do something like:

DeviceSpec gpuSpec = DeviceSpec.newBuilder().deviceType(DeviceType.GPU).deviceIndex(0).build();
...
graph.opBuilder(Add.OP_NAME, "Add1").setDevice(gpuSpec.toString()).addInput(x.asOutput()).addInput(y.asOutput()).build();

karllessard avatar Nov 03 '20 14:11 karllessard

You can also log device placement by passing explicitly a ConfigProto message when instantiating your Session, using this constructor
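
For example, a minimal untested fragment, assuming the generated ConfigProto class that ships with tensorflow-core-api 0.2.0 under org.tensorflow.proto.framework, and a graph already built as above:

ConfigProto config = ConfigProto.newBuilder()
    .setLogDevicePlacement(true) // the native runtime then prints each op's assigned device to stderr
    .build();
try (Session session = new Session(graph, config)) {
  // run the session as usual; the placement log shows whether ops land on /GPU:0 or /CPU:0
}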

karllessard avatar Nov 03 '20 14:11 karllessard

Hello @karllessard, Referring to the first recommendation: are you sure that we can call deviceType on DeviceSpec.newBuilder()? My IntelliJ complains about this.

tmichniewski avatar Nov 03 '20 14:11 tmichniewski

Basically I am looking for the simplest solution, just for testing the performance.

tmichniewski avatar Nov 03 '20 14:11 tmichniewski

BTW - I am using version 0.2.0.

tmichniewski avatar Nov 03 '20 14:11 tmichniewski

@tmichniewski , to be honest I haven't tested the code snippet I posted, but something along those lines should work. You can also simply pass the device spec directly as a string for now, as described here. I'd also be curious to see what device placement logging tells you, if you get a chance to activate it.
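
For what it's worth, here is an untested end-to-end sketch of that approach for a simple vector addition, using a hardcoded device string and the low-level builder only for the op we want pinned (the class name and op name are just illustrative):

import org.tensorflow.Graph;
import org.tensorflow.Output;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.math.Add;
import org.tensorflow.types.TFloat32;

public class AddOnGpu {
  public static void main(String[] args) {
    try (Graph graph = new Graph()) {
      Ops tf = Ops.create(graph);
      // Inputs are built with the high-level API; only the Add op is pinned explicitly.
      Constant<TFloat32> x = tf.constant(new float[] {1f, 2f, 3f});
      Constant<TFloat32> y = tf.constant(new float[] {4f, 5f, 6f});
      Output<TFloat32> sum = graph.opBuilder(Add.OP_NAME, "AddOnGpu")
          .setDevice("/GPU:0")          // pin this op to the first GPU
          .addInput(x.asOutput())
          .addInput(y.asOutput())
          .build()
          .output(0);
      try (Session session = new Session(graph);
           Tensor<TFloat32> result =
               session.runner().fetch(sum).run().get(0).expect(TFloat32.DTYPE)) {
        System.out.println(result.data().getFloat(0)); // expected: 5.0
      }
    }
  }
}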

karllessard avatar Nov 03 '20 15:11 karllessard

I already used the hardcoded string "/GPU:0".

tmichniewski avatar Nov 03 '20 15:11 tmichniewski

Hello @karllessard, Well, I ran a test on an Azure Databricks NC6s_v3 machine with a Tesla V100 graphics card (compute capability 7.0). With the device set to "/CPU:0" the execution time is 0.23s. With the device set to "/GPU:0" the execution time is 0.62s. So not only does it not reduce the execution time, it actually increases it. The current code is as follows: HelloTensorFlow_java.txt

tmichniewski avatar Nov 03 '20 15:11 tmichniewski

BTW - on this cluster in Python we got the following times: CPU 1.00s, GPU 0.25s. PythonHelloTensorFlow_py.txt So in Python the GPU is 4 times faster in this exercise, but in Java the GPU version is slower than the CPU version.

tmichniewski avatar Nov 03 '20 15:11 tmichniewski

Maybe I should also set the device on the session or the graph?

tmichniewski avatar Nov 03 '20 16:11 tmichniewski

But then it would make no sense to set it at the operation level. Do you have any working example of how to perform, let's say, vector addition on GPU?

tmichniewski avatar Nov 03 '20 16:11 tmichniewski

Why not try running the Java version in eager mode, like the Python version?

Craigacp avatar Nov 03 '20 16:11 Craigacp

Firstly, because eager mode is only for development, not production. Secondly - well - for tests of course I might use it, but do you have some working example showing how to pass the device on which TF should execute? Originally I used eager mode, but Karl (in his first comment) suggested using the low-level API.

tmichniewski avatar Nov 03 '20 16:11 tmichniewski

But the Python example is in eager mode, so let's rule out differences in the TF runtime engine first, before moving on to issues inside the Java API. I'm sure there are speed issues in the Java API, but it's best to start with equivalent metrics. Either run the Java one in eager mode, or the Python one as a tf.function.
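
An untested sketch of what the eager-mode equivalent could look like in Java (assuming the 0.2.0 API, where Operand.data() gives a read-only view of the eagerly computed tensor):

import org.tensorflow.EagerSession;
import org.tensorflow.op.Ops;
import org.tensorflow.op.math.Add;
import org.tensorflow.types.TFloat32;

public class EagerAdd {
  public static void main(String[] args) {
    try (EagerSession session = EagerSession.create()) {
      Ops tf = Ops.create(session);
      // In eager mode each op runs immediately; with the GPU artifacts on the
      // classpath the runtime should place kernels on the GPU by default.
      Add<TFloat32> sum = tf.math.add(
          tf.constant(new float[] {1f, 2f, 3f}),
          tf.constant(new float[] {4f, 5f, 6f}));
      System.out.println(sum.data().getFloat(0)); // expected: 5.0
    }
  }
}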

Craigacp avatar Nov 03 '20 16:11 Craigacp

@Craigacp Hello Adam,

I am not comparing Python times to Java times. I am trying to compare GPU time to CPU time. The Python code was only to show that the GPU version was roughly 4x faster than the CPU one. Or in other words, I am trying to run the TF Java API on GPU, so far without success.

Right - in Python it was in eager mode. The same in the original Java version, I guess. But this issue is about how to make the Java API execute on GPU. So far I was pointed to the low-level API, but that does not work. That is why I ask for either of the following:

  • either some hint on how to correctly pass the device to the API to make it work on GPU (in eager or graph mode),
  • or a link to some working example (the API is quite fresh and there are not that many examples/tutorials on the net).

tmichniewski avatar Nov 03 '20 17:11 tmichniewski

Moreover, Karl also pointed me in his third comment to the following description, which clearly states that if there is a choice, the GPU takes precedence:

"If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels. On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device."

tmichniewski avatar Nov 03 '20 18:11 tmichniewski

I cannot think of any reason why running graph ops on a GPU in Java would be slower than in Python, since this is all handled by the C++ core library shared by both languages. So if that is really what we are observing, then it has to be a misconfiguration somewhere.

Another idea: @tmichniewski, can you check if the performance goes back to normal when you remove the line inside the loop where you fetch the result from the tensor? Maybe you are fetching data from GPU memory into the JVM and that is doing extra copies that are not required when using Python.

karllessard avatar Nov 03 '20 19:11 karllessard

I also do not believe that the Java version could be slower than the Python one. But this is not the issue here. The problem is that it seems the GPU is not used at all.

The previous version of this library, 2.3.0, the so-called legacy one (since last weekend), was saying that the kernel image was incorrect, and the suggestion was to build TF from sources. But here the library silently falls back to CPU and you never know whether the GPU is used except by measuring time.

Referring to the second paragraph: well, I thought it is always necessary to get the data back from the tensor. BTW, we read only one double value, because the last operation in the graph is reduce_sum, so we fetch 8 bytes instead of a big vector. Therefore it cannot be the reason for the slow performance.

PS. Do you test it on GPU? I guess so, therefore I think you should have some observations and know-how on how to make it work on GPU.

tmichniewski avatar Nov 03 '20 19:11 tmichniewski

Maybe there is some issue with the graphics card compute capability. In our case it is 7.0, and we also tried 3.7. Does this library work with cards with such parameters? If not, this would explain why it silently falls back to CPU.

Or maybe the issue is that the GPU device selection is not specified correctly. So I ask for some example of how to do this, for example how to perform vector addition on GPU. A very simple example. So far I have provided two versions, eager and graph, but neither uses the GPU.

tmichniewski avatar Nov 03 '20 20:11 tmichniewski

Sorry @tmichniewski, but I'm not personally using GPUs nor do I have access to one right now, so I might not be the best person to provide a concrete working example here; maybe @zaleslaw or @saudet can chime in?

Again, I think logging device placement might reveal some explanations as well; see the Session constructor link I provided earlier.

karllessard avatar Nov 03 '20 22:11 karllessard

I haven't been testing it really closely, but I know at least @roywei and @lanking520 have been using the GPU builds without too much trouble.

saudet avatar Nov 04 '20 06:11 saudet

Hello @karllessard, @saudet, @roywei, @zaleslaw and @lanking520,

I can also confirm that it is possible to use the GPU builds without trouble. But the key question here is whether the GPU builds really use the GPU if it is present on a machine. My tests seem to say that NO, they are not using the GPU. At least, so far I do not know how to make them use the GPU.

I provided two sample Java programs. The first does a simple vector computation, where I expected that providing the GPU versions of the libraries would make TF use the GPU, but that was not the case. The second follows Karl's recommendation, where I additionally specified the device in the graph operations. Also without success.

Now, since this library silently falls back to CPU if a GPU is not present, or if for some reason the library is not willing to use it, how do you perform the unit tests?

One possibility is comparing the execution times, but this is really delicate. What are the other possibilities? Because I think all the tests may pass after the library falls back to CPU, while they should fail when a GPU is present on the machine and requested but not used.

In my opinion, there should be a way in the API to request computation on GPU, and if this cannot be satisfied, then the unit tests and the whole API should raise an exception saying that the computation cannot be executed on the requested GPU, and for what reason.

Concluding, I repeat my original question: how do I convince the TensorFlow Java API to use the GPU?

Please point me to any working example or provide any working hint. Could you? PS. By working I mean not only running, but actually using the GPU if present.

tmichniewski avatar Nov 04 '20 07:11 tmichniewski

@tmichniewski hi, we were able to use the GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You can also try DJL, which wraps the TF Java package and makes it easy to switch to the GPU automatically at runtime.

You can play around with our Jupyter notebook in Java: http://docs.djl.ai/jupyter/tensorflow/pneumonia_detection.html

If you have a GPU device and CUDA 10.1 installed, it will use the GPU out of the box without a single line of code changes.

lanking520 avatar Nov 04 '20 08:11 lanking520

Hello @lanking520,

Referring to:

"we were able to use GPU to run inference. For GPU use case, you need to make sure your operator and model are loaded on GPU device." seems to be contradicting to your next statement: "If you have GPU device and install CUDA 10.1, it will use GPU out of box without a single line of the code changes."

So far I have added setDevice("/GPU:0") to every graph operation, as in the code below, just to be 100% sure where the computation is being executed: HelloTensorFlow_java_2.txt

But the results of my test computation on vectors of 100 million entries (so big enough), executed 1000 times, do not show any time difference between GPU and CPU. In fact the results are almost the same (6.0s vs. 6.1s).

On the contrary, when I executed the same test in Python in eager mode I got 4.5s on GPU and 259s on CPU, so the difference is huge and visible. Therefore I wonder why, when using the Java API, I do not see the same difference. Could you explain it? python_test_2.txt

BTW - I am not comparing Python to Java or eager to graph execution. I am comparing GPU to CPU execution, where in Python I see a huge difference while in Java I see no difference.

PS. I do not want to run these DJL samples, which bring a lot of new elements and changing parameters into the picture, so in general this may only complicate my tests instead of simplifying them. Moreover, if DJL wraps the TF Java package, then it changes nothing as long as I do not see the TF Java package working well on GPU.

Maybe the TF Java API is well optimized for deep learning computations but not for very simple vector operations.

tmichniewski avatar Nov 04 '20 13:11 tmichniewski

In general I think that there might be three possibilities:

  1. The first possibility is that I might be doing something wrong in my test program; however, I think you have had a chance to look at it. Anyway, if this is the case, then I simply ask for some support, or just a reference to some simple example program with vector operations, not with sophisticated neural network computations.
  2. The second option is that the TF Java API might not be using the GPU. In this case I would ask for some hints on how to force it to use the GPU.
  3. Finally, there might also be some bug in the API, which sees the GPU and tries to use it but silently falls back to CPU. In this case you will know best how to handle it.

So my question is: which situation are we dealing with? And how do we diagnose it?

tmichniewski avatar Nov 04 '20 13:11 tmichniewski

Hello @karllessard, Referring to DeviceSpec.newBuilder().deviceType(...) - I cannot use this construction because the deviceType method is called on the Builder object, which is package-private and cannot be used outside of the org.tensorflow package. This is why I could not use it. PS. This is just a comment, because the code snippet you provided was perfectly valid, but only for you. :-)

tmichniewski avatar Nov 04 '20 14:11 tmichniewski

Mr. Tomasz, I have also faced a similar issue with the API (a missing method at the operand level, available only on the builder). Did you have a chance to experiment with the previous version of the Java API (1.15, for example)? I found a very big difference for LeNet model training with and without GPU. It could be important to understand what was lost when we switched from 1.x to 2.x.

Agreed that DeviceSpec cannot be used directly in the API, but it could be used as a builder to build a correct device string (if required).

Alex


zaleslaw avatar Nov 04 '20 14:11 zaleslaw

Hello Alex @zaleslaw,

  • 1.15.0 - the TensorFlow 1 API, whose most recent commit in the Maven repository was in October 2019 (we skipped this from the very beginning as obsolete),
  • 2.3.0 - the so-called legacy API since last weekend (in Python and in the Java API + Scala, but this version did not work in Java/Scala on our machines with GPU compute capability 3.7 and 7.0 - it was always saying that the kernel image is incorrect and that we should build TF from sources),
  • 0.2.0 - the newest API, released in October 2020 (Java API).

I experimented on Azure Databricks clusters, but also on Azure Virtual Machines instantiated on request.

So far I have not had a single result with the Java API where I could see: OK, computing on GPU is X times faster. Not a single result.

We managed to see such results only in Python, never with the Java API. That is why I started wondering what is wrong, because basically we should be able to observe that GPU processing is sometimes faster, the same as we see in Python.

At the moment we are focusing only on this new API, as last week we spent a lot of time on the legacy version 2.3.0. We even started to analyse how to build it from sources.

Well - maybe there is the same issue as in 2.3.0 - I mean the Java API recognizes that the graphics card compute capability is too low and silently falls back to CPU.

In my opinion I managed to force the API to use the specified GPU, but apparently it is not being used. Maybe this fall-back-to-CPU logic should be guarded somehow: if the requirement is to process on GPU and the GPU cannot be used (it is not present or has the wrong compute capability), then maybe an error should be thrown, not a warning or nothing at all. You know, if the developer/user requests processing on GPU and this is not possible, then "blue screen". Otherwise we end up in situations like this, discussing what might be going on.

tmichniewski avatar Nov 04 '20 15:11 tmichniewski

On our test platform we benchmarked the ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find my issue filed here). However, if you compare CPU (AWS EC2 c5.2xlarge) at around 40 ms with GPU (AWS EC2 p3.2xlarge) at around 5.7 ms, the GPU is in fact much faster than the CPU.

We use the TF 2.3.1 build with 0.2.0.

But it seems TF2 is doing something weird that causes everything to be slow. We also did some benchmarks on CPU performance and saw that TF1 CPU latency is about 2/3 of TF2's on the ResNet50 model with the Java API. But the same problem exists in Python too... So nothing to blame. :)

In fact, based on your comments:

"GPU or CPU. In fact the results are almost the same (6.0s. vs. 6.1s.)."

"when I executed the same test on Python in eager mode I got 4.5s. on GPU and 259s. on CPU"

It seems you are running everything on the GPU with TF Java :).

lanking520 avatar Nov 04 '20 17:11 lanking520

Referring to your last statement: well, I set up another cluster without a GPU, so CPU only, with a similar core class, and got 7.1s.

Then I ran this on a laptop with CPU only and also got similar results, around 6.6s. So these Java results on machines with CPU only are much closer to Python on GPU.

But of course these are just comments on the results. Let us focus on facts. To get the facts I need some method to check where my processing is executed. Is it possible to turn on some logging, where the API would tell with 100% certainty on which device it is executing?

You know what, I believe that in your test you get better results on GPU than on CPU. But you have a different machine than I do. Could you confirm that in my case it is impossible that the TF Java API falls back to CPU for some reason, for example because of compute capability? The former Java API version 2.3.0 had problems with such compute capabilities. But the new version 0.2.0 might just fall back to CPU instead of issuing an error. This is how I interpret my test results, because the GPU and CPU results are almost the same.

Now, the question is how to diagnose it. So far it is far from intuitive.

tmichniewski avatar Nov 04 '20 19:11 tmichniewski