
Loading two different Tensorflow versions with different classloaders

Open carlosuc3m opened this issue 3 years ago • 10 comments

Hello, I am creating a program that runs inference with already trained models, and I want to allow changing the TensorFlow version at runtime. To do that, I dynamically load the JARs needed to run inference with TF Java in a separate classloader. These JARs are not on the classpath of the main program. I expected that when the ClassLoader is garbage collected, the native libraries loaded by that classloader would be unloaded too. However, that is not the case, and I am not able to load two different versions of TensorFlow in the same runtime, one after the other. The error I get when I try to execute a command with the JARs of the second TF version loaded is:

2021-12-21 14:47:11.121313: F external/org_tensorflow/tensorflow/core/framework/variant_op_registry.cc:46] Check failed: existing == nullptr (0x7ff35a2acc38 vs. nullptr)Unary VariantDecodeFn for type_name: tensorflow::data::WrappedDatasetVariant already registered

Is this behaviour expected? Is there any way I can do what I want? Regards, Carlos

carlosuc3m avatar Dec 21 '21 13:12 carlosuc3m
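For readers following along, the setup described above might look roughly like the sketch below. It is not the code used in this issue: the class name `IsolatedLoaderSketch` and the helper `newIsolatedLoader` are hypothetical, and using the platform class loader as parent assumes Java 9 or later.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class IsolatedLoaderSketch {

    /**
     * Builds a class loader that sees only the given JARs plus the JDK's own
     * classes. The application classpath is deliberately not a parent, so
     * classes from the JARs cannot leak in from the main program.
     */
    public static URLClassLoader newIsolatedLoader(List<Path> jars) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < jars.size(); i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad JAR path: " + jars.get(i), e);
            }
        }
        // Assumption: Java 9+, where getPlatformClassLoader() is available.
        return new URLClassLoader(urls, ClassLoader.getPlatformClassLoader());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: each argument is the path of one JAR
        // belonging to a single TF Java version.
        List<Path> jars = new ArrayList<>();
        for (String arg : args) {
            jars.add(Paths.get(arg));
        }
        try (URLClassLoader loader = newIsolatedLoader(jars)) {
            System.out.println("Isolated loader sees " + loader.getURLs().length + " JAR(s)");
        }
    }
}
```

Classes would then be resolved through `loader.loadClass(...)` and used via reflection, so that the main program never holds a static reference to a type from either TensorFlow version.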

According to the JNI spec, a native library should effectively be unloaded when its class loader is garbage collected. How do you verify that the previous loader has been collected?

@saudet, did you ever try @carlosuc3m's setup with JavaCPP before?

karllessard avatar Dec 21 '21 23:12 karllessard

Yes, JavaCPP can do that. I'm pretty sure TensorFlow can't be unloaded though...

saudet avatar Dec 22 '21 00:12 saudet

Also @carlosuc3m , between which versions do you need to switch?

karllessard avatar Dec 22 '21 00:12 karllessard

@carlosuc3m The JVM may not run GC fast enough, be sure to call System.gc() a few times, wait a couple of seconds, call System.gc() a couple more times, etc. You'll need to figure out what sequence works well enough for your case. It looks like we can unload TensorFlow, but it will end up creating memory leaks: https://discuss.tensorflow.org/t/dlopen-dlclose-tensorflow-so-cause-memory-leak/3639 If that is not alright for your application, please file a bug upstream about that: https://github.com/tensorflow/tensorflow/

saudet avatar Dec 23 '21 00:12 saudet
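One way to combine the two suggestions above (verifying that the loader was collected, and calling System.gc() several times with pauses) is to watch a WeakReference to the class loader. The following is only a sketch: `LoaderGcSketch` and `awaitCollection` are hypothetical names, the attempt count and pause are arbitrary, and System.gc() is merely a hint that the JVM may ignore.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

public class LoaderGcSketch {

    /**
     * Requests GC up to `attempts` times, pausing between requests.
     * Returns true as soon as the weak reference has been cleared, which
     * means the referent is no longer strongly reachable and was collected.
     */
    public static boolean awaitCollection(WeakReference<?> ref, int attempts, long pauseMillis) {
        for (int i = 0; i < attempts; i++) {
            if (ref.get() == null) {
                return true;
            }
            System.gc(); // only a request; the JVM decides when to collect
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder loader with no JARs; a real one would point at the TF JARs.
        URLClassLoader loader = new URLClassLoader(new URL[0]);
        WeakReference<ClassLoader> ref = new WeakReference<>(loader);
        loader.close();
        loader = null; // drop the last strong reference held by this method
        System.out.println("loader collected: " + awaitCollection(ref, 10, 200));
    }
}
```

If this keeps returning false for a real loader, something still holds a strong reference to it or to one of its classes (a running thread, a static field, a cached object), and the native libraries tied to that loader cannot be unloaded by the JVM.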

Thank you for your answer. Does the unloading depend on the specifications of the computer then?

@karllessard It should be able to switch between all the available TensorFlow 2 versions. Although, if there is complete backwards compatibility between the latest one and the others, it should not matter. Regards and thank you for your answer, Carlos

carlosuc3m avatar Jan 06 '22 12:01 carlosuc3m

Hello again, @saudet, do you know if that loading and unloading works for both Windows and Linux or only Windows, because I am having a hard time trying to get it to unload in Windows. I have even tried to dlopen and dlclose from C++ and I got the same results. The following program loads and unloads libtensorflow_framework.so.2 from TensorFlow Java 0.2.0 and then from 0.3.1:

#include <iostream>
#include <dlfcn.h>

int main() {

     // Load, then immediately unload, the library shipped with TF Java 0.2.0.
     void *handle = dlopen("/home/carlos/Documents/test/tensorflow_0.2.0_linux_cpu/tensorflow-core-api-0.2.0-linux-x86_64/org/tensorflow/internal/c_api/linux-x86_64/libtensorflow_framework.so.2", RTLD_NOW);
     if (handle == nullptr) {
       std::cout << "Failed to dlopen " << dlerror() << std::endl;
       return 1;  // never pass a null handle to dlclose
     }
     int rc = dlclose(handle);
     if (rc != 0) {
       std::cout << "Failed to dlclose " << dlerror() << std::endl;
     }
     std::cout << "Loaded first native library\n";

     // Repeat with the library shipped with TF Java 0.3.1.
     void *handle2 = dlopen("/home/carlos/Documents/test/tensorflow_0.3.3_linux_cpu/tensorflow-core-api-0.3.1-linux-x86_64/org/tensorflow/internal/c_api/linux-x86_64/libtensorflow_framework.so.2", RTLD_NOW);
     if (handle2 == nullptr) {
       std::cout << "Failed to dlopen " << dlerror() << std::endl;
       return 1;
     }
     int rc2 = dlclose(handle2);
     if (rc2 != 0) {
       std::cout << "Failed to dlclose " << dlerror() << std::endl;
     }
     std::cout << "End\n";

     return 0;
}

I get the following error: 2022-01-11 00:08:09.477288: F external/org_tensorflow/tensorflow/core/framework/variant_op_registry.cc:46] Check failed: existing == nullptr (0x5565bd59ec70 vs. nullptr)Unary VariantDecodeFn for type_name: tensorflow::data::WrappedDatasetVariant already registered

If I try to dlopen and dlclose libtensorflow_cc.so.2 instead, I get the following error:

Failed to dlopen /home/carlos/Documents/test/tensorflow_0.3.3_linux_cpu/tensorflow-core-api-0.3.1-linux-x86_64/org/tensorflow/internal/c_api/linux-x86_64/libtensorflow_cc.so.2: undefined symbol: _ZN10tensorflow4data19DatasetBaseIterator4SkipEPNS0_15IteratorContextEiPbPi

Again, note that this only happens on Linux. Am I missing something when unloading native libraries on Linux? Regards, Carlos

carlosuc3m avatar Jan 10 '22 23:01 carlosuc3m

... because I am having a hard time trying to get it to unload in Windows.

I think you meant Linux here. Interestingly, according to the spec, dlclose is not required to unload the library, even though the handle is invalidated.

My manual page on Mac OS X is a bit different and provides some additional details:

     dlclose() releases a reference to the dynamic library or bundle referenced by handle.  If the reference count drops to 0, the bundle is removed from the address space, and handle is rendered invalid.
     Just before removing a dynamic library or bundle in this way, any termination routines in it are called.  handle is the value returned by a previous call to dlopen.

     Prior to Mac OS X 10.5, only bundles could be unloaded.  Starting in Mac OS X 10.5, dynamic libraries may also be unloaded.  There are a couple of cases in which a dynamic library will never be
     unloaded: 1) the main executable links against it, 2) an API that does not support unloading (e.g. NSAddImage()) was used to load it or some other dynamic library that depends on it, 3) the dynamic
     library is in dyld's shared cache.

karllessard avatar Jan 10 '22 23:01 karllessard

Yes, sorry, I meant Linux. So do you see any plausible solution? Or a way to track whether the native library is loaded or not? Thank you for your answer. Regards, Carlos

carlosuc3m avatar Jan 11 '22 12:01 carlosuc3m
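On the question of tracking whether a native library is still loaded: on Linux, one possible approach is to look for the library's file name in /proc/self/maps, which lists every file mapped into the current process. The sketch below is an illustration only and was not proposed in this thread; `MappedLibrarySketch`, `mappedFilesContaining`, and `isMapped` are hypothetical names, and the approach does not apply to Windows or macOS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MappedLibrarySketch {

    /**
     * Extracts the distinct mapped file paths whose name contains `fragment`.
     * Each maps line has the form: address perms offset dev inode [pathname].
     * Anonymous mappings have no pathname and are skipped.
     */
    public static List<String> mappedFilesContaining(List<String> mapsLines, String fragment) {
        Set<String> paths = new LinkedHashSet<>();
        for (String line : mapsLines) {
            String[] fields = line.trim().split("\\s+", 6);
            if (fields.length == 6 && fields[5].contains(fragment)) {
                paths.add(fields[5]);
            }
        }
        return new ArrayList<>(paths);
    }

    /** Linux only: true if a file whose path contains `fragment` is mapped into this JVM. */
    public static boolean isMapped(String fragment) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/self/maps"));
        return !mappedFilesContaining(lines, fragment).isEmpty();
    }

    public static void main(String[] args) throws IOException {
        String fragment = args.length > 0 ? args[0] : "libtensorflow";
        System.out.println(fragment + " mapped: " + isMapped(fragment));
    }
}
```

Calling `isMapped("libtensorflow_framework")` before and after dropping the class loader would show whether the shared object actually left the process, independently of what dlclose returned.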

Hello again, sorry for all the inconvenience, I have another question. Until now I have been perfectly able to load and unload the TensorFlow DLLs, such as tensorflow_cc.dll, on Windows. However, I cannot do the same with the JNI DLL, jnitensorflow.dll, which even blocks the native TensorFlow DLLs. Do you know how to solve that? @saudet, were you able to unload that .dll?

carlosuc3m avatar Jan 11 '22 19:01 carlosuc3m

Hello, it's me again. I have just figured out how to load and unload the JNI library on Windows thanks to @saudet's new feature in JavaCPP https://github.com/bytedeco/javacpp/commit/3c9d5999b68367a20b9fd51a0c6965448aab6c61 by calling Pointer.interruptDeallocatorThread(). For that I am using the not-yet-released JavaCPP JAR files from the 7th of January. However, unfortunately, I am still stuck with the Linux issue.

carlosuc3m avatar Jan 11 '22 21:01 carlosuc3m