
Allowing unloading the native libraries

Open carlosuc3m opened this issue 3 years ago • 16 comments

Description

I am working on an application that allows switching between different PyTorch versions dynamically. To do so, I dynamically load the JARs needed for each particular version in a child classloader of the main classloader. However, I am not able to switch between versions because the child classloader is never garbage collected, so the native library is never unloaded, and loading two native libraries, even of different versions, causes errors. Can you think of any workaround to tackle this issue?

carlosuc3m avatar Dec 16 '21 21:12 carlosuc3m

You have to use a custom ClassLoader to load DJL (more specifically, the Engine class), and you have to make sure the DJL classes are not loaded by the system classloader.

Once the custom ClassLoader object is garbage collected, you can load a different version of the engine.

frankfliu avatar Dec 17 '21 03:12 frankfliu

How do I avoid loading by the system ClassLoader? As far as I understand, DJL loads the engines in the Thread ClassLoader: https://github.com/deepjavalibrary/djl/blob/3fce3aa58dc252b1b33efae04c6b6e37df1ba1a9/api/src/main/java/ai/djl/engine/Engine.java#L61

carlosuc3m avatar Dec 17 '21 17:12 carlosuc3m

The following piece of code is an example of dynamically loading the framework, and it shows how I am not able to garbage collect the new classloader:

// To check which native libraries have been loaded
Field LIBRARIES = ClassLoader.class.getDeclaredField("loadedLibraryNames");
LIBRARIES.setAccessible(true);
final Vector<String> libraries = (Vector<String>) LIBRARIES.get(Thread.currentThread().getContextClassLoader());
// Original ClassLoader
ClassLoader ogCl = Thread.currentThread().getContextClassLoader();
// Load JARs to new classloader
URL[] urls = new URL[new File(jarsDirectory).listFiles().length];
int c = 0;
for (File ff : new File(jarsDirectory).listFiles()) {
	urls[c ++] = ff.toURI().toURL();
}
URLClassLoader engineClassloader = new URLClassLoader(urls, null);
// Set the new ClassLoader as Thread ClassLoader
Thread.currentThread().setContextClassLoader(engineClassloader);
// Execute a simple command
Class<?> clM = engineClassloader.loadClass("ai.djl.ndarray.NDManager");
Method mm = clM.getMethod("newBaseManager");
Object manager = mm.invoke(null);
// Delete references to every object in the ClassLoader
clM = null;
mm = null;
manager = null;
// Set ClassLoader back
Thread.currentThread().setContextClassLoader(ogCl);
engineClassloader = null;
// Call Garbage collector
System.gc();
// Check loaded Native libraries, which are not the same
// as the original ones, Pytorch is still loaded
final Vector<String> libraries2 = (Vector<String>) LIBRARIES.get(Thread.currentThread().getContextClassLoader());

What do you think? My only idea currently is to work around the DJL code and load the native library from a classloader that is not the Thread classloader, doing something like the following:

URLClassLoader engineClassloader = new URLClassLoader(urls, ogCl);
	    
Class<?> enginePt = engineClassloader.loadClass("ai.djl.pytorch.engine.PtEngine");
//Object enginePt = engineCl..newInstance();

Class<?> engineCl = engineClassloader.loadClass("ai.djl.engine.EngineProvider");
ServiceLoader<?> loaders = ServiceLoader.load(engineCl, engineClassloader);
Method mm = engineCl.getMethod("getEngine");
Object engine = null;
for (Object ll : loaders) {
    try {
        engine = mm.invoke(ll);
    } catch (Exception ex) {
    }
}

Note that getEngine also looks at the resources loaded in the Thread classloader. I really don't know if I am missing something, so thank you very much for your time.

carlosuc3m avatar Dec 17 '21 19:12 carlosuc3m

@carlosuc3m Your solution won't work:

  1. By default, URLClassLoader delegates to the system classloader first; if the jar file is in the classpath, the system ClassLoader will always kick in. To prevent this from happening, you have to implement your own ClassLoader and use it to load all DJL classes, not just Engine (an NDManager.class loaded by the system classloader may not work with an Engine.class loaded by your ClassLoader).
  2. System.gc() may not kick in immediately; there is no guarantee that the ClassLoader will be garbage collected after this call.

You might want to consider using OSGi (might be overkill) for your use case.
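
A minimal sketch of the custom ClassLoader idea from point 1, assuming that a child-first delegation order is what is needed here. The class name ChildFirstClassLoader is hypothetical and not part of DJL; it looks in its own URLs before asking the parent, so classes from the given JARs win even if copies are also on the classpath:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical helper, not part of DJL: a URLClassLoader that tries its own
// URLs first and only falls back to the parent when the class is not found.
public class ChildFirstClassLoader extends URLClassLoader {

    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Reuse the class if this loader has already defined it.
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Child first: search the JARs given to this loader.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not in our JARs (e.g. JDK classes): use normal delegation.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

It would be created with something like new ChildFirstClassLoader(urls, ClassLoader.getSystemClassLoader()). Whether the loader can later be garbage collected still depends on no references to it or its classes remaining.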

frankfliu avatar Dec 18 '21 00:12 frankfliu

@carlosuc3m Just out of curiosity, what's the use case where you need to run multiple PyTorch versions?

frankfliu avatar Dec 18 '21 00:12 frankfliu

Thank you for your answer, @frankfliu. The JAR files corresponding to DJL are all in a directory that is not in the classpath, so in theory they should not be loaded by the system ClassLoader, should they? Do I have to call customClassLoader.loadClass("ai.djl.ndarray.NDManager") for every class in all the JAR files? And still, how do you work around the calls to Thread.currentThread().getContextClassLoader() that happen when loading the engine in: https://github.com/deepjavalibrary/djl/blob/3fce3aa58dc252b1b33efae04c6b6e37df1ba1a9/api/src/main/java/ai/djl/engine/Engine.java#L62 and https://github.com/deepjavalibrary/djl/blob/3fce3aa58dc252b1b33efae04c6b6e37df1ba1a9/api/src/main/java/ai/djl/util/Platform.java#L62

I am developing an application that is able to load pretrained Deep Learning models. For that, it should be able to switch between Deep Learning engines dynamically depending on the model selected. The plugin is aimed at users not familiar at all with Deep Learning or even programming, which is why all of this should happen in the backend without the user knowing. Regards, Carlos

carlosuc3m avatar Dec 18 '21 15:12 carlosuc3m

@carlosuc3m If the whole application (including the DJL jars) is not in the classpath, it should work, but you need to explicitly set the contextClassLoader. I created a test application, which loads jars from the DJL examples module:

  1. modify examples/build.gradle to set tasks.distZip.enabled = true
  2. build example jars
cd examples
./gradlew dZ
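unzip build/distributions/examples-0.15.0-SNAPSHOT.zip

import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

public final class ClassLoaderTest {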

    private ClassLoaderTest() {
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("examples/examples-0.15.0-SNAPSHOT/lib");
        URL[] urls = Files.list(path).map(p -> {
                    try {
                        if (p.toString().endsWith(".jar")) {
                            return p.toUri().toURL();
                        }
                    } catch (IOException e) {
                        return null;
                    }
                    return null;
                }
        ).filter(Objects::nonNull).toArray(URL[]::new);

        test(urls);

        for (int i = 0;i < 10; ++i) {
            System.gc();
            Thread.sleep(1000);
        }

        test(urls);
    }

    public static void test(URL[] urls) throws ReflectiveOperationException {
        URLClassLoader cl = new URLClassLoader(urls);

        Thread.currentThread().setContextClassLoader(cl);
        Class<?> clazz = cl.loadClass("ai.djl.examples.inference.ObjectDetection");
        Method method = clazz.getDeclaredMethod("predict");
        method.invoke(null);
        Thread.currentThread().setContextClassLoader(null);
    }
}
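
Since System.gc() is only a hint, the fixed loop of gc calls above cannot tell whether the classloader was actually collected. A minimal sketch of one way to check, using a WeakReference to the loader; GcProbe and awaitCollected are hypothetical names, not DJL API:

```java
import java.lang.ref.WeakReference;

// Hypothetical helper: polls a weak reference while requesting GC, so the
// caller can see whether the referent (e.g. a ClassLoader) was collected.
public final class GcProbe {

    private GcProbe() {
    }

    public static boolean awaitCollected(WeakReference<?> ref, int attempts, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc();               // a request only; the JVM may ignore it
            Thread.sleep(sleepMillis); // give the collector some time to run
        }
        // true only if the weak reference has been cleared
        return ref.get() == null;
    }
}
```

In the test above, test(urls) could return new WeakReference<>(cl) and main could pass it to awaitCollected before the second call. A false result means something still holds the loader or the JVM has not collected it yet, and in that case its native libraries have not been unloaded either.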

frankfliu avatar Dec 18 '21 17:12 frankfliu

Yes, in that case it works, as it loads the same set of JARs twice, so the native libraries of both classloaders coincide. However, if the classloaders need to load two different native libraries, an error will appear.

carlosuc3m avatar Dec 20 '21 02:12 carlosuc3m

@carlosuc3m

Based on my test, the native library is unloaded and reloaded successfully. However, the inference failed when I tried PyTorch 1.10.0 and 1.9.1:

libc++abi: terminating with uncaught exception of type c10::Error: Tried to register multiple backend fallbacks for the same dispatch key AutogradCUDA; previous registration registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016, new registration registered at ../aten/src/ATen/ConjugateFallback.cpp:18

It looks like PyTorch cannot unload the shared library cleanly. MXNet seems to work fine. I don't think this can be resolved by the ClassLoader.

frankfliu avatar Dec 20 '21 05:12 frankfliu

OK, thank you for your time @frankfliu! The problematic native file seemed to be libtorch_cpu.so, and it seems that loading two of them is not possible, as per https://github.com/pytorch/pytorch/issues/70191. Regards, and thank you for your time.

carlosuc3m avatar Dec 20 '21 22:12 carlosuc3m

@carlosuc3m JavaCPP implements a hack to allow this to work, so your use case works with the JavaCPP Presets for PyTorch: https://github.com/bytedeco/javacpp-presets/tree/master/pytorch

@frankfliu Please consider doing something like JavaCPP to accommodate users of containers like Tomcat, OSGi, etc.

saudet avatar Dec 29 '21 10:12 saudet

Hello again @frankfliu, I am still working on this issue. Do you know the order in which the .so files are loaded on Linux? On Windows it is specified in the ai.djl.pytorch.jni.LibUtils class code, but on Linux it seems that it only loads libdjl_torch.so. Does this native library load the rest of the code? Regards, Carlos

carlosuc3m avatar Jan 09 '22 17:01 carlosuc3m

@carlosuc3m For PyTorch 1.9.1 and earlier, you just need to load the libdjl_torch.so file (you must put this file in the same folder as libtorch.so).

In PyTorch 1.10.0, we manually load the .so files in the following order:

  1. All files whose names do not contain "torch", "caffe2", or "cudnn"
  2. PyTorch-specific .so files in the following order:
    • libfbgemm
    • libcaffe2_nvrtc
    • libtorch_cpu
    • libc10_cuda
    • libtorch_cuda_cpp
    • libtorch_cuda_cu
    • libtorch_cuda
    • libtorch
    • libdjl_torch

see: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/src/main/java/ai/djl/pytorch/jni/LibUtils.java#L94
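
The ordering described above can be sketched as a small sorting routine. This is only an illustration under stated assumptions, not the LibUtils code: LoadOrderSketch, baseName and loadOrder are hypothetical names; files are matched by the part of the name before ".so"; the libraries named in step 2 are kept out of step 1 even when their names contain none of the three substrings; and files that contain one of the substrings but are not in the step 2 list are left out, since the description does not say when they are loaded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the two-step load order; not the DJL implementation.
public final class LoadOrderSketch {

    // Step 2 order, copied from the list in the post.
    private static final List<String> ORDERED = Arrays.asList(
            "libfbgemm", "libcaffe2_nvrtc", "libtorch_cpu", "libc10_cuda",
            "libtorch_cuda_cpp", "libtorch_cuda_cu", "libtorch_cuda",
            "libtorch", "libdjl_torch");

    private LoadOrderSketch() {
    }

    // Assumption: "libtorch_cpu.so" and "libgomp.so.1" map to the text before ".so".
    static String baseName(String fileName) {
        int idx = fileName.indexOf(".so");
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    public static List<String> loadOrder(List<String> fileNames) {
        List<String> result = new ArrayList<>();
        // Step 1: everything that is not a torch/caffe2/cudnn file
        // and is not explicitly ordered in step 2.
        for (String f : fileNames) {
            boolean special = f.contains("torch") || f.contains("caffe2") || f.contains("cudnn");
            if (!special && !ORDERED.contains(baseName(f))) {
                result.add(f);
            }
        }
        // Step 2: the PyTorch-specific libraries, in the listed order.
        for (String name : ORDERED) {
            for (String f : fileNames) {
                if (baseName(f).equals(name)) {
                    result.add(f);
                }
            }
        }
        return result;
    }
}
```

Each returned file name would then be passed, as an absolute path, to System.load in that order.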

frankfliu avatar Jan 09 '22 18:01 frankfliu

Hello again, I am still working on this. I have observed that on Linux some native libraries are impossible to unload. However, on Windows the library that causes the problem is the JNI DLL. Do you think it can be solved with any workaround? I've tried using reflection to force the unloading of native libraries, but I would like to avoid that. I also created a question on Stack Overflow: https://stackoverflow.com/questions/70682562/jni-native-library-avoids-garbage-collection-and-unloading Thank you for your time, Carlos

carlosuc3m avatar Jan 12 '22 14:01 carlosuc3m

@carlosuc3m I don't really know why it's not unloaded. Maybe you can use C++ code to try loading and unloading the shared library to see what happens.

frankfliu avatar Jan 13 '22 22:01 frankfliu

Yes, I have already tried dlopen and dlclose, and it did not work.

carlosuc3m avatar Jan 13 '22 22:01 carlosuc3m

Closing this issue for now since there isn't much we can do. Feel free to reopen this issue if you have new ideas.

frankfliu avatar Dec 28 '22 17:12 frankfliu