djl
pytorch-engine:0.18.0 causes memory leak when using NDManager.newBaseManager()
Description
- With pytorch-engine:0.18.0, when NDManager.newBaseManager() creates a PtNDManager, it calls ai.djl.pytorch.engine.PtNDManager#newSubManager, which executes:
PtNDManager manager = new PtNDManager(this, device);
attachUncappedInternal(manager.uid, manager);
return manager;
- The method attachUncappedInternal is implemented by BaseNDManager and attaches the created PtNDManager to its resources field:
resources.put(resourceId, resource);
- The created PtNDManager is never released, even after it is closed:
public void close() {
if (!closed.getAndSet(true)) {
// ignore some code
parent.detachInternal(uid);
resources.clear();
tempResources.clear();
}
}
The parent is PtNDManager$SystemManager, and the parent's detachInternal does nothing:
@Override
public void detachInternal(String resourceId) {}
So in the end, the created PtNDManager stays referenced by the SystemManager's resources map and is never collected by the JVM GC.
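The chain described above can be reproduced without DJL. The following is a minimal sketch, assuming simplified stand-in classes: Manager, SystemManager, and leakedAfter are hypothetical names for illustration and are not the real DJL source. It only models the put-on-attach and no-op-detach behavior described in this issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for the classes discussed in this issue; not DJL code.
class LeakSketch {

    /** Models BaseNDManager: attached resources are kept in a map. */
    static class Manager implements AutoCloseable {
        final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();
        final AtomicBoolean closed = new AtomicBoolean(false);
        final Manager parent;
        final String uid;

        Manager(Manager parent, String uid) {
            this.parent = parent;
            this.uid = uid;
        }

        void attachUncappedInternal(String resourceId, AutoCloseable resource) {
            resources.put(resourceId, resource);
        }

        void detachInternal(String resourceId) {
            resources.remove(resourceId);
        }

        /** Mirrors the 0.18.0 newSubManager quoted above. */
        Manager newSubManager(String childUid) {
            Manager manager = new Manager(this, childUid);
            attachUncappedInternal(manager.uid, manager);
            return manager;
        }

        @Override
        public void close() {
            if (!closed.getAndSet(true)) {
                if (parent != null) {
                    parent.detachInternal(uid);
                }
                resources.clear();
            }
        }
    }

    /** Models PtNDManager$SystemManager: detachInternal is a no-op. */
    static class SystemManager extends Manager {
        SystemManager() {
            super(null, "system");
        }

        @Override
        void detachInternal(String resourceId) {}
    }

    /** Opens and closes sub-managers, then reports how many the parent still holds. */
    static int leakedAfter(int iterations) {
        SystemManager system = new SystemManager();
        for (int i = 0; i < iterations; i++) {
            try (Manager manager = system.newSubManager("m" + i)) {
                // do something here
            }
        }
        return system.resources.size();
    }

    public static void main(String[] args) {
        // Every closed sub-manager is still strongly reachable from the parent.
        System.out.println(leakedAfter(1000)); // prints 1000
    }
}
```

Because each closed manager remains a value in the parent's map, it stays strongly reachable, which is why the garbage collector cannot reclaim it.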
- Downgrading pytorch-engine to version 0.17.0 solves the problem, because there newSubManager calls PtNDManager$SystemManager#attachInternal, which does nothing:
PtNDManager manager = new PtNDManager(this, device);
attachInternal(manager.uid, manager);
return manager;
@Override
public void attachInternal(String resourceId, AutoCloseable resource) {}
Expected Behavior
The SystemManager should either not attach the created PtNDManager to its resources field, or release the PtNDManager when it is closed.
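As an illustration of the second option, here is a minimal sketch in which the parent's detachInternal actually removes the entry. The classes SystemManager, SubManager, and the method trackedAfter are hypothetical stand-ins, and this is not necessarily how the DJL maintainers resolved the issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for illustration only; not DJL code.
class ExpectedBehaviorSketch {

    /** Parent that tracks sub-managers and forgets them on detach. */
    static class SystemManager {
        final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();

        void attachUncappedInternal(String resourceId, AutoCloseable resource) {
            resources.put(resourceId, resource);
        }

        // Unlike the no-op quoted in this issue, this removes the reference.
        void detachInternal(String resourceId) {
            resources.remove(resourceId);
        }

        SubManager newSubManager(String uid) {
            SubManager manager = new SubManager(this, uid);
            attachUncappedInternal(manager.uid, manager);
            return manager;
        }
    }

    /** Child that detaches itself from the parent exactly once on close. */
    static class SubManager implements AutoCloseable {
        final SystemManager parent;
        final String uid;
        final AtomicBoolean closed = new AtomicBoolean(false);

        SubManager(SystemManager parent, String uid) {
            this.parent = parent;
            this.uid = uid;
        }

        @Override
        public void close() {
            if (!closed.getAndSet(true)) {
                parent.detachInternal(uid);
            }
        }
    }

    /** Opens and closes sub-managers, then reports how many the parent still holds. */
    static int trackedAfter(int iterations) {
        SystemManager system = new SystemManager();
        for (int i = 0; i < iterations; i++) {
            try (SubManager manager = system.newSubManager("m" + i)) {
                // do something here
            }
        }
        return system.resources.size();
    }

    public static void main(String[] args) {
        System.out.println(trackedAfter(1000)); // prints 0
    }
}
```

With the entry removed on close, nothing in the parent keeps the closed manager reachable, so it becomes eligible for garbage collection.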
Error Message
[image]
How to Reproduce?
- Use pytorch-engine version 0.18.0.
- Execute the code below repeatedly; it will eventually cause an OOM error.
try (NDManager manager = NDManager.newBaseManager(Device.cpu())) {
// do something here
}
- Maven dependencies:
<dependency>
<groupId>ai.djl.pytorch</groupId>
<artifactId>pytorch-engine</artifactId>
<version>0.18.0</version>
<exclusions>
<exclusion>
<artifactId>jna</artifactId>
<groupId>net.java.dev.jna</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>net.java.dev.jna</groupId>
<artifactId>jna</artifactId>
<version>5.9.0</version>
</dependency>
<!--For Pre-CXX11 build -->
<dependency>
<groupId>ai.djl.pytorch</groupId>
<artifactId>pytorch-native-cpu-precxx11</artifactId>
<classifier>linux-x86_64</classifier>
<version>1.11.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>ai.djl.pytorch</groupId>
<artifactId>pytorch-jni</artifactId>
<version>1.11.0-0.18.0</version>
<scope>runtime</scope>
</dependency>
<!-- windows -->
<dependency>
<groupId>ai.djl.pytorch</groupId>
<artifactId>pytorch-native-cpu</artifactId>
<classifier>win-x86_64</classifier>
<scope>runtime</scope>
<version>1.11.0</version>
</dependency>
<dependency>
<groupId>ai.djl.pytorch</groupId>
<artifactId>pytorch-jni</artifactId>
<version>1.11.0-0.18.0</version>
<scope>runtime</scope>
</dependency>
Thanks for contributing a fix; we will track this.