lightgbm4j
LGBM_BoosterPredictForMatSingleRow crashes the JVM with EXCEPTION_ACCESS_VIOLATION
Training works fine now, but the application crashes outside the JVM at inference time with EXCEPTION_ACCESS_VIOLATION (0xc0000005) when calling LGBM_BoosterPredictForMatSingleRow.
System/Java version:
JRE version: Java(TM) SE Runtime Environment (17.0.1+12) (build 17.0.1+12-LTS-39)
Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.1+12-LTS-39, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)
Error: EXCEPTION_ACCESS_VIOLATION (0xc0000005), data execution prevention violation at address 0x000000000000000f
Details (log attached) hs_err_pid55304.log :
--------------- T H R E A D ---------------
Current thread (0x000002d6fc0f5c90): JavaThread "Thread-4" [_thread_in_native, id=54640, stack(0x000000e7ab280000,0x000000e7ab300000)]
Stack: [0x000000e7ab280000,0x000000e7ab300000], sp=0x000000e7ab2fe138, free space=504k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C 0x000000000000000f
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_BoosterPredictForMatSingleRow(JJIIIIIILjava/lang/String;JJ)I+0
j com.microsoft.ml.lightgbm.lightgbmlib.LGBM_BoosterPredictForMatSingleRow(Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;IIIIIILjava/lang/String;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_long_long;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_double;)I+30
j io.github.metarank.lightgbm4j.LGBMBooster.predictForMatSingleRow([FLcom/microsoft/ml/lightgbm/PredictionType;)D+88
j weka.classifiers.functions.LGBMClassifier.classifyInstance(Lweka/core/Instance;)D+49
j knowledgeConnector.externalMLMethods.WekaClassifierAdapter.classifyInstance(Lweka/classifiers/Classifier;Lweka/core/Instance;ZI)D+107
j util.TrainEvaluation.evaluateClassificationQuality(Ljava/util/Vector;ZLweka/classifiers/Classifier;Lweka/core/Instances;IZZZ)Ljava/util/Map;+1232
j control.MFNController.evaluateClones(Ljava/util/List;Lweka/core/Instances;Ljava/util/Vector;ZZLjava/util/List;ZZ)[D+402
j control.MFNController.evaluateClones(Ljava/util/List;Lweka/core/Instances;Ljava/util/Vector;ZZLjava/util/List;Z)[D+13
j control.MFNController.generatePQReportWekaClassifier(Ljava/lang/String;Ljava/util/Vector;Ljava/util/Map;Ljava/lang/String;ZZZIZZLjava/util/List;)I+1710
j control.ParameterSpaceExplorationAgent.run()V+9109
v ~StubRoutines::call_stub
siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), data execution prevention violation at address 0x000000000000000f
Hi @jornd13, kudos for the complete hs_err log! There are some strange things I observe there:
- we have a CI with matrix tests to validate that the library works on Mac/Linux/Windows and on JDK 11/17/21. The testsuite run over windows + JDK17 is green: https://github.com/metarank/lightgbm4j/actions/runs/7103556567
- I see your classpath is full of dependencies:
java_class_path (initial): .;../l/deeplearning4j-core-1.0.0-M2.jar;../l/javax.activation-1.2.0.jar;../l/deeplearning4j-datasets-1.0.0-M2.jar;../l/deeplearning4j-datavec-iterators-1.0.0-M2.jar;../l/deeplearning4j-modelimport-1.0.0-M2.jar;../l/gson-2.8.0.jar;../l/hdf5-platform-1.12.1-1.5.7.jar;../l/hdf5-1.12.1-1.5.7.jar;../l/hdf5-1.12.1-1.5.7-windows-x86.jar;../l/hdf5-1.12.1-1.5.7-windows-x86_64.jar;../l/slf4j-api-1.7.21.jar;../l/deeplearning4j-nn-1.0.0-M2.jar;../l/deeplearning4j-utility-iterators-1.0.0-M2.jar;../l/nd4j-common-1.0.0-M2.jar;../l/guava-1.0.0-M2.jar;../l/fastutil-6.5.7.jar;../l/commons-io-2.7.jar;../l/commons-compress-1.21.jar;../l/nd4j-api-1.0.0-M2.jar;../l/byteunits-0.9.1.jar;../l/commons-collections4-4.1.jar;../l/flatbuffers-java-1.12.0.jar;../l/protobuf-1.0.0-M2.jar;../l/commons-net-3.1.jar;../l/neoitertools-1.0.0.jar;../l/commons-lang3-3.6.jar;../l/jackson-1.0.0-M2.jar;../l/datavec-api-1.0.0-M2.jar;../l/freemarker-2.3.23.jar;../l/stream-2.9.8.jar;../l/opencsv-2.3.jar;../l/t-digest-3.2.jar;../l/datavec-data-image-1.0.0-M2.jar;../l/jai-imageio-core-1.3.0.jar;../l/imageio-jpeg-3.1.1.jar;../l/imageio-core-3.1.1.jar;../l/imageio-metadata-3.1.1.jar;../l/common-lang-3.1.1.jar;../l/common-io-3.1.1.jar;../l/common-image-3.1.1.jar;../l/imageio-tiff-3.1.1.jar;../l/imageio-psd-3.1.1.jar;../l/imageio-bmp-3.1.1.jar;../l/javacv-1.5.7.jar;../l/openblas-0.3.19-1.5.7.jar;../l/ffmpeg-5.0-1.5.7.jar;../l/flycapture-2.13.3.31-1.5.7.jar;../l/libdc1394-2.2.6-1.5.7.jar;../l/libfreenect-0.5.7-1.5.7.jar;../l/libfreenect2-0.2.0-1.5.7.jar;../l/librealsense-1.12.4-1.5.7.jar;../l/librealsense2-2.50.0-1.5.7.jar;../l/videoinput-0.200-1.5.7.jar;../l/artoolkitplus-2.3.1-1.5.7.jar;../l/flandmark-1.07-1.5.7.jar;../l/leptonica-1.82.0-1.5.7.jar;../l/tesseract-5.0.1-1.5.7.jar;../l/openblas-platform-0.3.19-1.5.7.jar;../l/openblas-0.3.19-1.5.7-windows-x86.jar;../l/openblas-0.3.19-1.5.7-windows-x86_64.jar;../l/leptonica-platform-1.82.0-1.5.7.jar;../l/leptonica-1.82.0-1.5.7-windows-x86.jar;
- I've often seen such issues when multiple JNI libraries that use different msvcrt runtimes are loaded at the same time. You have one that looks a bit suspicious:
0x000000006b3c0000 - 0x000000006b993000 C:\Users\joern\AppData\Local\Temp\jniloader3739708092842027418netlib-native_ref-win-x86_64.dll
- So I guess it is again related to one of your dependencies, but I can't say which one without access to the code.
It would be great if you could make a reproducer for this case that can be [semi-]publicly shared so I can take a look.
Thank you! I dug in some more and the issue revolves around this warning that is thrown before the JVM crashes: "WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK"
Somehow the wrong dll gets loaded or is missing entirely. If this rings a bell please let me know.
I'm not even sure what triggers loading of this native linalg library. It's not in my Java code I suppose. Training runs just fine. Inference is invoked in this simple snippet:
public double classifyInstance(Instance instance) throws Exception {
    int nFeat = instance.numAttributes() - 1;
    float[] input = new float[nFeat];
    for (int i = 0; i < nFeat; i++) {
        input[i] = (float) instance.value(i);
    }
    //double[] predArr = booster.predictForMat(input, 1, nFeat, true, PredictionType.C_API_PREDICT_NORMAL);
    double pred = booster.predictForMatSingleRow(input, PredictionType.C_API_PREDICT_NORMAL);
    //double pred = booster.predictForMatSingleRow(input);
    return pred;
}
The Java package com.github.fommil.netlib is not even required by metarank, is it? (not being imported)
Even if I include it in my POM via
... I get the same error at runtime. It is an old library btw. There is a newer version by dev.ludovic.netlib, but it does not solve the issue either.
The key to debugging this is knowing what makes it want that library at runtime in the first place.
Apr 28, 2024 10:14:15 AM com.github.fommil.netlib.ARPACK
I will keep searching nevertheless, but appreciate any further hints! Thanks a lot.
The problem does not seem to be related to the ARPACK libs, but revolves around the MSVCP140.dll being used at runtime (see below). I don't have experience with JNI and it seems to be some compatibility issue like you indicated. Is there any documentation on JNI usage for LightGBM and requirements in terms of dll versions etc.?
A fatal error has been detected by the Java Runtime Environment:
EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000007feee273278, pid=40024, tid=0x000000000000b568
JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode windows-amd64 compressed oops)
Problematic frame: C [MSVCP140.dll+0x13278]
(the same issue arises when using Java 11 like in the previous post)
The main problem is that we don't build the win64 binaries for lightgbm.dll ourselves, but just bundle the ones made by the upstream LightGBM project. To me it looks like some sort of compatibility issue between the ARPACK and LightGBM DLLs, built for different versions of MSVCP.
Thank you. It's definitely extremely hard to get to the bottom of this. In the meantime I have busted a whole bunch of prior assumptions:
- it is not related to the NativeSystemARPACK lib, the netlib-native_ref-win-x86_64 or any of these dlls. Even when they are not present, it crashes at the same position (see the native libs extracted at runtime below *)).
- at training and inference time the loaded native dlls are exactly the same, so that cannot be the issue.
- the Java version for that matter is not the problem (both Java 8 and 11 fail identically)
- it is not even related to the particular method call booster.predictForMat(...); the same call works just fine if I make it at training time, earlier in the application flow.
To me it means that somehow the system/JVM state changes in non-trivial ways in between training and inference such that the same call then eventually crashes the JVM. Very odd indeed.
Does any of this help you to narrow down further? Thanks a lot!
*)
I:\Programme\Java\jdk1.8.0_151\jre\bin\zip.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\awt.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\fontmanager.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\net.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\nio.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\t2k.dll
C:\Users*\AppData\Local\Temp\lib_lightgbm.dll
C:\Users*\AppData\Local\Temp\lib_lightgbm_swig.dll
It turns out that this issue is completely unrelated to incompatibility of native libs.
Instead, I think I discovered a major issue with the lightgbm4j API. Instead of throwing a useful exception, the entire JVM crashes when inference (predictForMatSingleRow(...)) is invoked on a booster that has already been closed (booster.close()).
@shuttie I'd suggest to rework that part of the API accordingly. It's hard to imagine that this has not been an issue for others so far - maybe they simply remained quiet about it.
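For illustration, the kind of guard suggested above could look roughly like the sketch below. This is not lightgbm4j's actual code or the fix that was eventually merged: `GuardedBooster` and `nativePredict` are hypothetical names, and the "native" call is a plain Java placeholder so the example runs without the library.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a JNI-backed booster; NOT lightgbm4j's real class. */
class GuardedBooster implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    /** Placeholder for the native prediction call (assumption: just sums the row). */
    private double nativePredict(float[] row) {
        double sum = 0.0;
        for (float v : row) {
            sum += v;
        }
        return sum;
    }

    /** Checks the closed flag before touching the (pretend) native handle. */
    public double predictForMatSingleRow(float[] row) {
        if (closed.get()) {
            throw new IllegalStateException("booster is closed");
        }
        return nativePredict(row);
    }

    @Override
    public void close() {
        // compareAndSet keeps close() idempotent: the release branch runs at most once.
        if (closed.compareAndSet(false, true)) {
            // A real wrapper would free the native handle here.
        }
    }

    public static void main(String[] args) {
        GuardedBooster booster = new GuardedBooster();
        System.out.println("before close: " + booster.predictForMatSingleRow(new float[] {1f, 2f}));
        booster.close();
        try {
            booster.predictForMatSingleRow(new float[] {1f, 2f});
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, use-after-close surfaces as an ordinary Java exception with a stack trace pointing at the caller, rather than an access violation in native code. The sketch does not protect against one thread closing the booster while another is in the middle of a prediction; that would need additional synchronization.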
Can you please make a clean reproducer for the issue?
- it can be shared in public - so I can reproduce the issue locally. As for now you're the only one having access to the problematic code.
- it has no dependencies on private packages, and optionally should have minimal amount of these. In a perfect case it should only depend on LightGBM4j.
- it doesn't require any private datasets.
I am not sure how to do that exactly. It's probably not necessary either, since you can take your own snippet code from the main wikipage and append 2 simple lines at the end, and it will crash on Windows - no dependency on data or libraries. I have done this for you:
LGBMDataset train = LGBMDataset.createFromFile("cancer.csv", "header=true label=name:Classification", null);
LGBMDataset test = LGBMDataset.createFromFile("cancer-test.csv", "header=true label=name:Classification", train);
LGBMBooster booster = LGBMBooster.create(train, "objective=binary label=name:Classification");
booster.addValidData(test);

for (int i = 0; i < 10; i++) {
    booster.updateOneIter();
    double[] evalTrain = booster.getEval(0);
    double[] evalTest = booster.getEval(1);
    System.out.println("train: " + evalTrain[0] + " test: " + evalTest[0]);
}
booster.close();

/** added lines to trigger inference in a trivial way */
float[] input = new float[2];
double pred = booster.predictForMatSingleRow(input, PredictionType.C_API_PREDICT_NORMAL);
@jornd13
- the JVM crash with use-after-close is fixed in https://github.com/metarank/lightgbm4j/pull/82
- there's also a fix https://github.com/metarank/lightgbm4j/pull/86 which may affect your ACCESS_VIOLATION issue - there might be a stale lib_lightgbm_swig.dll living in your tmp folder.
- bumped to the upstream 4.3.0 version.
So please try the new version, and report if it fixes your issue.
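To check for leftover copies of the native libraries before retrying, something like the following sketch can list them. It assumes the DLLs are extracted into the directory named by the `java.io.tmpdir` system property and that their names start with `lib_lightgbm`, as in the paths quoted earlier in this thread; `StaleNativeLibs` and `find` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists files in a directory whose names match a glob. */
class StaleNativeLibs {
    static List<Path> find(Path dir, String glob) throws IOException {
        List<Path> hits = new ArrayList<>();
        // The glob is matched against file names only, not the full path.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the native libs are unpacked into the JVM's temp directory.
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        for (Path p : find(tmp, "lib_lightgbm*")) {
            System.out.println(p + "  last modified: " + Files.getLastModifiedTime(p));
        }
    }
}
```

A last-modified time that predates the library upgrade would hint that an old copy is still lying around. Deleting such files is only safe while no JVM that uses them is running.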
Thanks a lot! Yes, use-after-close is fixed now. Closing the issue.