SynapseML
mmlspark.lightgbm.LightGBMRegressor crashes when numIterations is high
Describe the bug
The program crashes when numIterations gets higher.
When I set numIterations=100, the training works, but I already see a warning message at stage 13:
[LightGBM] [Info] Connected to rank 1=======> (3 + 2) / 5]
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Info] Local rank: 0, total number of machines: 2
21/06/03 09:47:44 WARN TaskSetManager: Stage 13 contains a task of very large size (13606 KiB). The maximum recommended task size is 1000 KiB.
When I set numIterations=3000, it crashes at
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
#Stage 12:==================================> (3 + 2) / 5]
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000000012081da80, pid=29412, tid=0x000000000001e207
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C [lib_lightgbm.dylib+0x40a80] _ZNK8LightGBM16MultiValDenseBinItE18ConstructHistogramEiiPKfS3_Pd+0x50
#
# Core dump written. Default location: /cores/core or core.29412
#
# An error report file with more information is saved as:
# /path/to/hs_err_pid29412.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
INFO:py4j.java_gateway:Error while receiving.
Traceback (most recent call last):
File "/path/to/vendor_python/pypi__py4j/py4j/java_gateway.py", line 1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/path/to/vendor_python/pypi__py4j/py4j/java_gateway.py", line 1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
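The "Problematic frame" in the crash report above is a mangled C++ symbol, which is easier to read once demangled. A minimal sketch, assuming the `c++filt` tool (from binutils or the LLVM toolchain) is on the PATH; the helper name `demangle` is made up for this illustration and is not part of mmlspark or LightGBM:

```python
import shutil
import subprocess
from typing import Optional

# Symbol copied verbatim from the "Problematic frame" line of the crash report.
MANGLED = "_ZNK8LightGBM16MultiValDenseBinItE18ConstructHistogramEiiPKfS3_Pd"


def demangle(symbol: str) -> Optional[str]:
    """Demangle a C++ symbol with c++filt; return None if the tool is missing."""
    tool = shutil.which("c++filt")
    if tool is None:
        return None
    result = subprocess.run(
        [tool, symbol], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(demangle(MANGLED))
```

On a machine with `c++filt` installed, this should name a `ConstructHistogram` method on a `LightGBM::MultiValDenseBin` template instantiation, which points at the histogram-construction step of the native library rather than at Spark or the JVM.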
My dataframe has just 9 features, 5 of which are categorical. There are about 500k rows.
I suspect it is related to the use of categorical variables: without them, numIterations can go much higher (e.g. 12000). The cardinality of the categories may also matter.
To Reproduce
This example with fake data appears to crash when numIterations is high (3000). It works when numIterations=50 on my laptop.
import string

import numpy as np
import pandas as pd
import pyspark.ml
import pyspark.ml.feature
import pyspark.sql

spark = (
    pyspark.sql.SparkSession.builder.appName("bug-reproduction")
    .config("spark.some.config.option", "some-value")
    # ref: https://github.com/Azure/mmlspark/tree/6aecdf1c0c212950344f210f11aea2dfb8760009#python
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-88-45379694-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)

# Kept after the session is created, as in the original report.
import mmlspark.lightgbm

# 33 distinct category values, repeated to give about 500k rows.
slugs = list(string.ascii_letters[:33])
df = spark.createDataFrame(
    pd.DataFrame(
        {
            "f3": slugs * int(15151),
            "f4": np.random.random(size=int(15151 * 33)).tolist(),
            "label": np.random.random(size=int(15151 * 33)).tolist(),
        }
    )
)

cat_cols = ["f3"]
num_cols = ["f4"]

# Index each categorical column; the indexed column is named c_<col>.
string_indexers = [
    pyspark.ml.feature.StringIndexer(
        inputCol=col,
        outputCol=f"c_{col}",
        stringOrderType="alphabetAsc",
        handleInvalid="keep",
    )
    for col in cat_cols
]

# Assemble indexed categorical and numeric columns into one feature vector.
featurizer = pyspark.ml.feature.VectorAssembler(
    inputCols=[f"c_{col}" for col in cat_cols] + num_cols,
    outputCol="features",
    handleInvalid="keep",
)

regressor = mmlspark.lightgbm.LightGBMRegressor(
    numIterations=3000,
    learningRate=0.02,
    featuresCol="features",
    labelCol="label",
)

pipeline = pyspark.ml.Pipeline(
    stages=[
        *string_indexers,
        featurizer,
        regressor,
    ]
)
pipeline.fit(df)
The error looks like this:
[LightGBM] [Info] Local rank: 13, total number of machines: 16
[LightGBM] [Info] Local rank: 8, total number of machines: 16
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00000001277368a2, pid=39983, tid=0x000000000000d803
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C [lib_lightgbm.dylib+0x1378a2] .omp_outlined..20+0x152
#
# Core dump written. Default location: /cores/core or core.39983
#
# An error report file with more information is saved as:
# /path/to/hs_err_pid39983.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Expected behavior
Training should finish correctly with higher numIterations
Info (please complete the following information):
- MMLSpark Version:
com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-88-45379694-SNAPSHOT
- Spark Version [e.g. 3.0.1]
- Spark Platform [e.g. PySpark]
My Questions:
- Do I understand correctly that features transformed by StringIndexer will automatically be considered categorical variables?
- What's the possible cause of the crash, and what would be the fix, please?
@zyxue Sorry, I will have to take a look and try to reproduce this first.

"Do I understand correctly that features transformed by StringIndexer will automatically be considered categorical variables?"
Yes. If you use StringIndexer, the categorical metadata is propagated to the slots after VectorAssembler is run (by slots I mean the individual columns in the vectors). LightGBM in mmlspark automatically recognizes this metadata.

"What's the possible cause of the crash, and what would be the fix, please?"
Not sure yet; I will need to try to reproduce the issue. Based on this:
_ZNK8LightGBM16MultiValDenseBinItE18ConstructHistogramEiiPKfS3_Pd
It looks like there was some issue with constructing the histogram in the native lightgbm code.
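To check the categorical-metadata answer above on a concrete pipeline, one option is to inspect the metadata attached to the assembled vector column. A minimal sketch, assuming the usual Spark ML `ml_attr` layout in which categorical slots are listed under `nominal`; the helper `categorical_slots` and the sample dictionary are hand-written for illustration and are not taken from the reporter's run:

```python
from typing import Any, Dict, List, Tuple


def categorical_slots(metadata: Dict[str, Any]) -> List[Tuple[int, str, int]]:
    """Return (slot index, slot name, number of categories) for nominal slots."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    slots = []
    for attr in attrs.get("nominal", []):
        # Spark may store either the category values or only their count.
        n_values = len(attr["vals"]) if "vals" in attr else attr.get("num_vals", 0)
        slots.append((attr["idx"], attr.get("name", ""), n_values))
    return sorted(slots)


# Hand-written stand-in for the metadata of a "features" column
# (assumed shape; real metadata would list every category value).
example_metadata = {
    "ml_attr": {
        "attrs": {
            "nominal": [{"idx": 0, "name": "c_f3", "vals": ["a", "b", "c"]}],
            "numeric": [{"idx": 1, "name": "f4"}],
        },
        "num_attrs": 2,
    }
}

print(categorical_slots(example_metadata))  # [(0, 'c_f3', 3)]
```

With a real pipeline, the same function could be called on `transformed.schema["features"].metadata`, where `transformed` is a hypothetical name for the DataFrame produced by the StringIndexer and VectorAssembler stages. Slots that appear in the result are the ones carrying categorical metadata.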
I hit the same problem when numIterations is set to 31; when numIterations is set to 20, it's OK. I don't know why.