SynapseML mmlspark.lightgbm.LightGBMRegressor crashes when numIterations is high

Open zyxue opened this issue 3 years ago • 3 comments

Describe the bug: The program crashes when numIterations gets higher.

When I set numIterations=100, the training works, but I already see a warning message at stage 13 like this:

[LightGBM] [Info] Connected to rank 1=======>                       (3 + 2) / 5]
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Info] Local rank: 0, total number of machines: 2
21/06/03 09:47:44 WARN TaskSetManager: Stage 13 contains a task of very large size (13606 KiB). The maximum recommended task size is 1000 KiB.

When I set numIterations=3000, it crashes with:

[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
#Stage 12:==================================>                       (3 + 2) / 5]
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000012081da80, pid=29412, tid=0x000000000001e207
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [lib_lightgbm.dylib+0x40a80]  _ZNK8LightGBM16MultiValDenseBinItE18ConstructHistogramEiiPKfS3_Pd+0x50
#
# Core dump written. Default location: /cores/core or core.29412
#
# An error report file with more information is saved as:
# /path/to/hs_err_pid29412.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
INFO:py4j.java_gateway:Error while receiving.
Traceback (most recent call last):
  File "/path/to/vendor_python/pypi__py4j/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/path/to/vendor_python/pypi__py4j/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

My dataframe has just 9 features, 5 of which are categorical. There are about 500k rows.

I feel it may be related to use of categorical variables, without which the numIterations can go much higher (e.g. 12000). Also, the cardinality of categories may also matter.

To Reproduce: This example with fake data appears to crash when numIterations is high (3000). It works when numIterations=50 on my laptop.

import string

import numpy as np
import pandas as pd
import pyspark.ml
import pyspark.ml.feature
import pyspark.sql

spark = (
    pyspark.sql.SparkSession.builder.appName("bug-reproduction")
    .config("spark.some.config.option", "some-value")
    # ref: https://github.com/Azure/mmlspark/tree/6aecdf1c0c212950344f210f11aea2dfb8760009#python
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-88-45379694-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)

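# Imported after the session is created, so that the package pulled in
# through spark.jars.packages is available.
import mmlspark.lightgbm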

slugs = list(string.ascii_letters[:33])

df = spark.createDataFrame(
    pd.DataFrame(
        {
            "f3": slugs * int(15151),
            "f4": np.random.random(size=int(15151 * 33)).tolist(),
            "label": np.random.random(size=int(15151 * 33)).tolist(),
        }
    )
)

cat_cols = ["f3"]
num_cols = ["f4"]

string_indexers = [
    pyspark.ml.feature.StringIndexer(
        inputCol=col,
        outputCol=f"c_{col}",
        stringOrderType="alphabetAsc",
        handleInvalid="keep",
    )
    for col in cat_cols
]


featurizer = pyspark.ml.feature.VectorAssembler(
    inputCols=[f"c_{col}" for col in cat_cols] + num_cols,
    outputCol="features",
    handleInvalid="keep",
)

regressor = mmlspark.lightgbm.LightGBMRegressor(
    numIterations=3000,
    learningRate=0.02,
    featuresCol="features",
    labelCol="label",
)

pipeline = pyspark.ml.Pipeline(
    stages=[
        *string_indexers,
        featurizer,
        regressor,
    ]
)

pipeline.fit(df)

The error looks like this:

[LightGBM] [Info] Local rank: 13, total number of machines: 16
[LightGBM] [Info] Local rank: 8, total number of machines: 16
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001277368a2, pid=39983, tid=0x000000000000d803
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [lib_lightgbm.dylib+0x1378a2]  .omp_outlined..20+0x152
#
# Core dump written. Default location: /cores/core or core.39983
#
# An error report file with more information is saved as:
# /path/to/hs_err_pid39983.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Expected behavior: Training should finish correctly with higher numIterations.

Info (please complete the following information):

MMLSpark Version: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-88-45379694-SNAPSHOT
Spark Version [e.g. 3.0.1]
Spark Platform [e.g. PySpark]

My Questions:

Do I understand correctly that transformed features by StringIndexer will be automatically considered categorical variables?
What's the possible cause of the crash, and what would be the fix, please?
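For the first question, one way to check empirically is to look at the ML attribute metadata that Spark attaches to the assembled vector column. The sketch below is only an illustration: the helper name categorical_slots and the example dictionary are made up here, and the dictionary is hand-written to mirror the "ml_attr" layout that Spark ML stores in a column's metadata.

```python
def categorical_slots(metadata: dict) -> dict:
    """Map slot index -> slot name for every slot marked nominal.

    `metadata` is expected to look like the dict found on a Spark ML
    vector column, i.e. {"ml_attr": {"attrs": {"nominal": [...], ...}}}.
    """
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    return {a["idx"]: a.get("name") for a in attrs.get("nominal", [])}


# Hand-written stand-in for the metadata of the "features" column
# (illustrative values, not copied from a real run).
example = {
    "ml_attr": {
        "attrs": {
            "nominal": [{"idx": 0, "name": "c_f3", "vals": ["A", "B", "a"]}],
            "numeric": [{"idx": 1, "name": "f4"}],
        },
        "num_attrs": 2,
    }
}

print(categorical_slots(example))
```

With a real DataFrame, the same helper could be fed `transformed.schema["features"].metadata`, where `transformed` is the output of the StringIndexer and VectorAssembler stages (without the regressor) applied to `df`.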

AB#1209504

Jun 03 '21 18:06 zyxue

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

Jun 03 '21 18:06 welcome[bot]

@zyxue sorry I will have to take a look and try to reproduce this first.

"Do I understand correctly that transformed features by StringIndexer will be automatically considered categorical variables?"

Yes, if you use string indexer, the categorical metadata will be propagated to the slots after vector assembler is run (by slots I mean the individual columns in the vectors). LightGBM in mmlspark automatically recognizes this metadata.

"What's the possible cause of the crash, and what would be the fix, please?"

Not sure yet, will need to try and reproduce the issue. Based on this:

_ZNK8LightGBM16MultiValDenseBinItE18ConstructHistogramEiiPKfS3_Pd

It looks like there was some issue with constructing the histogram in the native lightgbm code.
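The frame name is an Itanium C++ ABI mangled symbol, in which each identifier is written as its length followed by its characters. As a rough illustration of how the readable pieces can be read out of it, the sketch below walks only the nested-name part of such a symbol. The function name nested_name_components is invented for this example, and it is not a full demangler (it skips template arguments naively and ignores the parameter list); a tool such as c++filt does the complete job.

```python
def nested_name_components(mangled: str) -> list:
    """Return the length-prefixed identifiers of an Itanium nested name.

    Minimal sketch: handles `_ZN [qualifiers] <len><name>... E` and skips
    simple template argument lists (`I ... E`). Anything else raises.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("not an Itanium nested name")
    i = 3
    # Skip cv-qualifiers on the member function (restrict/volatile/const).
    while i < len(mangled) and mangled[i] in "rVK":
        i += 1
    parts = []
    while i < len(mangled) and mangled[i] != "E":
        ch = mangled[i]
        if ch.isdigit():
            j = i
            while j < len(mangled) and mangled[j].isdigit():
                j += 1
            length = int(mangled[i:j])
            parts.append(mangled[j:j + length])
            i = j + length
        elif ch == "I":
            # Naive skip: assumes the template arguments are builtin
            # type codes, so the next "E" closes the list.
            i = mangled.index("E", i) + 1
        else:
            raise ValueError(f"unsupported construct at offset {i}")
    return parts


symbol = "_ZNK8LightGBM16MultiValDenseBinItE18ConstructHistogramEiiPKfS3_Pd"
print("::".join(nested_name_components(symbol)))
```

The components come out as the namespace, the class, and the member function, which is what points at histogram construction in the native library.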

Jun 16 '21 15:06 imatiach-msft

I hit the same problem when numIterations is set to 31; when numIterations is set to 20 it's OK. But I don't know why.
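To narrow down where between a working and a crashing value the failure starts, a plain bisection over numIterations is one option. The sketch below is hypothetical: smallest_failing and the stand-in predicate are invented for illustration, and it assumes that every value above the threshold fails, which this thread suggests but does not confirm. In a real run the predicate would launch the reproduction script in a fresh process for each trial (for example with subprocess.run) and check the exit code, because the SIGSEGV takes the JVM down with it.

```python
from typing import Callable


def smallest_failing(lo: int, hi: int, fails: Callable[[int], bool]) -> int:
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n (once it fails, it keeps failing for larger n).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Stand-in predicate with a made-up threshold, used only to exercise
# the search; it does not reflect any measured behaviour.
def pretend_fails(n: int) -> bool:
    return n >= 27


print(smallest_failing(20, 31, pretend_fails))
```

Each call to the predicate costs one full training run, so the bisection needs roughly log2(hi - lo) runs rather than one per candidate value.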

Jun 21 '21 09:06 mengban
