
LightGBM bug on big dataset

Open yongliu9975 opened this issue 3 years ago • 6 comments

Describe the bug

Everything is fine when I train my LightGBM model on a demo dataset (10k samples).

But when I switch to a big dataset (2.7M samples), I get a Java Runtime Environment error.

To Reproduce

This is my running command:

spark-submit --jars $JARS --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --executor-memory 500G --driver-memory 100G  train_on_spark.py

This is my code snippet:

# Imports used by this snippet (Spark ML + MMLSpark; the MMLSpark import
# path matches the 1.0.0-rc3 package layout)
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
from mmlspark.lightgbm import LightGBMClassifier, LightGBMRegressor

config_path = 'config.yaml'
config = Config(config_path)
# train_path is the folder path of the tfrecords data
df_train = spark.read.format("tfrecords").option("recordType", "Example").load(config.train_path)

label_col = config.label_column_name
feature_combination = config.feature_combination

# Note: Vectors.dense densifies every feature column, so any sparsity
# in the input arrays is lost at this point.
to_vector = udf(lambda a: Vectors.dense(a), VectorUDT())
raw_data = df_train.select([to_vector(col).alias(col) for col in feature_combination] + [label_col])

featurizer = VectorAssembler(inputCols=feature_combination, outputCol="features")
columns_to_model = ["features", label_col]

train_data = featurizer.transform(raw_data).select(columns_to_model)

params = translate_param_name_for_spark_model(param_dict=config.params, model_conf=config)

if config.task_type == "regression":
    model = LightGBMRegressor(**params)
elif config.task_type == "classification":
    model = LightGBMClassifier(**params)
else:
    raise ValueError(f"Unknown task_type: {config.task_type}")

checkpoint = model.fit(train_data)

Info

  • MMLSpark Version: mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT
  • Spark Version: 3.1.1

Stacktrace

...
[LightGBM] [Info] Connected to rank 5
[LightGBM] [Info] Local rank: 4, total number of machines: 6
[LightGBM] [Info] Local rank: 3, total number of machines: 6
[LightGBM] [Info] Local rank: 5, total number of machines: 6
[LightGBM] [Info] Local rank: 2, total number of machines: 6
[LightGBM] [Info] Local rank: 1, total number of machines: 6
[LightGBM] [Info] Local rank: 0, total number of machines: 6
21/06/01 18:00:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node09-cpu:40661 in memory (size: 2.6 KiB, free: 159.8 GiB)
21/06/01 18:00:11 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node09-cpu:40661 in memory (size: 5.1 KiB, free: 159.8 GiB)
21/06/01 18:13:08 INFO PythonUDFRunner: Times: total = 790901, boot = 20, init = 527, finish = 790354
21/06/01 18:13:08 INFO LightGBMRegressor: LightGBM task generating dense dataset with 394097 rows and 4719 columns
21/06/01 18:13:58 INFO PythonUDFRunner: Times: total = 841354, boot = 13, init = 582, finish = 840759
21/06/01 18:13:59 INFO LightGBMRegressor: LightGBM task generating dense dataset with 417444 rows and 4719 columns
21/06/01 18:14:35 INFO PythonUDFRunner: Times: total = 878028, boot = 36, init = 543, finish = 877449
21/06/01 18:14:35 INFO LightGBMRegressor: LightGBM task generating dense dataset with 435436 rows and 4719 columns
21/06/01 18:14:56 INFO PythonUDFRunner: Times: total = 898972, boot = 31, init = 527, finish = 898414
21/06/01 18:14:56 INFO LightGBMRegressor: LightGBM task generating dense dataset with 449106 rows and 4719 columns
21/06/01 18:16:02 INFO PythonUDFRunner: Times: total = 965008, boot = 8, init = 586, finish = 964414
21/06/01 18:16:02 INFO LightGBMRegressor: LightGBM task generating dense dataset with 496211 rows and 4719 columns
21/06/01 18:16:27 INFO PythonUDFRunner: Times: total = 990242, boot = 26, init = 502, finish = 989714
21/06/01 18:16:27 INFO LightGBMRegressor: LightGBM task generating dense dataset with 517806 rows and 4719 columns
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8f8df3a3a0, pid=534525, tid=0x00007f90908c8700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 )
# Problematic frame:
# C  [lib_lightgbm_swig.so+0x103a0]  Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_doubleArray_1setitem+0x0
#
# Core dump written. Default location: /gxr/liuyong/Projects/reorg_CPI/CPI-prediction/core or core.534525
#
# An error report file with more information is saved as:
# /gxr/liuyong/Projects/reorg_CPI/CPI-prediction/hs_err_pid534525.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted (core dumped)

AB#1205746

yongliu9975 avatar Jun 01 '21 11:06 yongliu9975

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

welcome[bot] avatar Jun 01 '21 11:06 welcome[bot]

@yongliu9975 sorry about the error you are seeing. Is this dataset sparse by chance? I see:

LightGBM task generating dense dataset with 394097 rows and 4719 columns

I wonder if the data is actually dense across all 4719 columns, or whether it should be treated as sparse because many of the values are zeros?

It looks like the error is happening when setting the array values based on this from SWIG:

Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_doubleArray_1setitem

I'm guessing it must be happening when setting the label/weights/initial scores columns after the dataset has been created on each node, since the initial LightGBM dataset has been successfully created on all 6 nodes based on this output:

21/06/01 18:16:27 INFO LightGBMRegressor: LightGBM task generating dense dataset with 517806 rows and 4719 columns

I'm not sure why it's happening, though; it needs more investigation.

imatiach-msft avatar Jun 02 '21 05:06 imatiach-msft

@imatiach-msft Yes, one of the features is sparse (1024 columns). Is there anything I can do about that?

yongliu9975 avatar Jun 02 '21 12:06 yongliu9975

Is it possible that I set the wrong data type for the labels in the dataset, one that happens to work on a small dataset?

yongliu9975 avatar Jun 02 '21 12:06 yongliu9975

@yongliu9975 there must be some bug in this:

 Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_doubleArray_1setitem+0x0

However, I don't see the full stack trace, so I'm not completely sure which line of code is calling it. But since I see this log line:

LightGBM task generating dense dataset with 517806 rows and 4719 columns

it must mean the crash comes from setting something after the dataset has been generated. Searching the code, it can only be this line:

https://github.com/Azure/mmlspark/blob/ae8004afc2924304ce554c1b67e1ad4c316c7100/src/main/scala/com/microsoft/ml/spark/lightgbm/dataset/LightGBMDataset.scala#L124

Actually, I think it can only come from the init score: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L178

Are you setting an init score somehow? I don't see it in your code. Setting the label column would have called the float version of that method instead of the double one.

imatiach-msft avatar Jun 03 '21 06:06 imatiach-msft

Thanks for your reply. I don't set an init score. I will try setting a different data type for the label column.

yongliu9975 avatar Jun 03 '21 08:06 yongliu9975