SynapseML
Lightgbm bug on big dataset
Describe the bug
Everything is fine when I train my LightGBM model on a demo dataset (10k samples). But when I change the demo dataset to a big one (2.7M samples), I get a Java Runtime Environment error.
To Reproduce
This is my running command:
spark-submit --jars $JARS --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --executor-memory 500G --driver-memory 100G train_on_spark.py
This is my code snippet:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
from mmlspark.lightgbm import LightGBMClassifier, LightGBMRegressor

config_path = 'config.yaml'
config = Config(config_path)
# train_path is the folder path of the tfrecords data
df_train = spark.read.format("tfrecords").option("recordType", "Example").load(config.train_path)
label_col = config.label_column_name
feature_combination = config.feature_combination
to_vector = udf(lambda a: Vectors.dense(a), VectorUDT())
raw_data = df_train.select([to_vector(col).alias(col) for col in feature_combination] + [label_col])
featurizer = VectorAssembler(inputCols=feature_combination, outputCol="features")
columns_to_model = ["features", label_col]
train_data = featurizer.transform(raw_data)[columns_to_model]
params = translate_param_name_for_spark_model(param_dict=config.params, model_conf=config)
if config.task_type == "regression":
    model = LightGBMRegressor(**params)
elif config.task_type == "classification":
    model = LightGBMClassifier(**params)
checkpoint = model.fit(train_data)
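Note that the `to_vector` UDF above densifies every feature column. If a feature block is mostly zeros, the arguments that `pyspark.ml.linalg.Vectors.sparse(size, indices, values)` expects can be derived with a small pure-Python helper (a sketch; `sparse_parts` is a hypothetical name, not part of the code above):

```python
def sparse_parts(values):
    """Return (size, indices, nonzeros): the arguments that
    pyspark.ml.linalg.Vectors.sparse(size, indices, values) expects."""
    indices = [i for i, v in enumerate(values) if v != 0.0]
    return len(values), indices, [float(values[i]) for i in indices]

# A mostly-zero feature block shrinks to its non-zero entries.
size, idx, vals = sparse_parts([0.0, 3.5, 0.0, 0.0, 1.0])
print(size, idx, vals)  # 5 [1, 4] [3.5, 1.0]
```

The UDF could then return `Vectors.sparse(*sparse_parts(a))` instead of `Vectors.dense(a)`.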
Info
- MMLSpark Version: mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT
- Spark Version: 3.1.1
Stacktrace
...
[LightGBM] [Info] Connected to rank 5
[LightGBM] [Info] Local rank: 4, total number of machines: 6
[LightGBM] [Info] Local rank: 3, total number of machines: 6
[LightGBM] [Info] Local rank: 5, total number of machines: 6
[LightGBM] [Info] Local rank: 2, total number of machines: 6
[LightGBM] [Info] Local rank: 1, total number of machines: 6
[LightGBM] [Info] Local rank: 0, total number of machines: 6
21/06/01 18:00:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node09-cpu:40661 in memory (size: 2.6 KiB, free: 159.8 GiB)
21/06/01 18:00:11 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node09-cpu:40661 in memory (size: 5.1 KiB, free: 159.8 GiB)
21/06/01 18:13:08 INFO PythonUDFRunner: Times: total = 790901, boot = 20, init = 527, finish = 790354
21/06/01 18:13:08 INFO LightGBMRegressor: LightGBM task generating dense dataset with 394097 rows and 4719 columns
21/06/01 18:13:58 INFO PythonUDFRunner: Times: total = 841354, boot = 13, init = 582, finish = 840759
21/06/01 18:13:59 INFO LightGBMRegressor: LightGBM task generating dense dataset with 417444 rows and 4719 columns
21/06/01 18:14:35 INFO PythonUDFRunner: Times: total = 878028, boot = 36, init = 543, finish = 877449
21/06/01 18:14:35 INFO LightGBMRegressor: LightGBM task generating dense dataset with 435436 rows and 4719 columns
21/06/01 18:14:56 INFO PythonUDFRunner: Times: total = 898972, boot = 31, init = 527, finish = 898414
21/06/01 18:14:56 INFO LightGBMRegressor: LightGBM task generating dense dataset with 449106 rows and 4719 columns
21/06/01 18:16:02 INFO PythonUDFRunner: Times: total = 965008, boot = 8, init = 586, finish = 964414
21/06/01 18:16:02 INFO LightGBMRegressor: LightGBM task generating dense dataset with 496211 rows and 4719 columns
21/06/01 18:16:27 INFO PythonUDFRunner: Times: total = 990242, boot = 26, init = 502, finish = 989714
21/06/01 18:16:27 INFO LightGBMRegressor: LightGBM task generating dense dataset with 517806 rows and 4719 columns
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f8f8df3a3a0, pid=534525, tid=0x00007f90908c8700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 )
# Problematic frame:
# C [lib_lightgbm_swig.so+0x103a0] Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_doubleArray_1setitem+0x0
#
# Core dump written. Default location: /gxr/liuyong/Projects/reorg_CPI/CPI-prediction/core or core.534525
#
# An error report file with more information is saved as:
# /gxr/liuyong/Projects/reorg_CPI/CPI-prediction/hs_err_pid534525.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted (core dumped)
@yongliu9975 sorry about the error you are seeing. Is this dataset sparse, by chance? I see:
LightGBM task generating dense dataset with 394097 rows and 4719 columns
I wonder whether the data is actually dense across 4719 columns, or whether it should be treated as sparse if there are a lot of zeros in there.
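For scale, a rough back-of-envelope estimate (my own arithmetic, not from the logs, assuming one 8-byte double per cell) of the dense buffer for the largest partition reported:

```python
rows, cols = 517_806, 4_719    # largest partition reported in the logs
bytes_per_cell = 8             # assuming a C double per cell
gib = rows * cols * bytes_per_cell / 2**30
print(f"{gib:.1f} GiB")        # about 18.2 GiB for one partition's dense copy
```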
It looks like the error happens when setting array values, based on this SWIG frame:
Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_doubleArray_1setitem
I'm guessing it must be happening when setting the label/weights/initial scores columns after the dataset has been created on each node, since the initial LightGBM dataset has been successfully created on all 6 nodes based on this output:
21/06/01 18:16:27 INFO LightGBMRegressor: LightGBM task generating dense dataset with 517806 rows and 4719 columns
I'm not sure why it's happening though. It seems it needs more investigation.
@imatiach-msft Yes, one of the features is sparse (1024 columns). Is there anything I can do about that?
Is it possible that I set the wrong data type for the labels in the dataset, one which just happens to work on a small dataset?
@yongliu9975 there must be some bug in this:
Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_doubleArray_1setitem+0x0
However, I don't see the full stack trace, so I'm not completely sure which line of code is calling it. But since I see the debug output:
LightGBM task generating dense dataset with 517806 rows and 4719 columns
it must mean the error comes from setting something after the dataset has been generated. Actually, doing a search, it can only be this line:
https://github.com/Azure/mmlspark/blob/ae8004afc2924304ce554c1b67e1ad4c316c7100/src/main/scala/com/microsoft/ml/spark/lightgbm/dataset/LightGBMDataset.scala#L124
Actually, I think it can only come from the init score: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L178
Are you setting the init score somehow? I don't see it in your code. Setting the label column would have called the float version of that method instead of the double version.
Thanks for your reply. I don't set the init score. I will try setting a different data type for the label column.
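Since the label path uses the single-precision (float) setter, here is one quick stdlib check of what that narrowing does to label values, no Spark needed (a sketch, not from the thread):

```python
import struct

def as_float32(x):
    """Round-trip a Python float through IEEE-754 single precision,
    mirroring the narrowing a 32-bit float label setter performs."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(as_float32(1.0) == 1.0)   # True: integral labels survive exactly
print(as_float32(0.1) == 0.1)   # False: fractional labels lose precision
```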