SynapseML
How to migrate local LightGBM to Spark LightGBM without changing model behavior?
Describe the bug Hi, I have tried to migrate a local Python lightgbm model to Spark lightgbm. It successfully trained a model, but I got quite different results when predicting.
To Reproduce
- The parameters are different. In the local Python lightgbm they are:
{
'objective': 'huber',
'alpha': 0.15,
'boosting': 'gbdt',
'num_iterations': 1000,
'learning_rate': 0.1,
'max_depth': -1,
'bin_construct_sample_cnt': 200000,
'min_gain_to_split': 0,
'min_child_weight': 0.001,
'min_data_in_leaf': 20,
'bagging_fraction': 1,
'bagging_freq': 0,
'feature_fraction': 1,
'lambda_l1': 0,
'lambda_l2': 0,
'tree_learner': 'data',
'num_leaves': 31,
'num_threads': 0,
'metric': 'huber'
}
In the mmlspark lightgbm I tried to use LightGBMRegressor:
LightGBMRegressor(
alpha=0.15,
baggingFraction=1.0,
baggingFreq=0,
# baggingSeed=3,
# boostFromAverage=True,
boostingType='gbdt',
# categoricalSlotIndexes=None,
# categoricalSlotNames=None,
# defaultListenPort=12400,
earlyStoppingRound=5,
featureFraction=1.0,
lambdaL1=0.0,
lambdaL2=0.0,
learningRate=0.1,
# maxBin=255,
maxDepth=-1,
# minSumHessianInLeaf=0.001,
# modelString='',
numIterations=1000,
numLeaves=31,
objective='huber',
# parallelism='data_parallel',
# timeout=1200.0,
# tweedieVariancePower=1.5,
# verbosity=1,
# featuresCol='features',
# labelCol='label',
# predictionCol='prediction',
# validationIndicatorCol=None,
# weightCol=None,
)
But there are still parameters for which I could not find a matching setting:
# 'bin_construct_sample_cnt': 200000,
# 'min_gain_to_split': 0,
# 'min_child_weight': 0.001,
# 'min_data_in_leaf': 20,
# 'num_threads': 0,
# 'tree_learner': 'data',
# 'metric': 'huber'
I can't find more details in the doc http://mmlspark.azureedge.net/docs/pyspark/LightGBMRegressor.html (a tentative name mapping is sketched below).
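One of these may already have a direct counterpart: in LightGBM, min_child_weight is an alias of min_sum_hessian_in_leaf, which looks like the commented-out minSumHessianInLeaf above. A tentative mapping for the rest, purely as my assumption to verify against the installed mmlspark version, would be:
# Hypothetical mapping from local lightgbm parameter names to LightGBMRegressor
# settings; None marks parameters for which I could not find a dedicated setter.
param_mapping = {
    'min_child_weight': 'minSumHessianInLeaf',   # alias of min_sum_hessian_in_leaf
    'tree_learner': 'parallelism',               # 'data' roughly corresponds to 'data_parallel'
    'metric': 'metric',                          # exposed in newer builds (see the reply below)
    'num_threads': None,                         # thread usage is governed by the Spark executors (assumption)
    'bin_construct_sample_cnt': None,
    'min_gain_to_split': None,
    'min_data_in_leaf': None,
}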
- The training code is different. In the local lightgbm it is:
import lightgbm as lgb

sdf1, sdf2 = sdf.randomSplit([0.9, 0.1])
sdf1x = sdf1.select(config['feature_selection'])
sdf1y = sdf1.select(config['label_column_name'])
sdf2x = sdf2.select(config['feature_selection'])
sdf2y = sdf2.select(config['label_column_name'])
pdf2x = sdf2x.toPandas()
pdf2y = sdf2y.toPandas()
pdf1x = sdf1x.toPandas()
pdf1y = sdf1y.toPandas()
bst = None
train_dataset = lgb.Dataset(
pdf1x,
label=pdf1y,
feature_name=config['feature_selection'],
categorical_feature=config['feature_selection']
)
validate_dataset = lgb.Dataset(
pdf2x,
label=pdf2y,
feature_name=config['feature_selection'],
categorical_feature=config['feature_selection'],
)
bst = lgb.train(
config['training_parameters'],
train_dataset,
valid_sets=validate_dataset,
early_stopping_rounds=5,
init_model=bst
)
In the mmlspark lightgbm it is:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
feature_cols = config['feature_selection']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
pipeline = Pipeline(stages=[assembler, lgb])
model = pipeline.fit(sdf)
I found the fit() function doesn't have a parameter for the validation set. So how does the earlyStoppingRound parameter work without a validation set?
Are there any examples or docs about the differences or the training details?
Expected behavior Currently the prediction results of the two implementations differ significantly. I expect a way to get two models with similar output.
Info (please complete the following information):
- MMLSpark Version: mmlspark_2.11:1.0.0-rc1
- Spark Version: 2.3.7
- Spark Platform: our company's self-built Spark cluster
Additional context The training set scales to billions of rows, so local lightgbm on a single machine is not sufficient.
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
@silver6wings sorry about the trouble you are having. Do you need to retrain the model? You can pass the native lightgbm model file from python into the mmlspark-wrapped one: take the model you already trained in python and import it for distributed prediction. If you want to retrain a model from scratch, you can do that too. I agree a few parameters are missing; you can get around this for now by passing them through any string parameter with a hack, for example:
objective="regression, bin_construct_sample_cnt=200000"
However, some of the parameters you mentioned do exist, for example the metric parameter here:
https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L310
Maybe you are not using the latest version? Can you try the version from master? This is one of the latest builds:
Maven Coordinates
com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-96-fce3c952-SNAPSHOT
Maven Resolver
https://mmlspark.azureedge.net/maven
"I found the fit() function didn't have a parameter for the validation set. So how does the early_stopping round parameter working without validation set?
Does it have any examples or docs about the difference or the training details?"
There is a "validationIndicatorCol" boolean column, it is used like this: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L194
some of the references mentioned in this recent issue might help too: https://github.com/Azure/mmlspark/issues/876
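As a minimal sketch of wiring that up (column and config names are borrowed from the snippets above, so treat it as an illustration rather than a canonical example), rows where the indicator is true are used as the validation set for early stopping and the rest for training:
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from mmlspark.lightgbm import LightGBMRegressor

# Mark roughly 10% of rows as validation, mirroring the local randomSplit([0.9, 0.1]).
sdf_with_val = sdf.withColumn('validationCol', F.rand(seed=42) < 0.1)

assembler = VectorAssembler(inputCols=config['feature_selection'], outputCol='features')
lgb_regressor = LightGBMRegressor(
    objective='huber',
    alpha=0.15,
    numIterations=1000,
    earlyStoppingRound=5,
    labelCol=config['label_column_name'],
    validationIndicatorCol='validationCol',  # boolean: True => row used for validation
)
model = Pipeline(stages=[assembler, lgb_regressor]).fit(sdf_with_val)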
Thanks for your reply! I will try it in the next step.
- "Do you need to retrain the model?" Yes, I want to retrain the model using a much larger training set, and the classic pandas-based lightgbm does not support running on a distributed cluster. So I will train it on the Spark cluster and use the model locally.
- For the validation col, I have a more detailed question. I saw the column is Boolean type; does it mean that if this field is True the row is used for validation, and if False it is used for training? Do I need to transform the values into numbers like 0/1 in the pyspark dataframe, or can I use them directly?
BTW, I did use the spark-submit config below; it should be the newest version.
--repositories=https://mmlspark.azureedge.net/maven \
--packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
@imatiach-msft Hi, for the hack parameters, I tried
objective='huber, bin_construct_sample_cnt=200000, min_gain_to_split=0, min_child_weight=0.001, min_data_in_leaf=20, tree_learner=data, num_threads=0',
but got an error like: min_data_in_leaf should be int, but got "20,"
And I tried
objective='huber bin_construct_sample_cnt=200000 min_gain_to_split=0 min_child_weight=0.001 min_data_in_leaf=20 tree_learner=data num_threads=0',
The retraining succeeded, but the resulting model still shows a gap in behavior compared with the classic lightgbm one. I used the version below:
--repositories=https://mmlspark.azureedge.net/maven \
--packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
Now the full set of parameters is as below:
LightGBMRegressor(
alpha=0.15,
baggingFraction=1.0,
baggingFreq=0,
# baggingSeed=3,
# boostFromAverage=True,
boostingType='gbdt',
# categoricalSlotIndexes=None,
# categoricalSlotNames=None,
# defaultListenPort=12400,
earlyStoppingRound=5,
featureFraction=1.0,
lambdaL1=0.0,
lambdaL2=0.0,
learningRate=0.1,
# maxBin=255,
maxDepth=-1,
metric='huber',
# minSumHessianInLeaf=0.001,
# modelString='',
numIterations=1000,
numLeaves=31,
objective='huber, bin_construct_sample_cnt=200000, min_gain_to_split=0, min_child_weight=0.001, min_data_in_leaf=20, tree_learner=data, num_threads=0',
# parallelism='data_parallel',
# timeout=1200.0,
# tweedieVariancePower=1.5,
# verbosity=1,
featuresCol='features',
labelCol='label',
predictionCol='prediction',
validationIndicatorCol='validationCol',
# weightCol=None,
)
And the classic lightgbm parameters are:
'training_parameters': {
'alpha': 0.15,
'bagging_fraction': 1,
'bagging_freq': 0,
'bin_construct_sample_cnt': 200000,
'boosting': 'gbdt',
'feature_fraction': 1,
'lambda_l1': 0,
'lambda_l2': 0,
'learning_rate': 0.1,
'max_depth': -1,
'metric': 'huber',
'min_gain_to_split': 0,
'min_child_weight': 0.001,
'min_data_in_leaf': 20,
'num_leaves': 31,
'num_iterations': 1000,
'num_threads': 0,
'objective': 'huber',
'tree_learner': 'data',
},
Should I expect that the same training set and parameters will produce similar model output?
@silver6wings exactly, there should be no commas, just like this: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainParams.scala#L46
" but the outputed model still have effect gap with the classic lightgbm one." I wonder is there is some other difference in parameters. How do the models differ?
objective='huber, bin_construct_sample_cnt=200000, min_gain_to_split=0, min_child_weight=0.001, min_data_in_leaf=20, tree_learner=data, num_threads=0'
I think this shouldn't have commas based on the link I sent you
"For the validation col, I have a more detailed question. I saw the column is Boolean type; does it mean that if this field is True the row is used for validation, and if False it is used for training? Do I need to transform the values into numbers like 0/1 in the pyspark dataframe, or can I use them directly?"
It can be int, double or boolean type, but it is cast to boolean. This is how it is used: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L194 This is where it is cast: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L138
@silver6wings The parameters do look similar. I am wondering if it might be due to early stopping, as that is implemented from scratch in mmlspark. Some questions:
1. Is the data exactly the same? How large is the dataset (number of rows and features)?
2. How much does the model output differ?
3. Since we are doing distributed training, it's possible the output differs because of how the data is distributed and how histograms are calculated from the data per partition.
@silver6wings any way for me to get some repro to look into the issue?
@imatiach-msft I'm deep diving into the root cause; my current guess is that it was caused by the training data set being sparse after random splitting.
@imatiach-msft Hi, I still have not found the root cause of the differences between the two training approaches.
The model's task is to select a CDN smartly under specific conditions.
Here are the resources to repro it. For the test data, I selected 10000 rows, factorized them, and saved them in a JSON file: test_data.json.zip
I trained models on the same data set in the two ways using the code below: test_mmlspark.py.zip
After training completed, I used lightgbm.txt and /lightgbm/stages/1_LightGBMRegressor_xxx/data_1 (with the characters before 'tree' manually removed) to do the comparison: test_pred.py.zip
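A rough sketch of that comparison step (file names and column selection here are assumptions based on the description above, not the exact contents of test_pred.py):
import lightgbm as lgb
import pandas as pd

# Load both native model files: the local python one and the one extracted from the
# Spark pipeline directory (with the bytes before 'tree' stripped off).
bst_local = lgb.Booster(model_file='lightgbm.txt')
bst_spark = lgb.Booster(model_file='lightgbm_spark_cleaned.txt')  # hypothetical cleaned copy

test_x = pd.read_json('test_data.json')[feature_cols]  # feature_cols as in the repro script
diff = pd.DataFrame({
    'local': bst_local.predict(test_x),
    'spark': bst_spark.predict(test_x),
})
diff['absolute_diff'] = (diff['local'] - diff['spark']).abs()
print(diff.head())
print('mean absolute_diff:', diff['absolute_diff'].mean())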
You can see it got different results:

@silver6wings thank you for the repro, I will look into this hopefully soon
@silver6wings I had a chance to look into this today. It seems the difference is caused by the python-based lightgbm treating a lot of the columns as categoricals, while the mmlspark-based lightgbm treats all columns as numeric, based on the fact that the mmlspark lightgbm has num_cat=0 everywhere in the saved model. I will update more as I continue to look into it.
@silver6wings by adding the categoricalSlotIndexes=[0,1,2] parameter I was able to get almost identical results; still looking into the slight remaining difference:
return LightGBMRegressor(
alpha=0.15,
baggingFraction=1.0,
baggingFreq=0,
# baggingSeed=3,
# boostFromAverage=True,
boostingType='gbdt',
categoricalSlotIndexes=[0,1,2],
....
       A-af      B-af      A-sf      B-sf  absolute_diff  relative_diff
0  0.007375  0.008355  0.001388  0.001074       0.000647       0.142169
1  0.001806  0.001666  0.001908  0.002038       0.000135       0.072941
2  0.000930  0.000789  0.000925  0.000761       0.000152       0.178918
3  0.000610  0.000459  0.000748  0.000708       0.000095       0.150985
4  0.001209  0.000891  0.000983  0.000720       0.000291       0.306022
..      ...       ...       ...       ...            ...            ...
95 0.001927  0.001955  0.003437  0.003445       0.000018       0.006647
96 0.001696  0.001970  0.000908  0.000849       0.000167       0.122890
97 0.001029  0.000868  0.001024  0.001121       0.000129       0.127983
98 0.007281  0.007870  0.002462  0.002464       0.000296       0.058914
99 0.005725  0.006117  0.004695  0.005144       0.000421       0.077612

[100 rows x 6 columns]
100  absolute_diff: 0.00023218528296976653  relative_diff: 0.15840692750538335
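For reference, a small sketch of declaring the same categorical columns on both sides (the indexes [0, 1, 2] come from the snippet above; everything else is illustrative):
import lightgbm as lgb
from mmlspark.lightgbm import LightGBMRegressor

# python lightgbm: mark the same columns as categorical by index
train_dataset = lgb.Dataset(pdf1x, label=pdf1y, categorical_feature=[0, 1, 2])

# mmlspark: mark the matching slots of the assembled feature vector as categorical
regressor = LightGBMRegressor(
    objective='huber',
    categoricalSlotIndexes=[0, 1, 2],
)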
@silver6wings another difference I noticed is that bin_construct_sample_cnt is a dataset parameter for lightgbm, so it can't be passed the way it is done in the notebook
@silver6wings I updated to rc2 in the notebook which has this as a new parameter and set:
binSampleCount=200000,
This seems to lead to closer results, but there is still a slight difference:
       A-af      B-af      A-sf      B-sf
0  0.007403  0.007062  0.001269  0.001215
1  0.001784  0.001779  0.001781  0.001912
2  0.000867  0.000654  0.000774  0.000676
3  0.000759  0.000917  0.000356  0.000193
4  0.001266  0.001121  0.000563  0.000435
..      ...       ...       ...       ...
95 0.001690  0.001873  0.003303  0.003343
96 0.001599  0.002798  0.001011  0.000958
97 0.000789  0.000529  0.001765  0.001353
98 0.004088  0.003939  0.002505  0.002492
99 0.004647  0.004228  0.005227  0.005387

[100 rows x 4 columns]
100  absolute_diff: 0.00019000222360903717  relative_diff: 0.1399695661281016
@silver6wings interestingly, when I upgrade lightgbm to the latest version 3.0.0 I see an even larger difference in the predictions. I wonder if I need to ensure the native bits are exactly the same for a fair comparison between the python package and mmlspark.
@imatiach-msft Thank you very much for your effort! ❤ Now I can run a new round of tests with these changes. My next goal is to compare which model has less loss on the prediction tasks; if mmlspark does, it will be the better choice.
Why can't we use a Spark DataFrame directly as input for LightGBM?
I am currently encountering this difference as well. I am using an imbalanced dataset where the synapseml lightgbm classifier performs very poorly (0.53 AUC) even with isUnbalance set to True, whereas in Python lightgbm I am getting 0.70+ AUC.
It seems the out-of-the-box parameters of recent versions of Python's lightgbm are much better than synapseml's.
What are the parameters that I can tweak so I can achieve results closer to the Python version given that I am dealing with an imbalanced dataset?
EDIT: My problem is more related to #1478 it seems.
For poor souls like me encountering this problem, kindly refer to the SNAPSHOT build mentioned in the #1478 thread for the fix. Version 0.9.6 will address this AUC loss bug, as per @imatiach-msft's comments.