LightGBMRegressor not deterministic with deterministic=True, seed=777, force_col_wise=True
SynapseML version
com.microsoft.azure:synapseml_2.12:0.11.0
System information
- Language version (e.g. python 3.8, scala 2.12): Python 3.9.5, Scala 2.12
- Spark Version (e.g. 3.2.3): 3.3.0
- Spark Platform (e.g. Synapse, Databricks): Databricks
Databricks version is "11.3 LTS ML (includes Apache Spark 3.3.0, Scala 2.12)"
I am running up to 10 worker nodes of type "i3.4xlarge" (122 GB Memory, 16 cores) and my driver type is "i3.8xlarge" (244 GB Memory, 32 cores).
Describe the problem
I am not sure if this is a bug or just an input I am missing. I am training on a large dataset and running the same code twice does not give me the same output. I have recreated the problem below with a toy dataset.
On smallish datasets (say 300K rows) I get deterministic results, but when I increase the size of the dataset (to say 3 million rows), the results are no longer deterministic. Note that they are not always different, so the code below will sometimes produce the same result for me, but sometimes prediction_one does not equal prediction_two. I would say they are unequal more often than not.
I was under the impression that to have deterministic results you needed to set:
- deterministic=True
- seed=777
- force_col_wise=True
Is there more you need to do to get deterministic results?
Code to reproduce issue
from synapse.ml.lightgbm import LightGBMRegressor
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler

# create data
np.random.seed(42)
num_features = 88
data_length = 3_000_000
data = pd.DataFrame({f'feature_{i}': np.random.random(data_length) for i in range(num_features)})
data['label'] = 0
for feature_num in range(num_features):
    data[f'feature_{feature_num}'] *= np.where(np.random.random(data_length) < 0.5, 0, 1)
    data['label'] += data[f'feature_{feature_num}'] ** feature_num
feature_names = [i for i in data if i != 'label']

# convert to trainable format
train = spark.createDataFrame(data)
featurizer = VectorAssembler(inputCols=feature_names, outputCol="features")
train_data = featurizer.transform(train)['label', "features"]

# fit a model on the data and then calculate its predicted value for the data
# if deterministic we would always expect this to give the same output for the same input
def fit_model_and_get_predictions(input_train_data):
    model = LightGBMRegressor(
        objective="regression", learningRate=0.1, numLeaves=30, deterministic=True, numIterations=200, seed=777,
    ).fit(input_train_data)
    model.passThroughArgs = "force_col_wise=True"
    predictions = [i[0] for i in model.transform(train_data).select('prediction').collect()]
    return predictions

# run the same code twice
prediction_one = fit_model_and_get_predictions(train_data)
prediction_two = fit_model_and_get_predictions(train_data)

# predictions are different, below should evaluate as False
# it may sometimes evaluate as True, as the non-determinism itself seems non-deterministic,
# but a rerun (or a few reruns) should return False. For me it is False about 3 out of 4 times
prediction_one == prediction_two
Other info / logs
No response
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [ ] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [X] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [X] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/synapse: Azure Synapse integrations
- [ ] integrations/azureml: Azure ML integrations
- [X] integrations/databricks: Databricks integrations
Hey @TwoFitMitch :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
@svotaw Can you look into this LightGBM question?
this is more for @imatiach-msft
I just realised that above I am setting "force_col_wise=True" after running the fit, which presumably has no effect. However, I just tried changing the function to the below and the non-determinism remained, so I don't think that is the cause.
def fit_model_and_get_predictions(input_train_data):
    model = LightGBMRegressor(
        objective="regression", learningRate=0.1, numLeaves=30, deterministic=True, numIterations=200, seed=777,
    )
    model.passThroughArgs = "force_col_wise=True"
    model = model.fit(input_train_data)
    predictions = [i[0] for i in model.transform(train_data).select('prediction').collect()]
    return predictions
I believe those are the parameters to make LightGBM deterministic. I'm wondering if the issue is with how you are creating the spark dataframe. Perhaps this link might help:
https://stackoverflow.com/questions/55468810/how-do-i-get-deterministic-random-ordering-in-pyspark
When you are doing this conversion and making the data distributed:
train = spark.createDataFrame(data)
I believe different rows could go to different machines/executors, in different orders/ways, which may be causing the non-deterministic behavior.
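A quick way to observe this (a sketch using plain PySpark, not something from the original repro) is to tag each row with the partition it landed in and compare across two identical runs:

from pyspark.sql.functions import spark_partition_id

# if this per-partition histogram differs between two identical runs,
# the row-to-partition assignment is not deterministic
train_data.withColumn("pid", spark_partition_id()).groupBy("pid").count().orderBy("pid").show()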
There is also a chance that some non-deterministic code was added since I last looked into this issue and verified that LightGBM was deterministic. But that code looks very suspicious to me. If one can save the data to parquet format, then reload it from disk and still see non-deterministic behavior on a restarted cluster with the same number of machines, that would worry me much more.
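That experiment would look roughly like this (a sketch; the path is illustrative):

# write once, then reload so every run starts from the same on-disk layout
train_data.write.mode("overwrite").parquet("/tmp/train_data_parquet")
train_data = spark.read.parquet("/tmp/train_data_parquet")
# ...then run fit_model_and_get_predictions twice against the reloaded DataFrame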
I'm betting this would fix it for sure:
train_data = train_data.cache()
# run the same code twice
prediction_one = fit_model_and_get_predictions(train_data)
prediction_two = fit_model_and_get_predictions(train_data)
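One caveat with this suggestion (not something verified in the thread): cache() is lazy, so the first fit call is what actually materializes the cached partitions. Forcing materialization up front makes that explicit:

train_data = train_data.cache()
train_data.count()  # action that populates the cache before either fit runs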
will need to take a look when I have some free time
Thank you very much for your response; adding the cache line does fix the problem. After reading the link you provided, I am a little surprised to find that adding this ordering does not fix it (from observation it seems the VectorAssembler removes any ordering, which is why I am ordering after the transform):
# convert to trainable format
train = spark.createDataFrame(data)
featurizer = VectorAssembler(inputCols=feature_names, outputCol="features")
train_data = featurizer.transform(train)['label', "features"]
train_data = train_data.orderBy(['label', "features"])
Is this expected/do you think there is any way to achieve the determinism without caching the train_data?
For my use case, I am pulling data from a Delta table in Databricks and then fitting a model on it, and it would be good if doing this twice resulted in the same output. If the way to achieve this is by caching, then I'm guessing there is no way to achieve determinism across cluster restarts, different clusters, or different reads of the same data. I created the below example, which more accurately reflects my use case (and, I would think, a reasonably standard one) where caching is not an option, since we pull the data fresh for each fit, and I still get non-deterministic results. But I guess if Databricks reads/stores the data in a non-deterministic way, there may be no way around this.
from synapse.ml.lightgbm import LightGBMRegressor
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler

# create data
np.random.seed(42)
num_features = 88
data_length = 3_000_000
data = pd.DataFrame({f'feature_{i}': np.random.random(data_length) for i in range(num_features)})
data['label'] = 0
for feature_num in range(num_features):
    data[f'feature_{feature_num}'] *= np.where(np.random.random(data_length) < 0.5, 0, 1)
    data['label'] += data[f'feature_{feature_num}'] ** feature_num
feature_names = [i for i in data if i != 'label']

# convert to trainable format and write data
train = spark.createDataFrame(data)
save_location = "/mnt/workspace/twofitmitch/non_determistic_example"
train.write.format("delta").save(save_location)

# pull data, fit a model on the data and then calculate its predicted value for the data
# if deterministic we would always expect this to give the same output for the same input
def fit_model_and_get_predictions():
    train = spark.read.format("delta").load(save_location).orderBy(['label'] + feature_names)
    featurizer = VectorAssembler(inputCols=feature_names, outputCol="features")
    train_data = featurizer.transform(train)['label', "features"]
    train_data = train_data.orderBy(['label', "features"])
    model = LightGBMRegressor(
        objective="regression", learningRate=0.1, numLeaves=30, deterministic=True, numIterations=200, seed=777,
    )
    model.passThroughArgs = "force_col_wise=True"
    model = model.fit(train_data)
    predictions = [i[0] for i in model.transform(train_data).select('prediction').collect()]
    return predictions

# run the same code twice
prediction_one = fit_model_and_get_predictions()
prediction_two = fit_model_and_get_predictions()
# predictions are different, below should evaluate as False
# it may sometimes evaluate as True, as the non-determinism itself seems non-deterministic,
# but a rerun (or a few reruns) should return False. For me it is False about 3 out of 4 times
prediction_one == prediction_two
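One avenue that might avoid caching (an untested sketch; save_location_v2, the row_id column, and the partition count of 64 are hypothetical choices, not from the thread): persist an explicit row id alongside the data, then impose a partitioning and sort that are pure functions of that id on every read, so LightGBM sees the same rows in the same partitions in the same order regardless of how Delta happens to return them:

from pyspark.sql.functions import monotonically_increasing_id

# one-time write: persist a stable row id with the data
save_location_v2 = "/mnt/workspace/twofitmitch/deterministic_example"  # hypothetical path
train.withColumn("row_id", monotonically_increasing_id()).write.format("delta").save(save_location_v2)

# every read: hash-partition by the persisted id and sort within partitions;
# both steps depend only on the column values, not on the read order
train = (
    spark.read.format("delta").load(save_location_v2)
    .repartition(64, "row_id")
    .sortWithinPartitions("row_id")
)

Whether this is sufficient for LightGBM's distributed dataset construction is something I have not verified.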