
LightGBM stuck at "reduce at LightGBMClassifier.scala:150"

Open OldDreamHunter opened this issue 3 years ago • 11 comments

I have already seen issue https://github.com/Azure/mmlspark/issues/542, but the answer there does not solve my problem.

I have a dataset of nearly 72 GB with 145 columns. My Spark config is:

spark-submit \
--master yarn \
--deploy-mode client \
--executor-memory 15g \
--driver-memory 15g \
--executor-cores 1 \
--num-executors 20 \
--packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
--conf spark.default.parallelism=5000 \
--conf spark.sql.shuffle.partitions=5000 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.memory.storageFraction=0.3 \
--conf spark.executor.memoryOverhead=15g \
--conf spark.driver.maxResultSize=10g

If I reduce the dataset to 24 GB, I can train the model in about 40 minutes. But with the full 72 GB, the training process gets stuck at "reduce at LightGBMClassifier.scala:150" and reports failures such as:

"ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128370 ms"
"java.lang.Exception: Dataset create call failed in LightGBM with error: Socket recv error, code: 104"
"java.net.ConnectException: Connection refused"

AB#1188553

OldDreamHunter avatar May 20 '21 02:05 OldDreamHunter

hi @OldDreamHunter sorry about the trouble you are having. Have you tried increasing the socket timeout: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L47 What are the parameters to lightgbm?

imatiach-msft avatar May 24 '21 22:05 imatiach-msft
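
To make the suggestion concrete, here is a minimal PySpark sketch. The timeout parameter is the socket timeout from the linked LightGBMParams source; the import path and the 1200-second value are assumptions for the mmlspark 1.0.0-rc1 package used above, not something prescribed in this thread.

from mmlspark.lightgbm import LightGBMClassifier  # assumed import path for mmlspark 1.0.0-rc1

# Raise the socket timeout used while the distributed training network is
# set up (units and default are defined in the linked LightGBMParams
# source); 1200.0 here is an illustrative value only.
lgb = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    timeout=1200.0,
)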

Thanks for your reply @imatiach-msft. I haven't increased the socket timeout yet and will try it. The parameters of my model are listed below.

lgb = LightGBMClassifier(
    objective="binary",
    boostingType="gbdt",
    isUnbalance=True,
    featuresCol="features",
    labelCol="label",
    maxBin=64,
    earlyStoppingRound=100,
    learningRate=0.5,
    maxDepth=6,
    numLeaves=48,
    lambdaL1=0.8,
    lambdaL2=45.0,
    baggingFraction=0.7,
    featureFraction=0.7,
    numIterations=200,
)

OldDreamHunter avatar May 25 '21 08:05 OldDreamHunter

hi @imatiach-msft, I have increased the timeout and changed the parallelism type to "voting_parallel", but the job still fails at "reduce at LightGBMBase.scala:230" with the failure reason: "Job aborted due to stage failure: Task 8 in stage 4.0 failed 4 times, most recent failure: Lost task 8.3 in stage 4.0 (TID 6027, pro-dchadoop-195-81, executor 22): java.net.ConnectException: Connection refused (Connection refused)"

lgb = LightGBMClassifier(
    boostingType="gbdt",
    isUnbalance=True,
    featuresCol="features",
    labelCol="label",
    maxBin=64,
    earlyStoppingRound=100,
    learningRate=0.5,
    maxDepth=5,
    numLeaves=32,
    lambdaL1=7.0,
    lambdaL2=7.0,
    baggingFraction=0.7,
    featureFraction=0.7,
    numIterations=200,
    parallelism="voting_parallel",
    timeout=120000.0,
)

OldDreamHunter avatar May 26 '21 01:05 OldDreamHunter

@OldDreamHunter I think that is a red herring; the real error is on one of the other nodes. Can you send all of the unique task error messages? Please ignore the connection-refused error.

imatiach-msft avatar May 26 '21 04:05 imatiach-msft

You can also try setting useBarrierExecutionMode=True; I think it might give a better error message.

imatiach-msft avatar May 26 '21 04:05 imatiach-msft
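
A one-line sketch of the barrier-mode suggestion, under the same assumptions as the sketch above (only the useBarrierExecutionMode parameter is taken from the comment):

# Barrier execution mode schedules all training tasks together, so a
# failing worker tends to surface a clearer error than a socket exception.
lgb = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    useBarrierExecutionMode=True,
)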

I would only use voting_parallel if you have a high number of features; see the guide: https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html

[screenshot of the parallelism-selection table from the LightGBM Parallel Learning Guide linked above]

imatiach-msft avatar May 26 '21 04:05 imatiach-msft

Same problem here. Everything works fine when I reduce the amount of training data.

icankeep avatar Jun 01 '21 09:06 icankeep

Same problem. Voting parallel works, but the accuracy is very low; much of the data is skipped.

Simon-LLong avatar Dec 30 '21 09:12 Simon-LLong

@Simon-LLong sorry about the problems you are encountering. Indeed Voting Parallel can give lower accuracy, but with much better speedup and lower memory usage.

Can you also please try the new mode: useSingleDatasetMode=True with numThreads = (num cores - 1)? These two PRs should resolve this:

#1222 #1282

In performance testing we saw a big speedup with the new single dataset mode and numThreads set to (num cores - 1), as well as lower memory usage. The two PRs above will be available in 0.9.5, or you can get them with the latest build right now. In 0.9.5 these params will be set by default, but in earlier versions, like the currently released 0.9.4, you can set them directly.

For more information on the new single dataset mode please see the PR description: #1066

This new mode was created after extensive internal benchmarking.

I have some ideas on how a streaming mode could also be added to distributed LightGBM, where data is streamed into the native histogram-binned representation, which should take only a small fraction of the memory the full Spark dataset uses when everything is loaded at once. It might be a little slower to set up, but it should vastly reduce memory usage. This is something I will be looking into in the near future.

imatiach-msft avatar Dec 30 '21 16:12 imatiach-msft
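
For concreteness, a minimal sketch of those settings against the then-current 0.9.4 SynapseML Python API (the synapse.ml import path follows the package rename; the 16-core executor count is an assumed example, not a value from this thread):

from synapse.ml.lightgbm import LightGBMClassifier

executor_cores = 16  # assumption: match your spark.executor.cores setting

# useSingleDatasetMode builds one shared LightGBM dataset per executor
# rather than one per task; numThreads = cores - 1 leaves a core for Spark.
lgb = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    useSingleDatasetMode=True,
    numThreads=executor_cores - 1,
)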

"numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores." Is this the number of cores on my executor node, the number of cores in my executor, or the number of cores in my cluster?

nitinmnsn avatar Feb 20 '22 05:02 nitinmnsn