
Why is there almost no speedup with distributed learning?

shuDaoNan9 opened this issue 2 years ago • 5 comments

First, I tried two Spark slaves; it took about 11 minutes to train my model. Submit info: spark-submit --master yarn --num-executors 2 --executor-memory 19G --executor-cores 16 --conf spark.dynamicAllocation.enabled=false --jars s3://EMR/jars/synapseml-vw_2.12-0.9.4.jar,s3://EMR/jars/synapseml_2.12-0.9.4.jar,s3://EMR/jars/client-sdk-1.14.0.jar ...... [screenshot]

Second, I tried only one Spark slave; it took about 12 minutes to train my model. Submit info: spark-submit --master yarn --num-executors 1 --executor-memory 19G --executor-cores 16 --conf spark.dynamicAllocation.enabled=false --jars s3://EMR/jars/synapseml-vw_2.12-0.9.4.jar,s3://EMR/jars/synapseml_2.12-0.9.4.jar,s3://EMR/jars/client-sdk-1.14.0.jar ...... [screenshot]

My results show that LightGBM barely speeds up with distributed learning, even though CPU utilization is above 95% on each Spark slave. Why is there almost no speedup?

My cluster/data/code info: Spark slaves: 2 × (16 vCores, 32 GiB); versions: Spark 3.1.2, Hive 3.1.2, ZooKeeper 3.5.7; dependency: synapseml_2.12-0.9.4.jar; training data: 5377937 rows; code:

val classifier = new LightGBMClassifier()
  .setLabelCol("play")
  .setObjective("binary")
  .setCategoricalSlotNames(Array("countrycode_index", "itemID_index", "uid_index"))
  .setUseBarrierExecutionMode(true)
  .setFeaturesCol("gbdtFeature")
  .setPredictionCol("predictPlay")
  .setNumIterations(trees)
  .setNumLeaves(32)
  .setLearningRate(0.006)
  .setProbabilityCol("probabilitys")
  .setEarlyStoppingRound(200)
  .setBoostingType("gbdt")
  .setLambdaL2(0.002)
  .setMaxDepth(24)

val lgbmModel = classifier.fit(lgbmTrainDF.repartition(repartNum)) // repartNum = the number of Spark slaves

Thanks in advance!

shuDaoNan9 avatar Dec 17 '21 08:12 shuDaoNan9

hi @JWenBin can you please try:

useSingleDatasetMode = true
numThreads = (number of cores) - 1

These two PRs should resolve this:

https://github.com/microsoft/SynapseML/pull/1222 https://github.com/microsoft/SynapseML/pull/1282

In performance testing we saw a big speedup with the new single dataset mode and numThreads set to the number of cores minus one. The two PRs above will be available in 0.9.5, or you can get them with the latest build right now.

For more information on the new single dataset mode please see the PR description: https://github.com/microsoft/SynapseML/pull/1066

This new mode was created after extensive internal benchmarking.
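
In code form, the suggestion above maps to the following minimal sketch against the SynapseML 0.9.x Scala API (the thread count assumes the 16-vCore executors described in the issue; carry over your remaining parameters unchanged):

import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier

// Sketch: the two suggested settings applied to a LightGBMClassifier.
// useSingleDatasetMode shares one native LightGBM dataset per executor
// instead of one per task; numThreads leaves a core for the executor JVM.
val classifier = new LightGBMClassifier()
  .setLabelCol("play")
  .setObjective("binary")
  .setFeaturesCol("gbdtFeature")
  .setUseSingleDatasetMode(true)
  .setNumThreads(16 - 1) // num cores - 1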

imatiach-msft avatar Dec 27 '21 16:12 imatiach-msft

Thank you for your reply! I tried that just now; the speed improved a lot, but AUC and accuracy became too low (less than 0.6). It looks like if I use 'setUseSingleDatasetMode(true)', I need to change my other parameters at the same time.

shuDaoNan9 avatar Jan 05 '22 08:01 shuDaoNan9

hi @JWenBin "the speed improved a lot, but AUC and accuracy becomes too low (less than 0.6)" That is very interesting. In our benchmarking this didn't affect accuracy at all. It only affected memory usage and execution time. I wonder why this might be causing AUC and accuracy to get worse. Essentially we are just reducing the number of datasets being run in parallel, and using more multithreading within each machine while reducing inter-process communication.

imatiach-msft avatar Jan 05 '22 14:01 imatiach-msft

I deleted 'setUseBarrierExecutionMode(true)' while keeping 'setUseSingleDatasetMode(true)' and retrained the model; my AUC returned to its normal level. But I still don't know how 'setUseBarrierExecutionMode(true)' interacts with 'setUseSingleDatasetMode(true)' during training. I also found that vector features from 'Word2VecModel' may make AUC worse when 'setUseSingleDatasetMode(true)' is set, and I don't know how single dataset mode affects vector features during training either. Thank you very much! The change is sketched below.
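
As a sketch, the change that restored AUC looks like this (only the setters relevant to the change are shown; the other parameters stay as in the original code above):

// Working combination: single dataset mode on, barrier execution mode off.
val classifier = new LightGBMClassifier()
  .setLabelCol("play")
  .setObjective("binary")
  .setFeaturesCol("gbdtFeature")
  .setUseSingleDatasetMode(true) // kept: large speedup
  // .setUseBarrierExecutionMode(true) // removed: combined with single
  //                                   // dataset mode it degraded AUC here
  .setNumThreads(16 - 1)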

shuDaoNan9 avatar Jan 06 '22 08:01 shuDaoNan9

Each task's input data is about 48 MB ± 2 MB during training (with 'setUseBarrierExecutionMode(true)'), but the Spark history server indicates that for a long time only 575989 of 26320507 rows were being trained. [screenshot]

shuDaoNan9 avatar Jan 12 '22 06:01 shuDaoNan9