Why is the training speed of LightGBM on Spark so slow compared with the local version?

Open janelu9 opened this issue 5 years ago • 13 comments

How to improve the speed of lightgbm on spark?

janelu9 avatar Nov 29 '19 03:11 janelu9

+1 on this question.

My training data has 600k rows and fewer than 100 columns. I tried running on Spark with the mmlspark package v1.0 (1 driver node, 1 executor node, 8 cores), and the Spark version takes significantly longer to run through one boosting iteration than the local LightGBM package. Performance per boosting iteration is even worse with multiple executor nodes.

Would love to learn the reason for this, and hopefully get some pointers to get around the issue.

bobbychen2000 avatar Mar 24 '20 00:03 bobbychen2000

@imatiach-msft Do you mind helping explain this?

Also wondering if there's a way to use the Spark interface but trigger local training, if distributed training is indeed slow (because of, say, network communication)? I've tried forcing the partition count to 1 and limiting numCoresPerExec, but it doesn't seem to help.

bobbychen2000 avatar Apr 09 '20 20:04 bobbychen2000

You can try using pandas_udf() with the Python version of lightgbm for training. It is much faster, and you can still parallelize if you are doing hyperparameter tuning.
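For concreteness, here is a minimal sketch of that approach (not from this thread). It assumes Spark 3.x (applyInPandas, the grouped-map pandas UDF) plus the plain Python lightgbm package; the `df` DataFrame, column names, and parameter grid are hypothetical. Each group gets a full copy of the training data and one candidate parameter value, so the tuning runs in parallel while each individual model trains locally on one executor.

```python
import lightgbm as lgb
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

result_schema = StructType([
    StructField("num_leaves", IntegerType()),
    StructField("valid_rmse", DoubleType()),
])

def train_one_config(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group holds the whole training set plus one candidate num_leaves value.
    num_leaves = int(pdf["num_leaves"].iloc[0])
    data = pdf.drop(columns=["num_leaves"])
    train = data.sample(frac=0.8, random_state=0)
    valid = data.drop(train.index)
    booster = lgb.train(
        {"objective": "regression", "num_leaves": num_leaves, "verbosity": -1},
        lgb.Dataset(train.drop(columns=["label"]), label=train["label"]),
        num_boost_round=100,
    )
    pred = booster.predict(valid.drop(columns=["label"]))
    rmse = float(((pred - valid["label"]) ** 2).mean() ** 0.5)
    return pd.DataFrame({"num_leaves": [num_leaves], "valid_rmse": [rmse]})

# df is a (hypothetical) Spark DataFrame with feature columns and a "label" column;
# cross-join it with the parameter grid, then train one local model per group.
param_grid = spark.createDataFrame([(31,), (63,), (127,)], ["num_leaves"])
results = (
    df.crossJoin(param_grid)
      .groupBy("num_leaves")
      .applyInPandas(train_one_config, schema=result_schema)
)
results.show()
```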

louis925 avatar Apr 25 '20 04:04 louis925

> +1 on this question.
>
> My training data has 600k rows and fewer than 100 columns. I tried running on Spark with the mmlspark package v1.0 (1 driver node, 1 executor node, 8 cores), and the Spark version takes significantly longer to run through one boosting iteration than the local LightGBM package. Performance per boosting iteration is even worse with multiple executor nodes.
>
> Would love to learn the reason for this, and hopefully get some pointers to get around the issue.

I have the same problem!!!

siyuzhiyue avatar Sep 22 '20 10:09 siyuzhiyue

@siyuzhiyue have you tried setting numTasks = num machines, so one task runs on each machine? What is your cluster configuration, and what are your lightgbm parameters? I've found that for some large lightgbm parameter values, reducing numTasks to equal the total number of machines helps reduce memory load and sometimes improves performance. Not sure if it will help in your case, though.
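As a rough sketch of that suggestion (the estimator, column names, and cluster size below are illustrative, not taken from this issue), on a cluster with 4 worker machines it would look like:

```python
from synapse.ml.lightgbm import LightGBMRegressor

num_machines = 4  # assumed number of worker machines in the cluster

model = LightGBMRegressor(
    labelCol="label",
    featuresCol="features",
    numTasks=num_machines,  # cap LightGBM at one task per machine
)
```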

imatiach-msft avatar May 20 '21 03:05 imatiach-msft

Hi @imatiach-msft, my case is similar. With one executor the training time is 5 minutes; with two executors it is a bit over 5 minutes but very close; with four executors it is about 15 minutes.

My environment: Spark 3.2.0, SynapseML 0.9.4, instance type m5.9xlarge (18 cores, 36 threads, 72 GB memory).

My Spark config is:

--driver-memory 4g --executor-cores 34 --num-executors {executor_num} --executor-memory 50g

My LightGBM config is:

```python
from synapse.ml.lightgbm import LightGBMRegressor

model = LightGBMRegressor(
    objective='quantile',
    labelCol='sold_rolling',
    alpha=0.2,
    learningRate=0.3,
    metric='quantile',
    featureFraction=0.5,
    numLeaves=2**10 - 1,
    minDataInLeaf=2**6 - 1,
    maxDepth=-1,
    maxBin=500,
    numIterations=1000,
    boostFromAverage=True,
    # verbosity=-1,
    verbosity=2,
    earlyStoppingRound=100,
    useSingleDatasetMode=False,  # I have tried setting this to True, but performance is worse
    # chunkSize=5000,
    numThreads=17,  # real CPU cores: 18, hyper-threading: 36
    numTasks={executor_num},
)
```

And I see some logs that may be helpful for you:

```
4 Nodes: [LightGBM] [Info] Finished linking network in 737.569538 seconds
2 Nodes: [LightGBM] [Info] Finished linking network in 109.627021 seconds
```

Maybe it is network communication time; I am not sure.

zhangpanfeng avatar Mar 11 '22 02:03 zhangpanfeng

I have tried SynapseML 0.9.5; for my case it is similar to, or even worse than, 0.9.4.

zhangpanfeng avatar Mar 11 '22 02:03 zhangpanfeng

@zhangpanfeng how large is your dataset? Indeed, for very small datasets the network communication overhead will actually make training slower; this is well known. If you have a small dataset, it doesn't make sense to use Spark or distributed compute, since the network communication overhead will usually just make things slower.

imatiach-msft avatar Mar 11 '22 14:03 imatiach-msft

@zhangpanfeng also, your driver memory is really low; I would recommend a higher value, based on:

--driver-memory 4g --executor-cores 34 --num-executors {executor_num} --executor-memory 50g

Also, this setting is a bit strange. You have 18 cores and 72 GB of memory per machine, yet you are using 34 cores overall (which is just 2 machines). I think you can increase executor cores and decrease memory to run more tasks on your cluster. Does each task use 1 core? That would speed things up a lot, since you would use many more cores.
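For illustration only, a hypothetical adjustment of those flags along those lines (the values are guesses, not a tested recommendation): more driver memory, and executor cores sized so that each Spark task can use one physical core:

--driver-memory 16g --executor-cores 17 --num-executors {executor_num} --executor-memory 50g --conf spark.task.cpus=1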

imatiach-msft avatar Mar 11 '22 15:03 imatiach-msft

Hi @imatiach-msft, thank you. But as you know, the value of --executor-cores counts vCores, not physical CPU cores, whereas numThreads in SynapseML refers to physical CPU cores. How do I make sure they match each other, and how should I set numTasks in SynapseML?

> Does each task use 1 core? That would speed things up a lot, since you would use many more cores.

The official LightGBM documentation recommends setting numThreads to (physical CPU cores - 1) in a distributed environment. So does "each task uses 1 core" mean I should run 17 tasks per executor and set numThreads=1, so each task gets one physical core? Or is that 1 core a vCore, so I should run 34 tasks per executor, and in that case how should I set numThreads?
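To make the two alternatives concrete, here is a hypothetical side-by-side (the executor count, label column, and parameter values are illustrative only, not a recommendation from this thread):

```python
from synapse.ml.lightgbm import LightGBMRegressor

executor_num = 4  # assumed number of executors

# (a) one physical core per Spark task: many tasks, one LightGBM thread each
many_small_tasks = LightGBMRegressor(
    labelCol="sold_rolling",
    numTasks=17 * executor_num,
    numThreads=1,
)

# (b) one Spark task per executor; LightGBM itself uses (physical cores - 1) threads
one_task_per_executor = LightGBMRegressor(
    labelCol="sold_rolling",
    numTasks=executor_num,
    numThreads=17,
)
```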

zhangpanfeng avatar Mar 28 '22 13:03 zhangpanfeng

Checking in to see if anyone has made any new discoveries. I recently started experimenting with this on Databricks, but I see only up to about 20% of the CPUs in use at any one time, even after tweaking numThreads and repartitioning the training data...

timmhuang avatar Oct 25 '22 00:10 timmhuang

In local mode, training maxes out the local machine's CPU at 100% utilization, so it is fast. With distributed training, if you log in to each slave node and check CPU utilization, you will find it is very low (Spark 3 improves this somewhat); you can't just look at the parallelism shown in the Spark history server. The more finely the work is split, the greater the communication overhead.

shuDaoNan9 avatar Oct 31 '22 08:10 shuDaoNan9