
Errors Training on Large Dataset: LightGBM

Open BrianMiner opened this issue 2 years ago • 13 comments

I'm trying to fit a classifier with LightGBM on a large dataset. There are about 900,000,000 rows and 40 columns, 7 of which are integers being treated as categorical.

  • SynapseML Version: com.microsoft.azure:synapseml_2.12:0.9.5
  • Spark Version: 3.2
  • Spark Platform: AWS EMR

The current cluster is:

6 workers with 16 vCPUs and 64 GB RAM each.

The call to LightGBM is as follows:

from synapse.ml.lightgbm import LightGBMClassifier

lgb_estimator = LightGBMClassifier(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    categoricalSlotNames=["cat1", "cat2", "cat3", "cat4", "cat5", "cat6", "cat7"],
    numLeaves=31,
    probabilityCol="probs",
    featuresCol="features",
    labelCol="target",
    useBarrierExecutionMode=True,
)

lgbmModel = lgb_estimator.fit(df_train)

I have had various errors running on a 50% sample, but it completes with a much smaller sample. I switched to useBarrierExecutionMode=True, which resulted in out-of-disk-space errors, so I increased the volume size on all the workers. Now I get errors that don't seem too helpful:

An error occurred while calling o109.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(9, 3) finished unsuccessfully.
java.net.ConnectException: Connection refused (Connection refused)

Or, when not using useBarrierExecutionMode=True, something like:

An error occurred while calling o284.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 8696) (ip-10-0-188-112.ec2.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)

My questions:

  1. Can this size of cluster support training LightGBM on such a large dataset? My naïve thought was that it could, but would be slow.
  2. When dealing with large datasets, any recommendations on how to set Spark properties that may help?
  3. Any suggestions on the cluster size needed to run this data?

AB#1833527

BrianMiner avatar Jun 16 '22 15:06 BrianMiner

@BrianMiner currently LightGBM loads all data into both Java and native memory, but @svotaw is working on a streaming mode that will allow LightGBM to stream the Java-side data into a native binned representation, which should be a fraction of the current memory usage. You can find his current PR to the LightGBM native code here:

https://github.com/microsoft/LightGBM/pull/5291

Once that PR is merged, there will be a follow-up PR in this SynapseML repository. We would appreciate any comments on this new memory-reduction feature, and any testing you could do with it. Otherwise, the best solutions for now are to increase the cluster size, sample the data, or run in batches (you can specify the number of batches, as sketched below), although batching will decrease accuracy. I would also recommend using the latest build from master instead of 0.9.5, since several important bug fixes have been merged since then.
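
If batching helps in your case, it is set directly on the estimator. A minimal sketch reusing the columns from the original post (the numBatches value is purely illustrative; the parameter name comes from the SynapseML LightGBM params):

from synapse.ml.lightgbm import LightGBMClassifier

# Sketch: numBatches > 0 trains on the data in that many sequential chunks,
# trading some accuracy for a lower peak memory footprint per batch.
lgb_estimator = LightGBMClassifier(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    featuresCol="features",
    labelCol="target",
    numBatches=4,  # illustrative value; tune to your cluster
)
lgbmModel = lgb_estimator.fit(df_train)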

Maven Coordinates com.microsoft.azure:synapseml_2.12:0.9.5-121-3e23f023-SNAPSHOT

Maven Resolver https://mmlspark.azureedge.net/maven
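
A sketch of how a snapshot build like the one above can be attached when the Spark session is created (the package and repository values come from the coordinates above; everything else is an assumption about your setup, and on EMR the same settings are often passed as spark-submit options instead):

from pyspark.sql import SparkSession

# Sketch: resolve the SynapseML snapshot from the custom Maven repository.
spark = (
    SparkSession.builder
    .appName("lightgbm-large-training")
    .config("spark.jars.packages",
            "com.microsoft.azure:synapseml_2.12:0.9.5-121-3e23f023-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)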

imatiach-msft avatar Jun 16 '22 16:06 imatiach-msft

Also, it is actually possible that you are running into some error other than a memory error. I would recommend trying the latest code. You can also find the error message by looking at the executor logs; there should be an error for each of the workers. The "Connection refused" error is a red herring; one of the workers will have an error message that should be more useful than that.

imatiach-msft avatar Jun 16 '22 16:06 imatiach-msft

I'll try the new code. Does the memory usage described above suggest that a certain configuration (cluster size and Spark properties to allocate the resources) is currently needed for this data? Is that somehow knowable?

BrianMiner avatar Jun 16 '22 16:06 BrianMiner

I would know this much better if I could see the cluster logs. Note the build above is just the latest master; it doesn't yet include the new optimizations. I wrote to @svotaw, and he said he will be able to send a PR by Monday, so sometime next week we can send you a build with the new streaming optimizations to try out and see if it helps prevent the error.

imatiach-msft avatar Jun 16 '22 17:06 imatiach-msft

I am a novice with Spark, I'm afraid, but I found some references to what sounds like a memory issue. This was while trying to run a 50% sample on a new, larger cluster:

10 workers with 32 vCores and 128 GiB memory each.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4f3d9f6000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4f446d3000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4f5075c000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)

BrianMiner avatar Jun 16 '22 19:06 BrianMiner

If I bump up the executor memory to 64 GB, I can train on 50% of the rows (450 million). I noticed that if this last step has 4 tasks or fewer, it will crash; if the number of tasks is larger, it likely succeeds.

(screenshot of the Spark UI stage view)
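
A minimal sketch of one way to force more tasks for that final stage, repartitioning before the fit (the partition count is illustrative only, not a recommendation):

# Sketch: more partitions means more, smaller tasks, so each task holds a
# smaller slice of the 450M-row dataset. 400 is an arbitrary example value.
df_train_repart = df_train.repartition(400)
lgbmModel = lgb_estimator.fit(df_train_repart)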

BrianMiner avatar Jun 16 '22 20:06 BrianMiner

Interesting, it might just be an issue with how much memory you have assigned to executors. Maybe you have too little memory assigned to each executor, which would explain why you can run on more data when you have more tasks. I'm not sure how your cluster is configured, but from the UI it doesn't look like you are using a Databricks cluster or a Synapse cluster, which are the two I am most familiar with.
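
For illustration, executor sizing is usually set when the application starts. A minimal sketch with placeholder values (on EMR these are more commonly passed as spark-submit --conf options than set in code):

from pyspark.sql import SparkSession

# Sketch: give each executor a large JVM heap plus off-heap headroom for
# LightGBM's native allocations. All values below are placeholders.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "48g")          # JVM heap per executor
    .config("spark.executor.memoryOverhead", "16g")  # native / off-heap headroom
    .config("spark.executor.cores", "8")
    .getOrCreate()
)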

imatiach-msft avatar Jun 17 '22 21:06 imatiach-msft

I had 64 to 128 GB assigned to executors during various tests. I am using EMR.

Looking forward to the new memory-efficient version!

BrianMiner avatar Jun 18 '22 01:06 BrianMiner

@imatiach-msft just checking to see if this has been made available yet?

BrianMiner avatar Jun 28 '22 17:06 BrianMiner

@BrianMiner no, sorry. @svotaw is working on this; he was running into some segfaults that he just fixed at the end of last week, and he is on summer vacation this week. I'm really hoping we can get a POC out soon for people to try out and give feedback on, but @svotaw still needs to do a lot of testing on the new streaming code he wrote.

imatiach-msft avatar Jul 05 '22 14:07 imatiach-msft

Sorry, we are trying to coordinate PRs into both the LightGBM repo and the SynapseML repo. I will update this issue when there is something to test. The LightGBM PR has been up for a couple of weeks, and part 1 of the SynapseML PR will go up today. It will likely still be a week or two before we can give you something to test.

svotaw avatar Jul 08 '22 20:07 svotaw

It is a large set of changes to both the LightGBM and SynapseML code, so I have been splitting it up to make reviewing easier. Unfortunately, this slows down the process of checking it in.

svotaw avatar Jul 09 '22 19:07 svotaw

@imatiach-msft @svotaw If I understand correctly, natively allocated memory isn't counted by the JVM at all, and this may lead to OOM errors when running on some resource managers (Kubernetes, for instance) due to "memory overcommitting" by Spark executors.

I mean that when an executor is created via Kubernetes, the pod's memory limit and the JVM heap (-Xmx) are usually the same (except for some minor memory overhead added to the pod's limit). If I'm not mistaken, at some point the JVM heap may grow up to -Xmx (even though that executor memory is not necessarily filled with actual data at the moment, it is still allocated and counted as consumed by the OS and the resource manager), and executing LightGBM with useSingleDatasetMode=True after that would lead to a cumbersome OOM error at the Kubernetes level.

Please correct me if I'm wrong. But if I'm correct, could you please give some advice on how to handle such situations? Maybe specific JVM settings can help here?
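
For reference, the relationship in question can be sketched with placeholder sizes: on Kubernetes the pod limit is roughly the executor memory plus the executor memory overhead, so a larger overhead leaves room outside the heap for native allocations. This is an illustration of the standard Spark knob, not a verified fix:

from pyspark.sql import SparkSession

# Sketch: pod memory limit ~= spark.executor.memory + spark.executor.memoryOverhead,
# so a larger overhead leaves headroom for LightGBM's native (off-heap) data.
# All sizes below are placeholders.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "32g")          # JVM heap (-Xmx)
    .config("spark.executor.memoryOverhead", "24g")  # native headroom inside the pod
    .getOrCreate()
)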

fonhorst avatar Mar 06 '23 09:03 fonhorst