
Facing error while training with large dataset: std::bad_alloc

Open sundeepks opened this issue 3 years ago • 4 comments

I am intermittently hitting std::bad_alloc while training with the configuration below. Can you please advise?

Cluster configuration:

- Number of executors: 30
- Cores: 12 (16 vCPUs total)
- RAM: 128 GB
- Executor memory: tried 48 GB and 64 GB
- Driver memory: 64 GB

Dataset:

- Number of events: 250 million
- Number of features: 2010 columns
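As a rough sanity check on the cluster sizing above, a back-of-envelope estimate of the dataset's in-memory footprint (my own arithmetic, not from the thread; assumes dense float64 features before LightGBM's histogram binning, and roughly one byte per value after binning with the default max_bin of 255):

```python
# Back-of-envelope memory estimate for 250M rows x 2010 features.
rows = 250_000_000
features = 2010

raw_gib = rows * features * 8 / 1024**3    # one dense float64 copy
binned_gib = rows * features * 1 / 1024**3 # ~1-byte histogram bins in LightGBM
cluster_gb = 30 * 128                      # 30 executors x 128 GB RAM

print(f"dense float64 copy: ~{raw_gib:,.0f} GiB")   # roughly 3,744 GiB
print(f"binned dataset:     ~{binned_gib:,.0f} GiB") # roughly 468 GiB
print(f"total cluster RAM:  {cluster_gb:,} GB")      # 3,840 GB
```

Even one extra dense copy of the features is close to the cluster's total RAM, so any intermediate materialization on top of LightGBM's own binned representation can plausibly push executors past their limits.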

Model configuration:

- useSingleDatasetMode=True
- numLeaves=512
- featureFraction=0.8
- numIterations=1024
- useBarrierExecutionMode=True
- validationIndicatorCol="validation" (0.4 million validation records, which fit easily in driver memory)

Version: com.microsoft.azure:synapseml_2.12:0.9.5
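For reference, the configuration above corresponds roughly to the following PySpark setup (a sketch only; the "features"/"label" column names and the commented fit call are illustrative placeholders, not from the thread):

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Mirrors the parameters reported in the issue; featuresCol/labelCol
# are hypothetical placeholders for the actual column names.
model = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    useSingleDatasetMode=True,
    numLeaves=512,
    featureFraction=0.8,
    numIterations=1024,
    useBarrierExecutionMode=True,
    validationIndicatorCol="validation",
)
# trained = model.fit(train_df)  # train_df: Spark DataFrame with the columns above
```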

22/07/03 13:11:12 INFO MemoryStore: Block broadcast_130_piece105 stored as bytes in memory (estimated size 4.0 MiB, free 30.4 GiB)
22/07/03 13:11:12 INFO MemoryStore: Block broadcast_130_piece151 stored as bytes in memory (estimated size 4.0 MiB, free 30.3 GiB)
22/07/03 13:11:12 INFO TorrentBroadcast: Reading broadcast variable 130 took 2395 ms
22/07/03 13:11:16 INFO MemoryStore: Block broadcast_130 stored as values in memory (estimated size 1884.0 MiB, free 28.5 GiB)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

AB#1855963

sundeepks avatar Jul 03 '22 13:07 sundeepks

We are currently working on a big refactor of the SynapseML LightGBM wrapper to better handle memory and large datasets. Memory is currently not well utilized and can cause OOM errors. The new "streaming" mode will use very little memory on top of what LightGBM itself requires. PRs are up, although a testable version might not be available for a while. We are coordinating changes in both the LightGBM native library (not this team; see microsoft/LightGBM) and this SynapseML Scala Spark wrapper.

svotaw avatar Jul 09 '22 19:07 svotaw

@svotaw Great, I'd like to try this out, since currently the memory allocated has to be more than twice the size of the dataset for training to succeed. Any idea when it will be available to try?

sundeepks avatar Jul 12 '22 10:07 sundeepks

It will be a few weeks. Much of that is because we require changes to the LightGBM native SDK, so we depend on that team approving a pending PR with our changes (the PR has been up for a month now). Once those changes are in, we can make the official check-in on the SynapseML side to take advantage of the new APIs. I will update here when I have a better timeline. The new refactor takes an order of magnitude less memory.

svotaw avatar Jul 12 '22 17:07 svotaw

Thanks a lot. Keep me posted once it's available; I'll be happy to test it out.

sundeepks avatar Jul 18 '22 06:07 sundeepks