SynapseML

LightGBM: Inconsistency detected by ld.so

Open sundeepks opened this issue 2 years ago • 3 comments

I am seeing the intermittent error below while training a LightGBM model:

``Inconsistency detected by ld.so: ../elf/dl-tls.c: 488: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!``

Cluster: No. of executors: 30; No. of cores: 12 (16 vCPUs total); RAM: 128 GB; Executor memory: 64 GB; Driver memory: 64 GB

Data: No. of events: 250 million; No. of features: 2010 columns

Model configuration: `useSingleDatasetMode=True`, `numLeaves=512`, `featureFraction=0.8`, `numIterations=1024`, `useBarrierExecutionMode=True`, `validationIndicatorCol="validation"` (0.4 million validation records, which fit easily in driver memory)

Version: `com.microsoft.azure:synapseml_2.12:0.9.5`
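For anyone trying to reproduce this, the reported configuration could be expressed roughly as follows. This is only a sketch: the parameter values come from the report above, but the choice of `LightGBMClassifier` (rather than the regressor or ranker), the helper name `build_estimator`, and the `train_df` DataFrame in the usage note are assumptions, since the report does not state them.

```python
# Parameter values copied from the issue report; all other settings
# are left at the library defaults.
REPORTED_PARAMS = {
    "useSingleDatasetMode": True,
    "numLeaves": 512,
    "featureFraction": 0.8,
    "numIterations": 1024,
    "useBarrierExecutionMode": True,
    "validationIndicatorCol": "validation",
}


def build_estimator(params=None):
    """Build a LightGBM estimator from the reported parameters.

    Assumption: a classification task. The report does not say which
    estimator type was used.
    """
    # Imported lazily so the parameter dict can be inspected on a machine
    # without Spark or SynapseML installed.
    from synapse.ml.lightgbm import LightGBMClassifier

    return LightGBMClassifier(**(params or REPORTED_PARAMS))
```

With a Spark session that has the SynapseML package loaded, `build_estimator().fit(train_df)` would start training, where `train_df` is a hypothetical DataFrame holding the feature vector, label, and boolean `validation` columns.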

AB#1855966

sundeepks avatar Jul 03 '22 14:07 sundeepks

This is not something we recognize or are familiar with. A brief investigation shows that there are known concurrency issues in glibc, so maybe try upgrading your glibc version? https://sourceware.org/bugzilla/show_bug.cgi?id=19329

Note that we are currently working on a big refactor of the SynapseML LightGBM wrapper to better handle memory and large datasets. PRs are up, although a testable version might not be available for a while. We are coordinating changes in both the LightGBM native library (not this team, see microsoft/LightGBM) and this SynapseML Scala Spark wrapper.

If you'd like us to comment more, please add more context. That appears to be an error in the native layer, so possibly the native LightGBM library or a dependency?

svotaw avatar Jul 08 '22 20:07 svotaw

Yes, it looks like a thread concurrency bug (a race condition) in glibc that was fixed in 2021, so I wonder if upgrading might fix it:

related issue: https://github.com/puppeteer/puppeteer/issues/2207

same link as @svotaw pasted above: https://sourceware.org/bugzilla/show_bug.cgi?id=19329

Or perhaps some change could be made in the native LightGBM code to avoid the race condition in the native calls. Is there some way for us to reproduce this bug? Are you able to see it on a smaller sample of the dataset?

imatiach-msft avatar Jul 08 '22 20:07 imatiach-msft

I am currently using Amazon EMR on Amazon Linux release 2 (Karoo), and the output of `ldd --version` is below. Can you please tell me of any other OS type or version with a higher glibc version that is compatible with SynapseML, so that I can get rid of this intermittent error? Since it is intermittent, the issue is not easy to reproduce.

ldd --version 
ldd (GNU libc) 2.26
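To compare a cluster image against the glibc fix discussed above, a small check like the following might help. It is a sketch under one loud assumption: that the fix for the linked bug shipped in glibc 2.34 (released in 2021). Please verify that against the bug tracker; the threshold is a parameter for that reason. The function names are made up for illustration.

```python
import re


def parse_glibc_version(ldd_first_line):
    """Extract (major, minor) from a line such as 'ldd (GNU libc) 2.26'."""
    match = re.search(r"(\d+)\.(\d+)\s*$", ldd_first_line.strip())
    if match is None:
        raise ValueError(f"no glibc version found in: {ldd_first_line!r}")
    return int(match.group(1)), int(match.group(2))


def has_tls_race_fix(version, fixed_in=(2, 34)):
    """Return True if `version` is at least `fixed_in`.

    Assumption: (2, 34) is the first release containing the fix.
    """
    return version >= fixed_in


reported = parse_glibc_version("ldd (GNU libc) 2.26")
print(reported, has_tls_race_fix(reported))
```

On a live node, `platform.libc_ver()` from the Python standard library reports the libc that the running interpreter is linked against, which avoids parsing `ldd` output by hand.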

sundeepks avatar Jul 12 '22 16:07 sundeepks