SynapseML
LightGBM: Inconsistency detected by ld.so
I am hitting the intermittent error below while training a LightGBM model:
Inconsistency detected by ld.so: ../elf/dl-tls.c: 488: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
Number of executors: 30
Number of cores: 12 (16 vCPUs total)
RAM: 128 GB
Executor memory: 64 GB
Driver memory: 64 GB
Number of events: 250 million
Number of features: 2010 columns
Model configuration: useSingleDatasetMode=True, numLeaves=512, featureFraction=0.8, numIterations=1024, useBarrierExecutionMode=True, validationIndicatorCol="validation" (0.4 million records, which fit easily in driver memory)
Version: com.microsoft.azure:synapseml_2.12:0.9.5
AB#1855966
This is not something we recognize or are familiar with. A brief investigation shows that there are concurrency issues in glibc, so maybe upgrade your version? https://sourceware.org/bugzilla/show_bug.cgi?id=19329
Note that we are currently working on a big refactor of the SynapseML LightGBM wrapper to better handle memory and large datasets. PRs are up, although a testable version might not be available for a while. We are coordinating changes in both the native LightGBM library (not this team, see microsoft/LightGBM) and this SynapseML Scala Spark wrapper.
If you'd like us to comment more, please add more context. That appears to be an error in the native layer, so possibly the native LightGBM library or a dependency?
Yes, it looks like a thread concurrency bug (race condition) in glibc that was fixed in 2021, so I wonder if upgrading might fix it:
related issue: https://github.com/puppeteer/puppeteer/issues/2207
same link as @svotaw pasted above: https://sourceware.org/bugzilla/show_bug.cgi?id=19329
Or perhaps some change could be made in the native LightGBM code so that the native calls avoid the race condition. I wonder if there is some way for us to reproduce this bug? Are you able to see it on any smaller sample of the dataset?
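Since the failure is intermittent, one rough way to look for a reproduction on a smaller sample is to rerun the same job several times and count how often the ld.so message shows up. The sketch below is only an illustration: the helper name `count_failures` is made up here, and the command you pass in (for example a spark-submit invocation of your own training script) is an assumption, not something from this thread.

```python
import subprocess

# Text that ld.so prints when the assertion from this issue fires.
MARKER = "Inconsistency detected by ld.so"


def count_failures(command, attempts):
    """Run `command` `attempts` times and count runs whose output contains MARKER.

    `command` is a list of arguments, as accepted by subprocess.run.
    """
    hits = 0
    for _ in range(attempts):
        result = subprocess.run(command, capture_output=True, text=True)
        # The message normally goes to stderr, but check stdout as well
        # in case the launcher merges the two streams.
        if MARKER in result.stderr or MARKER in result.stdout:
            hits += 1
    return hits
```

For example, `count_failures(["spark-submit", "train_sample.py"], 20)` would report how many of 20 runs hit the assertion, where `train_sample.py` is a hypothetical script that trains on a small slice of the data. On a cluster the message may only appear in executor logs rather than in the driver output, in which case the logs would need to be searched instead.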
I am currently using Amazon EMR on Amazon Linux release 2 (Karoo), and the output of ldd --version is below. Can you please let me know of any other OS type or version with a newer glibc that is compatible with SynapseML, so I can get rid of this intermittent error? Because the error is intermittent, it is not easy to reproduce.
ldd --version
ldd (GNU libc) 2.26
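To compare a node's glibc against the release that carries the fix, the version can be parsed from the first line of the `ldd --version` output. This is a minimal sketch: the helper names are made up for illustration, and the threshold of 2.34 is an assumption based on the bug being described above as fixed in 2021, so please confirm the exact release against the bug report before relying on it.

```python
import re

# Assumption: the fix for glibc bug 19329 shipped in glibc 2.34 (released in 2021).
# Verify this against the bug report; adjust the tuple if it differs.
ASSUMED_FIXED_IN = (2, 34)


def parse_glibc_version(ldd_first_line):
    """Extract (major, minor) from a line such as 'ldd (GNU libc) 2.26'."""
    match = re.search(r"(\d+)\.(\d+)\s*$", ldd_first_line.strip())
    if match is None:
        raise ValueError("no version number found in: " + repr(ldd_first_line))
    return int(match.group(1)), int(match.group(2))


def predates_fix(version, fixed_in=ASSUMED_FIXED_IN):
    """Return True when `version` is older than the assumed fixed release."""
    return version < fixed_in


version = parse_glibc_version("ldd (GNU libc) 2.26")
print(version, "predates assumed fix:", predates_fix(version))
```

The same check could be run on every executor node, since the driver and the executors may not share an image. Python's standard `platform.libc_ver()` is another way to read the glibc version from inside a running process.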