ARM GLIBC pre-2.32 TLS issue
Currently the aarch64 builds require __glibc >=2.32 to prevent a TLS allocation error on import. Is there any way we could modify the TLS model of the shared object to avoid this?
Filing this issue to track
ref: https://bugzilla.redhat.com/show_bug.cgi?id=1722181
I think that constraint is way too aggressive... Filed an issue on the feedstock to discuss.
Opened https://github.com/dask-contrib/dask-sql/pull/1201 to remove the constraint and reorder imports to avoid the error when importing
Thanks Charles! 🙏
Would keep this issue open (after that is merged) since it sounds like we are still working around Rust behavior in the end
We can observe the same issue in dask_sql == 2023.11.0.
Environment:
- Architecture: aarch64 (AWS Graviton2, neoverse-n1)
- Linux Distro: Amazon Linux 2
- Linux Kernel: 4.14
- glibc: 2.26
- Python: 3.9.16 (CPython implementation)
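For anyone who wants to check whether their own environment matches the one above, here is a minimal sketch using only the standard library. The function name is_affected is hypothetical, and the 2.32 threshold is simply the constraint mentioned at the top of this issue; treat it as a rough heuristic, not a definitive test.

```python
import platform


def is_affected(machine, libc_name, libc_version):
    """Return True for aarch64 with a glibc older than 2.32 (heuristic only)."""
    if machine != "aarch64":
        return False
    if libc_name != "glibc":
        return False
    try:
        # Keep only "major.minor"; anything unparseable raises ValueError.
        major, minor = (int(part) for part in libc_version.split(".")[:2])
    except ValueError:
        return False  # unknown version: make no claim
    return (major, minor) < (2, 32)


if __name__ == "__main__":
    # platform.libc_ver() returns a (name, version) tuple, e.g. ("glibc", "2.26").
    name, version = platform.libc_ver()
    print(is_affected(platform.machine(), name, version))
```

With the values listed above (aarch64, glibc 2.26) the check returns True.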
The issue happens if we import ray before dask_sql. If we import dask_sql first, everything works fine.
Since this is a managed/serverless environment (AWS Glue for Ray), we can't update glibc or Linux on our end yet (the vendor's anticipated update to Amazon Linux 2023 would resolve it).
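A minimal sketch of the import-order workaround, in case it helps others reproduce it. The helper import_in_order is hypothetical, and it assumes the TLS failure surfaces in Python as an ImportError raised while the extension module is loaded.

```python
import importlib


def import_in_order(module_names):
    """Import modules in the given order.

    Returns a dict mapping each name to the imported module, or to the
    ImportError raised while importing it.
    """
    results = {}
    for name in module_names:
        try:
            results[name] = importlib.import_module(name)
        except ImportError as exc:
            results[name] = exc  # keep the error so the caller can inspect it
    return results


if __name__ == "__main__":
    # Order that works for us: dask_sql before ray.
    for name, outcome in import_in_order(["dask_sql", "ray"]).items():
        status = "failed" if isinstance(outcome, ImportError) else "ok"
        print(name, status)
```

Swapping the two names in the list should reproduce the failure on an affected machine.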
@jakirkham thank you and your team for the hard work!
Nope, older versions also exhibit the same behavior.
We dug deeper into the problem and (spoiler) found a more universal solution that works for anyone on glibc <2.32. Our use case is further complicated because we exhaust TLS much faster: Glue-managed dependencies, Ray-managed dependencies, and a plethora of our own compiled packages (GDAL, GeoPandas, etc.) all rely on TLS. So by the time we import dask_sql, TLS is already exhausted. If we import dask_sql before ray, dask_sql itself seems fine, but some other libraries (especially ones that depend on OpenMP, which also uses TLS heavily) start failing.
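To see which shared objects are already loaded before the failing import, something like the sketch below can help. It only lists the libraries mapped into the process (read from /proc/self/maps on Linux); it does not measure how much TLS each one uses. The function name shared_objects is hypothetical.

```python
import os


def shared_objects(maps_text):
    """Return sorted unique basenames of .so files found in /proc/<pid>/maps text."""
    names = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        base = os.path.basename(fields[5].strip())
        if ".so" in base:
            names.add(base)
    return sorted(names)


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as fh:
        for name in shared_objects(fh.read()):
            print(name)
```

Running it right before the import that fails shows what else is competing for TLS at that point.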
After a deep dive, we found that it's caused by the custom memory allocators that Rust packages often use for performance. It affects jemalloc, mimalloc, and many other implementations.
In the case of dask_sql, it's mimalloc, which is a transitive dependency:
- "dask_sql" Python package depends on "datafusion-python" Rust crate: https://github.com/dask-contrib/dask-sql/blob/5deb35d52698682c7f08d06b1e6dbdf194e5f62e/Cargo.toml#L14
- "datafusion-python" Rust crate depends on "mimalloc" Rust crate: https://github.com/apache/arrow-datafusion-python/blob/da6c183ebb673b27808ff35b1b9fd8d577a203c0/Cargo.toml#L49C4-L49C4
- It's an optional dependency, but it's always installed because it's included in datafusion's "default" feature: https://github.com/apache/arrow-datafusion-python/blob/da6c183ebb673b27808ff35b1b9fd8d577a203c0/Cargo.toml#L32C2-L32C2
- The nice part is that the "mimalloc" crate's team has hit this issue in the past, so they created an opt-in fix (why opt-in? read below): https://github.com/purpleprotocol/mimalloc_rust/pull/38
- To use this fix, we have to enable "local_dynamic_tls" feature of "mimalloc" crate: https://github.com/purpleprotocol/mimalloc_rust/blob/0dc380da37b09c5e11aa6b57022dfdc156d61309/Cargo.toml#L32
I'm no expert in Cargo or maturin, so for our tests I made a pretty dirty fix as follows:
- I forked dask_sql and added "mimalloc" as an explicit dependency with the "local_dynamic_tls" feature enabled: https://github.com/gorloffslava/dask-sql/blob/f3077eccf104cf0767b31d6270871cdf7b44d230/Cargo.toml#L19
- Cargo.lock has to be regenerated for that to work. (In my dirty fork, I just set tool.maturin.locked to false in pyproject.toml; as I said, I'm no expert in Rust-related tools ...)
- And now it works regardless of import order!
Whether "dask_sql" should enable that by default is debatable. While it fixes the problem, it may degrade performance for all glibc users, regardless of glibc version and architecture (x86 vs arm). However, I don't have numbers on how large this degradation (if any) would be on different hardware. On AWS Graviton2, we see ~0% degradation on some sample workloads, though it may start to show on larger datasets. Such a fix could probably be added behind a feature flag.
P.S. As an alternative, dask_sql could depend on the datafusion crate with mimalloc disabled, but this would likely degrade performance a lot.
P.P.S. If you need any extra info from us on this bug, please let me know and I'll help as much as I can.
Thanks @gorloffslava ! 🙏
That added detail helps. So maybe we got lucky before when things happened to work (or didn't and we just weren't aware of the lurking issue)
Will discuss more with the team and see what we can come up with
JFYI folks will be out for the next few days due to holidays. So this probably won't get picked up until next week. Sorry about that (though I hope that info is helpful)
@jakirkham no worries! We're in the middle of the development phase, so we can use our dirty fork in the meantime.
P.S., Happy Thanksgiving!
Thanks for taking the time to look into this issue @gorloffslava, and for providing a few potential solutions 😄
Sounds like it makes the most sense to target arrow-datafusion-python with modifications, since that's ultimately where the mimalloc dependency comes from, but for now I've opened https://github.com/dask-contrib/dask-sql/pull/1274 to test out methods to conditionally set the local_dynamic_tls feature for builds targeting Linux ARM
@charlesbluca any update on this?