Symbol and page pool leak memory when used in a multi-threaded context

Open sadderchris opened this issue 4 years ago • 1 comments

ion-c used to allocate a global symbol table that's used to pool/cache global symbols. This was later put into thread-local storage (TLS), presumably because of thread safety concerns, and this was a cheap/minimally invasive mitigation until further work could be done to make things properly thread safe. Since the symbol pool now lives in TLS, each thread gets its own (global) symbol pool. When the symbol pool is initialized, allocations for the (growable) array backing the symbol pool are handed out from the heap (i.e. not from TLS itself - which wouldn't really make sense anyways since TLS is limited in size). Since the symbol pool is treated as a global resource, there isn't any code to free this backing heap storage - it's assumed to have the same lifetime as the program, which is a fine assumption if it truly were a global resource, but isn't correct given that it is now thread local.

When a thread terminates, it's TLS gets free'd, but nothing is cleaning up the heap-allocated parts of the global symbol pool (because there isn't anything that frees the heap-allocated parts of the symbol pool - it was originally implemented as a global resource). So whenever a thread dies or is reinitialized, we effectively leak its copy of the global symbol pool.

Something similar seems to be happening for the page pool, as I have observed leaks that go away when the page pool is disabled (i.e. the default page limit is set to 0) and it looks like a pointer to the page pool is also stored in TLS. Overall, about ~1.1 MB are being leaked per thread. With lots of threads or a thread pool that recycles threads and re-initializes TLS though, this adds up pretty quickly.

Dec 07 '21 23:12 sadderchris

Something we should consider strongly here is eliminating the page pool entirely--and I do mean delete the code and make sure it cannot be used. The rationale is this--we keep putting mitigations into this code that was never designed to be thread-safe, the use of TLS here was mostly to work around the fact that we couldn't get at all of the places where the page pool allocates even with limit of zero.

In my opinion, instead of retrofitting this code which has been proven to be very prone to bugs, we should consider a re-write in which the pool is explicit in the API (optional and default to NULL meaning no pooling). By explicit, I mean that the pool is explicitly managed by the application and we are explicit about the thread-safety or non-safety of that pool. If the pool is explicit, this is good for the application because it can choose to or not use a concurrent safe version if we so desire. It also means that we don't run afoul of any interesting threading concerns that the host application has.

Dec 15 '21 17:12 almann