
Memory leak as a result of no cleanup for ThreadLocal in CodedOutputStream??

Open asclark109 opened this issue 11 months ago • 16 comments

Discussed in https://github.com/open-telemetry/opentelemetry-java/discussions/7082

Originally posted by asclark109 February 7, 2025: I am using the io.opentelemetry:opentelemetry-exporter-common:1.38.0 jar in my Java web application running on Tomcat 10. I am getting memory leaks at application shutdown (one is an io.netty.util.internal.InternalThreadLocalMap that is tracked in Netty). The other appears below.

07-Feb-2025 12:36:08.735 SEVERE [main] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [agtest-trunk] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@658c5a19]) and a value of type [io.opentelemetry.exporter.internal.marshal.CodedOutputStream.OutputStreamEncoder] (value [io.opentelemetry.exporter.internal.marshal.CodedOutputStream$OutputStreamEncoder@421e361]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

I have looked at your class in release 1.38.0 and on main: CodedOutputStream.java

I notice that a ThreadLocal is created and updated but never cleaned up (i.e. there is no call to do THREAD_LOCAL_CODED_OUTPUT_STREAM.remove()).

https://github.com/open-telemetry/opentelemetry-java/blob/30d16eb3d393f85d99175a62ed246e369e25b906/exporters/common/src/main/java/io/opentelemetry/exporter/internal/marshal/CodedOutputStream.java#L85-L104

If someone can offer help to get around this (or patch a fix), it would be appreciated. thanks.

Only discussion page I could find on ThreadLocals https://github.com/open-telemetry/opentelemetry-java/discussions/6584

asclark109 avatar Feb 07 '25 18:02 asclark109

Looking at this class, one of the tricky things we need to ensure in order to fix it is that the thread that created the object and its ThreadLocal value must also be the thread that cleans up that value.

That is, suppose you create and expose a new method on the class to allow proper cleanup:

public void cleanup() {
  THREAD_LOCAL_CODED_OUTPUT_STREAM.remove();
}

If thread1 creates and uses a CodedOutputStream cos, the ThreadLocal value is created on thread1. If a thread2 comes along and invokes cos.cleanup(), this will NOT actually clean up the ThreadLocal value on thread1, because ThreadLocal.remove() only affects the calling thread's own storage. That's important to point out, I think.

In summary, thread1 has to do the cleanup for the CodedOutputStream that was created and used within thread1; other threads cannot clean up ThreadLocal values created on other threads.
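A tiny standalone sketch (hypothetical names, not SDK code) demonstrating the point: a remove() issued from a second thread leaves the first thread's value in place.

```java
// Hypothetical demo: remove() on thread2 cannot clear thread1's ThreadLocal value.
public class ThreadLocalCleanupDemo {
  private static final ThreadLocal<String> TL = new ThreadLocal<>();

  static String demo() throws InterruptedException {
    final String[] seenOnThread1 = new String[1];
    Thread t1 = new Thread(() -> {
      TL.set("value-on-thread1"); // value lives in thread1's ThreadLocal map
      Thread t2 = new Thread(TL::remove); // remove() runs on thread2, touching only thread2's map
      t2.start();
      try {
        t2.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      seenOnThread1[0] = TL.get(); // still "value-on-thread1"
    });
    t1.start();
    t1.join();
    return seenOnThread1[0];
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(demo());
  }
}
```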

asclark109 avatar Feb 07 '25 19:02 asclark109

Looking at the code, it appears both CodedOutputStream and ProtoSerializer use threadlocals without cleanup. Therefore the area of interest is:

Marshaler.writeBinaryTo(). Once this is called, you have (technically two) memory leaks, because this method creates a ProtoSerializer (which sets the ThreadLocal in that class, which has no cleanup), and that in turn creates a CodedOutputStream (which sets the ThreadLocal in that class, which also has no cleanup).

Marshaler.writeBinaryTo() is the only place where a ProtoSerializer is created, and the ProtoSerializer is the only place the CodedOutputStream is created.

The simplest fix is to make use of the fact that ProtoSerializer implements AutoCloseable.

  1. add the method CodedOutputStream.cleanup(), which does:
public void cleanup() {
  THREAD_LOCAL_CODED_OUTPUT_STREAM.remove();
}
  2. in ProtoSerializer.close(), invoke output.cleanup(). This fixes the memory leak in CodedOutputStream. If you also want to remove the leak in ProtoSerializer (which I found via inspection but have not observed directly yet), also invoke idCache.remove(). This effectively removes the caching mechanism until you change the implementation to not use ThreadLocals (i.e. a different kind of caching) or to use ThreadLocals very carefully.
  @Override
  public void close() throws IOException {
    try {
      output.flush();
      output.cleanup(); // NEW: fixes memory leak 1 by removing threadlocal from thread when "done"
      idCache.clear(); // see my next github comment about removing the threadlocal in ProtoSerializer class
    } catch (IOException e) {
      // If close is called automatically as part of try-with-resources, it's possible that
      // output.flush() will throw the same exception. Re-throwing the same exception in a finally
      // block triggers an IllegalArgumentException indicating illegal self suppression. To avoid
      // this, we wrap the exception so a different instance is thrown.
      throw new IOException(e);
    }
  }

asclark109 avatar Feb 10 '25 17:02 asclark109

Here's a way to replace the THREAD_LOCAL_ID_CACHE in ProtoSerializer. Use a "global" ConcurrentHashMap to store id conversions. Pros:

  1. no ThreadLocals (avoids class loader memory leaks), so it does not pollute threads
  2. works well in multi-threaded environments
  3. has an eviction policy to prevent unbounded growth
import java.util.concurrent.ConcurrentHashMap;

private static final int CACHE_MAX_SIZE = 10_000;  // Adjust as needed
private static final ConcurrentHashMap<String, byte[]> GLOBAL_ID_CACHE = new ConcurrentHashMap<>();

private static byte[] getCachedOrCompute(String id, int length) {
    if (GLOBAL_ID_CACHE.size() > CACHE_MAX_SIZE) {
        GLOBAL_ID_CACHE.clear();
    }

    return GLOBAL_ID_CACHE.computeIfAbsent(id, key -> 
        OtelEncodingUtils.bytesFromBase16(key, length)
    );
}
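If clearing the whole map when it fills feels too coarse, a size-bounded LRU map is another option. This is my sketch (not project code), using LinkedHashMap's access-order eviction behind a synchronized wrapper; the compute function stands in for the base16 conversion.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical alternative: a bounded cache that evicts the least recently used
// entry on overflow instead of dropping everything.
public final class BoundedIdCache {
  private final Map<String, byte[]> map;

  public BoundedIdCache(int maxSize) {
    this.map = Collections.synchronizedMap(
        new LinkedHashMap<String, byte[]>(16, 0.75f, /* accessOrder= */ true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxSize; // drop the LRU entry once the bound is exceeded
          }
        });
  }

  public byte[] getOrCompute(String id, Function<String, byte[]> compute) {
    // e.g. compute = key -> OtelEncodingUtils.bytesFromBase16(key, length)
    return map.computeIfAbsent(id, compute);
  }

  public int size() {
    return map.size();
  }
}
```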

asclark109 avatar Feb 10 '25 19:02 asclark109

I am getting memory leaks at application shutdown (one is a io.netty.util.internal.InternalThreadLocalMap that is tracked in Netty). The other appears below.

Yes I think this is indeed a (small) memory leak. OTLP exporters initialize with a thread pool, which is shutdown upon close. But the thread local variables are not cleaned up. But I'm curious how you came across this? The number of threads is small and the memory allocated to this thread local is small. I would think you would need to start / shut down many SDK exporter instances before this problem becomes noticeable. Is this the case?

(BTW, I do think we should fix this. Just curious how you managed to find it 🙂)

jack-berg avatar Feb 18 '25 15:02 jack-berg

I think it's fairly common to use ThreadLocal for this kind of optimization, I suspect the issue here is that the object stored in the ThreadLocal is not a base java.* class (and so not in the bootstrap class loader), and so it prevents unloading of the class loader that our library is in

trask avatar Feb 18 '25 17:02 trask

I am getting memory leaks at application shutdown (one is a io.netty.util.internal.InternalThreadLocalMap that is tracked in Netty). The other appears below.

Yes I think this is indeed a (small) memory leak. OTLP exporters initialize with a thread pool, which is shutdown upon close. But the thread local variables are not cleaned up. But I'm curious how you came across this? The number of threads is small and the memory allocated to this thread local is small. I would think you would need to start / shut down many SDK exporter instances before this problem becomes noticeable. Is this the case?

(BTW, I do think we should fix this. Just curious how you managed to find it 🙂)

Yes...so, the TLDR is that the specific application I work on is a spring-based java web application, and the configuration for the web application becomes available as a Spring bean early in the spring life cycle. That configuration mentions things like "ProductName", which is necessary in order to create the underlying opentelemetry Resource object. This is somewhat limiting for observability because it would mean you can't observe any execution of statically defined code nor observe spring activities prior to the configuration bean becoming available.

So, what I'm doing is creating a "fake" resource and OpenTelemetry object, using that to observe startup activities, emitting the observed data into an in-memory "catcher", and then, when the configuration bean becomes available, rebuilding the OpenTelemetry object with the new product details and exporting the data that was caught during startup. I am "rebuilding" the OpenTelemetry object because it's immutable, so to achieve what I want it seems I need to create a new OpenTelemetry object and shut down the old one.

So, at this point, yes, I am destroying 3 ish "real" exporters (that export to an inmemory data structure) and creating 3 new ones (that export to a real external location).

asclark109 avatar Feb 18 '25 19:02 asclark109

Even if you don't do what I do above (i.e. you only create the exporters once for the entire life-cycle of the application), you may still be at risk of the memory leak if the thread that executes the java code is not owned by the application.

For example, if the java code is executed by a thread owned by the long-running Tomcat application server, then it creates a threadlocal on the application server's thread, and you've created the memory leak because that's a thread that lives beyond the lifecycle of the application. If the code is executed by an ExecutorService created by the application (with its own threads), then I think if you issue a proper shutdown call then you won't get a leak because the shutdown call also destroys the threads. I think.

asclark109 avatar Feb 18 '25 19:02 asclark109

It sounds like the question to ask is:

"Is the ThreadPool we are creating re-using existing threads, or is it creating its own brand-new threads?" If it's the former, you get a leak unless you clean up the ThreadLocals on those threads. If it's creating new threads, then all you need to do is shut down the thread pool properly.
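A minimal sketch (hypothetical names, not SDK code) of the second case: a pool the application owns, whose worker threads die on shutdown and take their ThreadLocal maps with them.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: shutting down an application-owned pool stops its worker
// threads; a stopped thread's ThreadLocal map is unreachable, so its contents
// become GC-eligible.
public final class OwnedPoolDemo {
  // Stand-in for something like THREAD_LOCAL_CODED_OUTPUT_STREAM
  private static final ThreadLocal<byte[]> SCRATCH =
      ThreadLocal.withInitial(() -> new byte[4096]);

  public static boolean runAndShutdown() throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.execute(() -> SCRATCH.get()); // populates the ThreadLocal on the pool's worker thread
    pool.shutdown();
    // After termination the worker thread is dead; its ThreadLocal map (and the
    // 4 KB array) are no longer reachable from any live thread.
    return pool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```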

asclark109 avatar Feb 18 '25 19:02 asclark109

After a decent amount of work I was able to reproduce this.

ThreadLocals are released when threads stop, and so as long as all the threads that use CodedOutputStream are stopped, then the ThreadLocals are released and GC'd. The OTLP exporters all create and manage their own thread pools and clean them up upon shutdown, so what gives?

It turns out that BatchSpanProcessor is the problem! There's a race condition where BatchSpanProcessor#shutdown()'s CompletableResultCode returns and resolves before its worker thread is shut down and (possibly?) before its exporter shutdown resolves. When I fixed this (in a scrappy way) locally, the error went away.

Will work on getting a proper fix out.

jack-berg avatar Apr 09 '25 15:04 jack-berg

I am indeed using the batch span processor. When you commit the changes to your main branch, can you comment here? I'll see if patching my local jars with those changes will solve the problem for me too.

asclark109 avatar Apr 09 '25 18:04 asclark109

Your analysis sounds right. If you shut down the top-level thing that spawned the threads (ExecutorService, thread pool, etc.), and that thing isn't re-using existing threads (such as from an application server like Tomcat), then it is technically okay that those threads used ThreadLocals. This is because shutting down the top-level thing allows the threads it spawned to be stopped, which allows GC of the ThreadLocal storage, and thus of the references to the class loader that loaded your application classes (and so on), allowing everything to be GC'd.

But! if you forget to shutdown the top level constructs (ExecutorService, ThreadPools) AND the spawned threads DID use threadlocal storage, then you're hit with a bigger problem: every web application restart will result in a class loader memory leak preventing a lot from being GC'd. A single reference to any application class in ThreadLocal storage will prevent the GC of everything loaded by that class loader.

In other words, using Threadlocals stabs you in the back in this case because it takes something very possible (forgetting to explicitly shutdown an ExecutorService on application restart) and creates a real problem out of it (memory leak).

asclark109 avatar Apr 09 '25 18:04 asclark109

(I've written a long block below about something I thought was causing a leak, but now I am not entirely sure it was actually a cause.)

I am beginning to suspect my specific memory leak has to do with the following below, but it surfaces an important issue. So, because of the limitations of the OpenTelemetry object being an immutable create-once-at-application-start type object, I have a limitation whereby in order to observe before and during spring startup, I create an OpenTelemetry object with minimal configuration, catching all the collected data in memory (I have in memory data structures that catch the emitted data). When spring is fully up, and beans that define configuration information are available, I then have enough information to create a new OpenTelemetry object (with all legit application-level configuration information). I create a new OpenTelemetry and shutdown the old one. With all the data caught during startup, I re-export that data using the new exporters injected into the new OpenTelemetry object. This leads to a memory leak described below:

So when I construct an OpenTelemetry object, I am injecting an exporter but in addition asking that exporter to do an export call with the caught opentelemetry data during startup. I create the OpenTelemetry exporters (and the overall OpenTelemetry object) in an application thread, and on the same application thread I ask the exporter(s) to export that data for me that was caught during startup. This is the key detail: because I am invoking a method on your exporter in an application thread, your exporter creates a thread local on my application thread, which pollutes the thread of my application. The reason the exporter works fine in your BatchSpanProcessor is because all of its usage is contained in a thread you created, which is eventually gc'd. However, if any developer (in their application thread) invokes a method on your classes that uses underlying threadlocals, you will end up putting threadlocals on their threads (unless you internally invoke that logic on a new thread you create).

Although it is dirty, one workaround I have in mind (trying it now) is to run my direct invocation of export (which happens approximately once in an application lifecycle) on a new ExecutorService that I create and immediately shut down, to make sure I do not directly invoke methods that use ThreadLocal storage on an application thread that might be owned by Tomcat/WebSphere/etc. In summary, I think this hints that any of your classes that use ThreadLocal storage should remain as hidden as possible from external users, so that you can restrict invocations of the methods that create ThreadLocals to threads you create, or you refactor to incorporate cleanup of the ThreadLocals (so it becomes always safe to invoke those methods from any thread). I'm spitballing out loud though.
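A sketch of that workaround (the class and method names are mine, not the SDK's): run the one-off export task on a throwaway executor so any ThreadLocals it creates land on a thread that is stopped right away, instead of on an application-server thread that outlives the web app.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: isolate a ThreadLocal-creating call on a short-lived thread.
public final class IsolatedInvoker {
  private IsolatedInvoker() {}

  public static void runIsolated(Runnable task, long timeoutSeconds) throws Exception {
    ExecutorService oneShot = Executors.newSingleThreadExecutor();
    try {
      oneShot.submit(task).get(timeoutSeconds, TimeUnit.SECONDS);
    } finally {
      oneShot.shutdownNow(); // stop the worker thread; its ThreadLocal map becomes GC-eligible
      oneShot.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    }
  }
}
```

Usage would look like IsolatedInvoker.runIsolated(() -> spanExporter.export(caughtSpans), 30), where spanExporter and caughtSpans are placeholders for the real exporter and the startup data; the application thread itself never touches the ThreadLocal-using code.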

asclark109 avatar Apr 15 '25 17:04 asclark109

random followup: making some changes above, I get fewer class loader leak warnings from Tomcat at shutdown, but I still get one leak. If I take a heap dump right at application shutdown, it appears the ThreadLocal is held on BatchSpanProcessor_WorkerThread-1. I am fairly certain I am shutting it down via the top-level OpenTelemetry shutdown call. Will look more closely still.

[heap dump screenshot]

asclark109 avatar Apr 15 '25 20:04 asclark109

[screenshot]

feels like it could have to do with BatchSpanProcessor.shutdown() merely returning worker.shutdown() (and not also waiting for the worker thread to complete)?

asclark109 avatar Apr 15 '25 21:04 asclark109

After digging at this a bit and fooling myself many times, I am convinced there is some kind of delayed cleanup for BatchSpanProcessor (e.g. it resolves as having its shutdown complete when it is not actually complete). In the end, I found it was much less complicated to simply remove the use of ThreadLocals in CodedOutputStream (instead of changing shutdown behavior) by introducing an "EncoderCache" to simulate the same functionality as a ThreadLocal: each thread uses its thread id to look up its respective CodedOutputStream in a cache that has an eviction policy. I.e. I put this into CodedOutputStream:

private static class EncoderCache {
  private final int maxSize;
  private final Map<Long, OutputStreamEncoder> cache = new ConcurrentHashMap<>();
  private final ConcurrentLinkedQueue<Long> accessOrder = new ConcurrentLinkedQueue<>();
  private final Object lock = new Object();

  public EncoderCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public OutputStreamEncoder getOrCreate(OutputStream output) {
    // Attempt to fetch the encoder cached for the current thread
    Long key = Thread.currentThread().getId();
    OutputStreamEncoder cos = cache.get(key);
    if (cos == null) {
      cos = new OutputStreamEncoder(output);
      cache.put(key, cos);
      accessOrder.offer(key);
      synchronized (lock) {
        evictIfNecessary();
      }
    } else {
      cos.reset(output);
    }
    return cos;
  }

  private void evictIfNecessary() {
    // Evict the oldest-inserted entries while cache size exceeds maxSize
    while (cache.size() > maxSize) {
      Long oldKey = accessOrder.poll();
      if (oldKey != null) {
        cache.remove(oldKey);
      }
    }
  }
}

Then at top of the class:

private static final EncoderCache encoderCache = new EncoderCache(30);

Refactor this method to have this impl:

  static CodedOutputStream newInstance(final OutputStream output) {
    return encoderCache.getOrCreate(output);
  }

I think with the use of synchronized, I can just use normal data structures. Alternatively, you can remove the synchronized block (keeping the concurrent data structures) for slightly less correct behavior (e.g. a potential race where two evictions run at once), where that incorrect behavior does not cause anything bad.

asclark109 avatar Apr 18 '25 12:04 asclark109

But! if you forget to shutdown the top level constructs (ExecutorService, ThreadPools) AND the spawned threads DID use threadlocal storage, then you're hit with a bigger problem: every web application restart will result in a class loader memory leak preventing a lot from being GC'd. A single reference to any application class in ThreadLocal storage will prevent the GC of everything loaded by that class loader.

Yes we need to clean up resources.

In other words, using Threadlocals stabs you in the back in this case because it takes something very possible (forgetting to explicitly shutdown an ExecutorService on application restart) and creates a real problem out of it (memory leak).

There's an argument that the real problem isn't the use of thread locals, but the paradigm of a web application server, where you try to maintain an extremely long-lived JVM with isolated web applications which are added / removed / restarted. In the era of containers this starts to seem increasingly like a solution looking for a problem.

I still don't love the idea of these thread locals, or that they can't easily be accessed and cleaned up when an exporter is shut down, but there's no super clean way to get rid of them. Also, even if we did, we use thread locals in other places and we'd need to do an analysis to make sure those are cleaned up as well.

So when I construct an OpenTelemetry object, I am injecting an exporter but in addition asking that exporter to do an export call with the caught opentelemetry data during startup. I create the OpenTelemetry exporters (and the overall OpenTelemetry object) in an application thread, and on the same application thread I ask the exporter(s) to export that data for me that was caught during startup. This is the key detail: because I am invoking a method on your exporter in an application thread, your exporter creates a thread local on my application thread, which pollutes the thread of my application. The reason the exporter works fine in your BatchSpanProcessor is because all of its usage is contained in a thread you created, which is eventually gc'd. However, if any developer (in their application thread) invokes a method on your classes that uses underlying threadlocals, you will end up putting threadlocals on their threads (unless you internally invoke that logic on a new thread you create).

Yes this is correct. Things are safe as long as you use BatchSpanProcessor, BatchLogRecordProcessor, PeriodicMetricReader, which start and manage background threads which are cleaned up when closed. But if you directly call export on an application thread, either through your own code or through something like SimpleSpanProcessor or SimpleLogRecordProcessor, then the application threads are polluted with the thread locals.

feels like it could have to do with BatchSpanProcessor.shutdown() merely returning worker.shutdown() (and not also waiting for the worker thread to complete)?

Yes this is the case. My local fix that I referenced before is basically just synchronously waiting for the batch span processor worker thread to stop before shutdown returns. Right now, the worker thread eventually shuts down, but it can occur after a non-trivial delay from when shutdown returns.

In the end, I found it was much less complicated to simply remove the use of threadlocals in CodedOutputStream

I'm open to this idea, but as I mentioned, there are other usages of ThreadLocal, for example in ProtoSerializer. Did you run into issues with this thread local usage after implementing your fix?

jack-berg avatar Apr 21 '25 21:04 jack-berg