SharedLock ThreadLocal threadWriters memory leak may cause CPU usage to reach 100%

Open huisman6 opened this issue 1 month ago • 7 comments

Bug Report

Current Behavior

We have a long-running Spring Boot application in production, deployed on k8s, and occasionally one or two Pods experience 100% CPU usage.

After taking a heap dump, we found that most of the CPU was consumed in ThreadLocalMap.expungeStaleEntry:

http-nio-8080-exec-64
  at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(I)I ()
  at java.lang.ThreadLocal$ThreadLocalMap.remove(Ljava/lang/ThreadLocal;)V ()
  at java.lang.ThreadLocal.remove(Ljava/lang/Thread;)V ()
  at java.lang.ThreadLocal.remove()V ()
  at org.springframework.context.i18n.LocaleContextHolder.resetLocaleContext()V (LocaleContextHolder.java:70)
  at org.springframework.web.filter.RequestContextFilter.resetContextHolders()V (RequestContextFilter.java:120)
  at org.springframework.web.filter.RequestContextFilter.doFilterInternal(Ljakarta/servlet/http/HttpServletRequest;Ljakarta/servlet/http/HttpServletResponse;Ljakarta/servlet/FilterChain;)V (RequestContextFilter.java:103)

Further analysis showed 23,776 ThreadLocal entries in the Tomcat thread's threadLocals map. Most of them were io.lettuce.core.protocol.SharedLock$$Lambda, which corresponds to the threadWriters field in the source code. With this many ThreadLocal entries, the cleanup of stale entries may consume much more CPU.

private final ThreadLocal<Integer> threadWriters = ThreadLocal.withInitial(() -> 0);

The value of the referent field for 20,827 ThreadLocalMap entries is io.lettuce.core.protocol.SharedLock threadWriters.

There are also thousands of ThreadLocalMap entries where the referent is null and the value is java.lang.Integer = 0. These are likely left over from threadWriters instances that have already been garbage collected. The cleanup of these stale entries consumes a significant amount of CPU.
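
The accumulation pattern can be reproduced with the JDK alone. The sketch below is illustrative and is not Lettuce code: `PerInstanceCounter` is a hypothetical stand-in for any class that holds a non-static `ThreadLocal` field, the way `SharedLock` holds `threadWriters`. Every short-lived instance that is touched on a long-lived thread adds one entry to that thread's `ThreadLocalMap`. The entry's key is only weakly referenced, so once the instance is collected the entry stays behind as a stale one until the map expunges it.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class PerInstanceCounter {

    // Counts initializer runs; each run corresponds to one new map entry.
    static final AtomicInteger ENTRIES_CREATED = new AtomicInteger();

    // Non-static: every instance owns a distinct ThreadLocal key.
    private final ThreadLocal<Integer> counter = ThreadLocal.withInitial(() -> {
        ENTRIES_CREATED.incrementAndGet();
        return 0;
    });

    int read() {
        // The first get() on a thread runs the initializer and stores an entry
        // in the calling thread's ThreadLocalMap.
        return counter.get();
    }

    // Simulates one long-lived worker thread that uses many short-lived instances.
    static int touchInstances(int instances) {
        int before = ENTRIES_CREATED.get();
        for (int i = 0; i < instances; i++) {
            new PerInstanceCounter().read();
        }
        return ENTRIES_CREATED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("entries created on this thread: " + touchInstances(1_000));
    }
}
```

With a thread pool such as Tomcat's, the worker threads outlive the pooled connections, so the same effect would build up over the lifetime of the process.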

Expected behavior/code

This code was introduced in version 6.4.0, and versions 6.4 and above are all affected by this ThreadLocal leak. The longer the process runs, the more likely it is to trigger 100% CPU usage.

Is it possible to modify threadWriters to static final to prevent ThreadLocal leaks?

private static final ThreadLocal<Integer> threadWriters = ThreadLocal.withInitial(() -> 0);
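
Another option, which keeps the counter per instance, would be to remove the thread's entry as soon as the count drops back to zero. The following is a minimal sketch under that assumption, not the Lettuce implementation: `WriterCounter` and its methods are hypothetical names. The `initializations` field exists only to make the removal observable, since the initializer runs again after `remove()`.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class WriterCounter {

    // Counts initializer runs; a new run means the previous entry was removed.
    private final AtomicInteger initializations = new AtomicInteger();

    private final ThreadLocal<Integer> writers = ThreadLocal.withInitial(() -> {
        initializations.incrementAndGet();
        return 0;
    });

    void increment() {
        writers.set(writers.get() + 1);
    }

    void decrement() {
        int remaining = writers.get() - 1;
        if (remaining <= 0) {
            // Drop the entry from the current thread's map instead of storing 0.
            writers.remove();
        } else {
            writers.set(remaining);
        }
    }

    // Note: calling this after the entry was removed re-creates it with value 0.
    int current() {
        return writers.get();
    }

    int initializations() {
        return initializations.get();
    }

    public static void main(String[] args) {
        WriterCounter counter = new WriterCounter();
        counter.increment();
        counter.decrement(); // count is back to zero, entry removed
        counter.increment(); // initializer runs a second time
        System.out.println("initializer runs: " + counter.initializations());
    }
}
```

The trade-off in this sketch is an extra `remove()` call each time the outermost writer finishes, in exchange for leaving no entry behind on idle threads.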

Environment

  1. SpringBoot 3.4.8
  2. Spring Data Redis 3.4.8
  3. Lettuce 6.4.2.RELEASE, Netty 4.1.123.Final
  4. JDK21 + Generational ZGC
  5. Redis Version: 6.0

Lettuce is configured in connection pool mode and does not share the native TCP connection: LettuceConnectionFactory.setShareNativeConnection(false).

The connection pool is Apache commons-pool2, with a maximum of 200 connections and a minimum of 15 idle connections. Peak active connections are around 60. During a sudden surge of traffic the pool occasionally creates more TCP connections and closes them again when they become idle.

huisman6 avatar Oct 27 '25 12:10 huisman6

Hey, @huisman6 I am reviewing the scenario you described. But in the meantime, is there a specific reason for you to use pooling? We recommend using lettuce without pooling and with the native connection shared. That way you can get the most out of it. The exception is if you need to use transactions. Check more here: https://redis.github.io/lettuce/advanced-usage/#is-connection-pooling-necessary

a-TODO-rov avatar Oct 28 '25 08:10 a-TODO-rov

Our application uses a hybrid cloud architecture, and the business has a high volume of requests to Redis. The peak Redis command rate for a single application instance is around 5,000 per second, and in most scenarios mget is used, with an average of 300 keys per mget. We have observed that under high request volumes, the shared native connection mode occasionally produces latency spikes, because network packet loss, pipelined commands, or big keys cause head-of-line blocking, which affects application stability. The connection pool mode tolerates head-of-line blocking better and is relatively more stable.

huisman6 avatar Oct 29 '25 01:10 huisman6

@a-TODO-rov Our temporary solution is to downgrade the Lettuce version to 6.3.2.RELEASE. Is this version compatible with JDK 21? Are there any other feasible solutions?

huisman6 avatar Nov 12 '25 08:11 huisman6

Hey @huisman6 ,

Seems that this is a side-effect of #2905 and is specific to using a generational ZGC in combination with a connection pool under heavy load. For the time being using the old release would obviously help, but I plan to submit a solution that can help this case.

We can try and backport it to 6.x so that you do not have to make a major version update, would that be ok? Do you think you can also consume 7.x as this would save us some work?

Ideally you could pull a snapshot build and let us know if it resolves the problem, but I am not sure if this is feasible.

tishun avatar Dec 01 '25 16:12 tishun

@tishun

It would be great if it could be backported to the 6.4.x version.

We rely on Spring Boot for managing the Lettuce dependency version and ensuring compatibility. Spring Boot 3.4 corresponds to Lettuce version 6.4.2.RELEASE, and upgrading to a new version may cause compatibility issues with Spring Data Redis.

huisman6 avatar Dec 02 '25 08:12 huisman6

Unfortunately, the team does not currently have the free resources to provide a solution.

Contributions are very welcome. Otherwise we will have to revisit it once the backlog permits.

tishun avatar Dec 03 '25 08:12 tishun

i'll try this issue 😄

bandalgomsu avatar Dec 04 '25 13:12 bandalgomsu