bazel-buildfarm
Sharded workers can get stuck on Pool exhaustion
A few of our Sharded workers get stuck right after logging the message Startup Time: 1s if Redis instances are restarting or being migrated while the Sharded worker process is starting up. No exceptions or other traces are printed in the logs. A thread dump is attached: stuck-exhaustion.txt
In all observed cases (three workers recently stuck, with thread dumps identical to this one), the number of file descriptors under /proc/pid/fd reaches about 4000, matching the jedis_pool_max_total value. The worker process does not recover on its own, even after a few days, so it has to be restarted manually.
Interesting. Did you happen to notice on the redis server what everyone was stuck doing? CLIENT LIST is my go-to there. I'm going to presume they were idle, but if everybody was sitting on a key check, I'd question whether your redis instance was healthy during that time. Further, if none of those clients were actually connected, I wonder whether this is a failure of jedis to detect some unavailable condition on the remote.
There are only two threads doing Redis operations.
Thread-5 seems to be waiting for a response from SUBSCRIBE and main is stuck locally in SADD.
Used https://spotify.github.io/threaddump-analyzer/ to make sense of the thread dump.
Not sure how much that helps, but there it is...
"Thread-5": running
at java.net.SocketInputStream.socketRead0(java.base/Native Method)
at java.net.SocketInputStream.socketRead(java.base/SocketInputStream.java:115)
at java.net.SocketInputStream.read(java.base/SocketInputStream.java:168)
at java.net.SocketInputStream.read(java.base/SocketInputStream.java:140)
at java.net.SocketInputStream.read(java.base/SocketInputStream.java:126)
at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:199)
at redis.clients.jedis.util.RedisInputStream.readByte(RedisInputStream.java:43)
at redis.clients.jedis.Protocol.process(Protocol.java:155)
at redis.clients.jedis.Protocol.read(Protocol.java:220)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:389)
at redis.clients.jedis.Connection.getUnflushedObjectMultiBulkReply(Connection.java:351)
at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:123)
at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:117)
at build.buildfarm.instance.shard.RedisShardSubscriber.proceed(RedisShardSubscriber.java:349)
at redis.clients.jedis.Jedis.subscribe(Jedis.java:2813)
at redis.clients.jedis.JedisCluster$157.execute(JedisCluster.java:1768)
at redis.clients.jedis.JedisCluster$157.execute(JedisCluster.java:1765)
at redis.clients.jedis.JedisClusterCommand.runWithAnyNode(JedisClusterCommand.java:78)
at redis.clients.jedis.JedisCluster.subscribe(JedisCluster.java:1771)
at build.buildfarm.instance.shard.RedisShardSubscription.subscribe(RedisShardSubscription.java:66)
at build.buildfarm.instance.shard.RedisShardSubscription.lambda$iterate$0(RedisShardSubscription.java:71)
at build.buildfarm.instance.shard.RedisShardSubscription$$Lambda$156/0x00007f74fdf0d0b0.accept(Unknown Source)
at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:88)
at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:85)
at build.buildfarm.common.redis.RedisClient.call(RedisClient.java:118)
at build.buildfarm.common.redis.RedisClient.run(RedisClient.java:84)
at build.buildfarm.instance.shard.RedisShardSubscription.iterate(RedisShardSubscription.java:71)
at build.buildfarm.instance.shard.RedisShardSubscription.mainLoop(RedisShardSubscription.java:92)
at build.buildfarm.instance.shard.RedisShardSubscription.run(RedisShardSubscription.java:106)
at java.lang.Thread.run(java.base/Thread.java:829)
"main": awaiting notification on [0x00007f75f1455808]
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base/AbstractQueuedSynchronizer.java:2081)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:587)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:440)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361)
at redis.clients.jedis.util.Pool.getResource(Pool.java:50)
at redis.clients.jedis.JedisPool.getResource(JedisPool.java:234)
at redis.clients.jedis.JedisClusterConnectionHandler.getConnectionFromNode(JedisClusterConnectionHandler.java:42)
at redis.clients.jedis.JedisClusterPipeline.getClient(JedisClusterPipeline.java:97)
at redis.clients.jedis.PipelineBase.sadd(PipelineBase.java:593)
at build.buildfarm.instance.shard.RedisShardBackplane.lambda$addBlobsLocation$20(RedisShardBackplane.java:900)
at build.buildfarm.instance.shard.RedisShardBackplane$$Lambda$194/0x00007f7487b1f108.accept(Unknown Source)
at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:88)
at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:85)
at build.buildfarm.common.redis.RedisClient.call(RedisClient.java:118)
at build.buildfarm.common.redis.RedisClient.run(RedisClient.java:84)
at build.buildfarm.instance.shard.RedisShardBackplane.addBlobsLocation(RedisShardBackplane.java:895)
at build.buildfarm.worker.shard.Worker.addBlobsLocation(Worker.java:747)
at build.buildfarm.worker.shard.Worker.lambda$start$8(Worker.java:830)
at build.buildfarm.worker.shard.Worker$$Lambda$157/0x00007f74fdf0c0b0.accept(Unknown Source)
at build.buildfarm.worker.shard.CFCExecFileSystem.start(CFCExecFileSystem.java:127)
at build.buildfarm.worker.shard.Worker.start(Worker.java:829)
at build.buildfarm.worker.shard.Worker.startWorker(Worker.java:912)
at build.buildfarm.worker.shard.Worker.main(Worker.java:879)
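For what it's worth, the main thread above is parked in GenericObjectPool.borrowObject via LinkedBlockingDeque.takeFirst, which is the untimed wait commons-pool2 uses when the pool blocks on exhaustion and no maximum wait is set. That would explain why nothing is logged and the worker never recovers: the borrow simply never returns. Below is a minimal sketch of the difference between an untimed and a bounded borrow. It uses only the JDK's java.util.concurrent.LinkedBlockingDeque as a stand-in for the pool's idle queue, so BoundedBorrowSketch, borrow, and PoolExhaustedException are all hypothetical names, not buildfarm or jedis code.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

public class BoundedBorrowSketch {
    /** Hypothetical exception raised when no connection frees up in time. */
    static final class PoolExhaustedException extends RuntimeException {
        PoolExhaustedException(String message) {
            super(message);
        }
    }

    /**
     * Borrows from a stand-in idle queue with a bounded wait.
     * An untimed idle.takeFirst() here would park the caller until another
     * thread returns a connection, which is the state the "main" thread is in.
     */
    static String borrow(LinkedBlockingDeque<String> idle, long timeoutMillis) {
        try {
            String conn = idle.pollFirst(timeoutMillis, TimeUnit.MILLISECONDS);
            if (conn == null) {
                // Surface the exhaustion instead of hanging silently.
                throw new PoolExhaustedException(
                    "no idle connection within " + timeoutMillis + " ms");
            }
            return conn;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
    }

    public static void main(String[] args) {
        LinkedBlockingDeque<String> idle = new LinkedBlockingDeque<>();
        idle.add("conn-1");

        // First borrow succeeds because one connection is idle.
        System.out.println(borrow(idle, 100));

        // Second borrow finds the queue empty and fails fast after the timeout.
        try {
            borrow(idle, 100);
        } catch (PoolExhaustedException e) {
            System.out.println("exhausted: " + e.getMessage());
        }
    }
}
```

In commons-pool2 itself the corresponding knob is the max wait on GenericObjectPoolConfig (setMaxWaitMillis in the 2.x line), which JedisPoolConfig inherits. Whether buildfarm's configuration exposes it is not something this thread establishes, so treat that part as an assumption to verify.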