Sharded workers can get stuck on Pool exhaustion

Open thna123459 opened this issue 4 years ago • 2 comments

A few of our Sharded workers get stuck right after logging the message Startup Time: 1s. This happens when the Sharded worker process starts up while Redis instances are restarting or being migrated. No exceptions or other traces are printed in the logs. A thread dump is attached: stuck-exhaustion.txt

In all observed cases (three workers recently stuck with thread dumps identical to this one), the number of file descriptors under /proc/pid/fd reaches about 4000 (matching the jedis_pool_max_total value). The worker process does not recover on its own, even after a few days, so it has to be restarted manually.
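
For anyone who wants to compare a worker's descriptor count against jedis_pool_max_total, here is a minimal sketch that counts the entries of a directory such as /proc/pid/fd. It assumes a Linux host; the class and method names (`FdCount`, `countEntries`) are made up for illustration and are not part of buildfarm.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FdCount {
    // Counts the entries in a directory, e.g. /proc/<pid>/fd on Linux.
    static long countEntries(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a pid as the first argument; defaults to this JVM itself.
        String pid = args.length > 0 ? args[0] : "self";
        Path fdDir = Paths.get("/proc", pid, "fd");
        System.out.println(fdDir + ": " + countEntries(fdDir) + " open descriptors");
    }
}
```

Run periodically against the worker's pid, this would show whether the count climbs toward the pool limit during a Redis restart or is already there when the worker hangs.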

thna123459 avatar Jan 08 '21 14:01 thna123459

Interesting. Did you happen to notice on the redis server what everyone was stuck doing? client list is my go-to there. I'm going to presume they were idle, but if everybody was sitting on a key check, I'd question whether your redis instance was healthy during that time. Further, if none of those clients were actually connected, I wonder if this is a failure of jedis to detect some unavailable condition for the remote.
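
With thousands of connections, the client list output is easier to read once it is tallied. A rough sketch of that, assuming the usual space-separated field=value lines with a cmd= field for the last command; the class name `ClientListSummary` and the sample lines are made up for illustration, not real output from this incident.

```java
import java.util.Map;
import java.util.TreeMap;

public class ClientListSummary {
    // Tallies clients by the value of their cmd= field (last command run).
    static Map<String, Integer> countByCommand(String clientList) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : clientList.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String cmd = "unknown";
            for (String field : line.trim().split(" ")) {
                if (field.startsWith("cmd=")) {
                    cmd = field.substring("cmd=".length());
                }
            }
            counts.merge(cmd, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample in the field=value format, not captured from a server.
        String sample =
            "id=3 addr=10.0.0.5:41234 age=120 idle=118 cmd=subscribe\n"
          + "id=4 addr=10.0.0.5:41236 age=119 idle=119 cmd=sadd\n"
          + "id=5 addr=10.0.0.6:51000 age=30 idle=30 cmd=sadd\n";
        System.out.println(countByCommand(sample)); // prints {sadd=2, subscribe=1}
    }
}
```

If nearly every client from a stuck worker shows the same command with a large idle value, that points at the server side; if the worker's connections are missing from the list entirely, it points at the client not noticing dead sockets.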

werkt avatar Jan 31 '21 22:01 werkt

There are only two threads doing Redis operations.

Thread-5 seems to be waiting for a response from SUBSCRIBE and main is stuck locally in SADD.

Used https://spotify.github.io/threaddump-analyzer/ to make sense of the thread dump.

Not sure how much that helps, but there it is...

"Thread-5": running
	at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
	at java.net.SocketInputStream.socketRead([email protected]/SocketInputStream.java:115)
	at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:168)
	at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:140)
	at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:126)
	at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:199)
	at redis.clients.jedis.util.RedisInputStream.readByte(RedisInputStream.java:43)
	at redis.clients.jedis.Protocol.process(Protocol.java:155)
	at redis.clients.jedis.Protocol.read(Protocol.java:220)
	at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:389)
	at redis.clients.jedis.Connection.getUnflushedObjectMultiBulkReply(Connection.java:351)
	at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:123)
	at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:117)
	at build.buildfarm.instance.shard.RedisShardSubscriber.proceed(RedisShardSubscriber.java:349)
	at redis.clients.jedis.Jedis.subscribe(Jedis.java:2813)
	at redis.clients.jedis.JedisCluster$157.execute(JedisCluster.java:1768)
	at redis.clients.jedis.JedisCluster$157.execute(JedisCluster.java:1765)
	at redis.clients.jedis.JedisClusterCommand.runWithAnyNode(JedisClusterCommand.java:78)
	at redis.clients.jedis.JedisCluster.subscribe(JedisCluster.java:1771)
	at build.buildfarm.instance.shard.RedisShardSubscription.subscribe(RedisShardSubscription.java:66)
	at build.buildfarm.instance.shard.RedisShardSubscription.lambda$iterate$0(RedisShardSubscription.java:71)
	at build.buildfarm.instance.shard.RedisShardSubscription$$Lambda$156/0x00007f74fdf0d0b0.accept(Unknown Source)
	at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:88)
	at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:85)
	at build.buildfarm.common.redis.RedisClient.call(RedisClient.java:118)
	at build.buildfarm.common.redis.RedisClient.run(RedisClient.java:84)
	at build.buildfarm.instance.shard.RedisShardSubscription.iterate(RedisShardSubscription.java:71)
	at build.buildfarm.instance.shard.RedisShardSubscription.mainLoop(RedisShardSubscription.java:92)
	at build.buildfarm.instance.shard.RedisShardSubscription.run(RedisShardSubscription.java:106)
	at java.lang.Thread.run([email protected]/Thread.java:829)

"main": awaiting notification on [0x00007f75f1455808]
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:2081)
	at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:587)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:440)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361)
	at redis.clients.jedis.util.Pool.getResource(Pool.java:50)
	at redis.clients.jedis.JedisPool.getResource(JedisPool.java:234)
	at redis.clients.jedis.JedisClusterConnectionHandler.getConnectionFromNode(JedisClusterConnectionHandler.java:42)
	at redis.clients.jedis.JedisClusterPipeline.getClient(JedisClusterPipeline.java:97)
	at redis.clients.jedis.PipelineBase.sadd(PipelineBase.java:593)
	at build.buildfarm.instance.shard.RedisShardBackplane.lambda$addBlobsLocation$20(RedisShardBackplane.java:900)
	at build.buildfarm.instance.shard.RedisShardBackplane$$Lambda$194/0x00007f7487b1f108.accept(Unknown Source)
	at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:88)
	at build.buildfarm.common.redis.RedisClient$1.run(RedisClient.java:85)
	at build.buildfarm.common.redis.RedisClient.call(RedisClient.java:118)
	at build.buildfarm.common.redis.RedisClient.run(RedisClient.java:84)
	at build.buildfarm.instance.shard.RedisShardBackplane.addBlobsLocation(RedisShardBackplane.java:895)
	at build.buildfarm.worker.shard.Worker.addBlobsLocation(Worker.java:747)
	at build.buildfarm.worker.shard.Worker.lambda$start$8(Worker.java:830)
	at build.buildfarm.worker.shard.Worker$$Lambda$157/0x00007f74fdf0c0b0.accept(Unknown Source)
	at build.buildfarm.worker.shard.CFCExecFileSystem.start(CFCExecFileSystem.java:127)
	at build.buildfarm.worker.shard.Worker.start(Worker.java:829)
	at build.buildfarm.worker.shard.Worker.startWorker(Worker.java:912)
	at build.buildfarm.worker.shard.Worker.main(Worker.java:879)
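
The main stack above is parked in takeFirst inside GenericObjectPool.borrowObject, which is the path commons-pool2 takes when the pool is exhausted and the configured max wait is negative: it waits with no deadline for a connection to be returned. The sketch below reproduces only that distinction with the JDK's own LinkedBlockingDeque as a stand-in; it is not Jedis or commons-pool2 code, and the names `BoundedBorrow` and `borrow` are made up.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

public class BoundedBorrow {
    // Returns an idle item, or null if none shows up within maxWaitMillis.
    // A negative maxWaitMillis waits forever, like the stuck main thread.
    static <T> T borrow(LinkedBlockingDeque<T> idle, long maxWaitMillis) {
        try {
            if (maxWaitMillis < 0) {
                return idle.takeFirst(); // no deadline: parks until an item arrives
            }
            return idle.pollFirst(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while borrowing", e);
        }
    }

    public static void main(String[] args) {
        LinkedBlockingDeque<String> idle = new LinkedBlockingDeque<>();
        idle.add("conn-1");
        System.out.println(borrow(idle, 100)); // prints conn-1
        System.out.println(borrow(idle, 100)); // prints null once the wait expires
    }
}
```

With a bounded wait the caller gets a failure it can log or retry instead of hanging silently. In commons-pool2 the equivalent knob is the max wait on the pool config (setMaxWaitMillis in the versions from around this time); whether buildfarm exposes it as a worker setting is not something this thread establishes.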

walles avatar Apr 07 '21 06:04 walles