spark
spark copied to clipboard
[SPARK-43301][CORE][SHUFFLE] BlockStoreClient getHostLocalDirs RPC supports IOException retry
What changes were proposed in this pull request?
Use CompletableFuture to implement retry logic, and retry operations are performed asynchronously.
Why are the changes needed?
BlockStoreClient#getHostLocalDirs RPC did not retry when IOexception occurred, and then FetchFailedException was thrown.
23/04/24 01:24:55,158 [shuffle-client-7-1] WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [148]
23/04/24 01:24:55,158 [shuffle-client-7-1] ERROR ShuffleBlockFetcherIterator: Error occurred while fetching host local blocks
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:745)
Does this PR introduce any user-facing change?
No
How was this patch tested?
local test / add UT
Was this patch authored or co-authored using generative AI tooling?
No
@mridulm @otterc Please help review, thanks in advance!
This slipped through my TODO list - will get back to it later this week, sorry for the delay @cxzl25 !
@Ngone51 @mridulm Please help review again, thanks in advance!
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!