[SPARK-43301][CORE][SHUFFLE] BlockStoreClient getHostLocalDirs RPC supports IOException retry

Open cxzl25 opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

Implement the retry logic with CompletableFuture so that retries are performed asynchronously.
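
For illustration only, here is a minimal sketch of what asynchronous retry on IOException with CompletableFuture can look like. It is not the code in this PR: the class `AsyncRetry`, the method `retryOnIOException`, and the parameters `maxRetries` and `retryWaitMs` are hypothetical names, and the RPC is modeled as a plain `Supplier` that returns a future.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of Spark: retries an async call on IOException.
final class AsyncRetry {
  private AsyncRetry() {}

  /**
   * Invokes rpc and returns a future that completes with its result.
   * If an attempt fails with an IOException, the call is rescheduled after
   * retryWaitMs, up to maxRetries extra attempts. Other failures are not retried.
   */
  static <T> CompletableFuture<T> retryOnIOException(
      Supplier<CompletableFuture<T>> rpc,
      int maxRetries,
      long retryWaitMs,
      ScheduledExecutorService scheduler) {
    CompletableFuture<T> result = new CompletableFuture<>();
    attempt(rpc, maxRetries, retryWaitMs, scheduler, result);
    return result;
  }

  private static <T> void attempt(
      Supplier<CompletableFuture<T>> rpc,
      int retriesLeft,
      long retryWaitMs,
      ScheduledExecutorService scheduler,
      CompletableFuture<T> result) {
    CompletableFuture<T> call;
    try {
      call = rpc.get();
    } catch (RuntimeException e) {
      // The supplier itself threw before producing a future.
      result.completeExceptionally(e);
      return;
    }
    call.whenComplete((value, error) -> {
      if (error == null) {
        result.complete(value);
        return;
      }
      Throwable cause = unwrap(error);
      if (cause instanceof IOException && retriesLeft > 0) {
        try {
          // Retry later on the scheduler thread; the caller's thread never blocks.
          scheduler.schedule(
              () -> attempt(rpc, retriesLeft - 1, retryWaitMs, scheduler, result),
              retryWaitMs, TimeUnit.MILLISECONDS);
        } catch (RuntimeException rejected) {
          // For example, the scheduler was shut down; surface the original failure.
          result.completeExceptionally(cause);
        }
      } else {
        result.completeExceptionally(cause);
      }
    });
  }

  // Dependent stages may wrap failures in CompletionException; unwrap one level.
  private static Throwable unwrap(Throwable t) {
    return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
  }
}
```

A caller would pass a supplier that issues the RPC and completes the returned future from the RPC callback. Because retries run on the scheduler thread, the thread that reported the failure (for example, a Netty event loop thread) is not blocked while waiting.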

Why are the changes needed?

The BlockStoreClient#getHostLocalDirs RPC does not retry when an IOException occurs; instead, a FetchFailedException is thrown.

23/04/24 01:24:55,158 [shuffle-client-7-1] WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [148]
23/04/24 01:24:55,158 [shuffle-client-7-1] ERROR ShuffleBlockFetcherIterator: Error occurred while fetching host local blocks
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745) 

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested locally and added a unit test.

Was this patch authored or co-authored using generative AI tooling?

No

cxzl25 avatar May 30 '24 10:05 cxzl25

@mridulm @otterc Please help review, thanks in advance!

cxzl25 avatar Jun 03 '24 13:06 cxzl25

This slipped through my TODO list - will get back to it later this week, sorry for the delay @cxzl25 !

mridulm avatar Jun 10 '24 20:06 mridulm

@Ngone51 @mridulm Please help review again, thanks in advance!

cxzl25 avatar Jul 29 '24 08:07 cxzl25

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar May 26 '25 00:05 github-actions[bot]