
A timeout is required when fetching blocks

Open jclab-joseph opened this issue 1 year ago • 8 comments

GetBlocks requires a fetch timeout for each block.

Below, we simulated adverse conditions by connecting 10 clients to one server and applying a speed limit.

2025/04/09 17:24:44 PROGRESS: [6.48 6.52 0.36 5.92 6.73 6.07 6.73 6.07 3.15 1.06]
2025/04/09 17:24:45 PROGRESS: [6.48 6.73 0.36 5.92 6.77 6.11 6.73 6.27 3.15 1.06]
2025/04/09 17:24:46 PROGRESS: [6.48 6.92 0.36 6.07 6.96 6.15 6.73 6.46 3.15 1.06]
2025/04/09 17:24:47 PROGRESS: [6.5 6.94 0.36 6.27 6.96 6.27 6.73 6.48 3.15 1.06]
2025/04/09 17:24:48 PROGRESS: [6.69 6.94 0.36 6.3 6.96 6.46 6.92 6.48 3.15 1.06]

There are clients (0.36, 3.15, 1.06) that are stuck and unable to download.

When executing GetBlocks, if fetching a specific block takes too long, the request is not cancelled and simply hangs. To improve this, GetBlocks should give up on the stalled request and look for another peer.

boxo version : v0.29.1


Specifically, the problem occurred when one server (the bootstrap node) held the files and hundreds of clients tried to download them simultaneously. Some clients downloaded successfully, but most got stuck and could not finish. I expected a node that had already received a block to forward it to other nodes, but that didn't happen. periodicSearchDelay is also useless once block reception has already started.

jclab-joseph avatar Apr 09 '25 09:04 jclab-joseph

triage notes

  • @gammazero we need to lean on you here, as you are the most familiar with this code: what are the next steps?

lidel avatar Apr 15 '25 14:04 lidel

Investigating: This may just be a documentation issue about how to set a timeout.

gammazero avatar Apr 22 '25 14:04 gammazero

Triage notes:

  • point to an existing example, if any, or add one

guillaumemichel avatar Apr 29 '25 14:04 guillaumemichel

I temporarily solved this by adding a timer that is initialized when a channel is received in GetBlocks and a retry routine after context cancel.

I will try to put together some sample code soon if I have time. My approach is not great, though. It seems like it would be better to change something in bitswap itself, but I haven't gotten that far.

jclab-joseph avatar Apr 29 '25 14:04 jclab-joseph

@jclab-joseph any reason why you can't pass a context.WithTimeout? This is the idiomatic way of making tasks time-bound, no?

https://github.com/ipfs/boxo/blob/a19e342de9c63fcf55eee628ed498b64eeefc6cd/bitswap/bitswap.go#L28-L29

lidel avatar May 06 '25 14:05 lidel

@lidel That approach cancels the request for all blocks. The issue is this situation:

  1. Block found from Peer-A.
  2. Block is requested from Peer-A.
  3. But there is no response from Peer-A.

So as a workaround, I passed a cancelable context to GetBlocks, canceled it when the timeout occurred, and then called GetBlocks again for the remaining blocks.

jclab-joseph avatar May 06 '25 14:05 jclab-joseph

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

github-actions[bot] avatar May 13 '25 00:05 github-actions[bot]

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

github-actions[bot] avatar May 20 '25 00:05 github-actions[bot]