
feat(iroh-bytes): refactor downloader queue and add progress reporting

Frando opened this pull request · 3 comments

Description

This PR contains changes to the downloader:

  • Remove the Role::Provider vs Role::Candidate distinction. We added this back when we did not have content propagation in docs, and it no longer makes sense for our architecture: either we know/assume a node has something, or we don't. The "in-between" state did not make sense anymore IMO.
  • Rework the queuing logic to be based on a simple queue of pending downloads. Before, if a download could not be started because the concurrency limits were reached, it was considered failed and inserted, with a delay, into the retry queue. Now, if a download cannot be started, we simply wait until a slot frees up and then start it. Note that the queue is, for now, quite simple: if the next download in the queue cannot be started (e.g. because all provider nodes are currently busy with other downloads), we do not try to start the second item in the queue, but instead wait until a slot frees up. We could certainly optimize this by "jumping the queue" in certain cases, but that would also need more logic to make sure that downloads cannot be "forgotten". Therefore, for now, the queue is handled strictly in order (see the sketch after this list).
  • Add progress reporting to downloads managed through the downloader. For this I wrote a SharedProgress handler that allows subscribing to already running downloads: if an intent is registered for hash A and its download is started, and while it is running another intent is registered for the same hash, the second intent will now receive a DownloadProgress::InitialState containing a TransferProgress, which functions as a reducer for the progress events. This can even be used from the client, and further events can be reduced/merged into that struct (see the usage sketch after this list). The PR contains a test for this concurrent progress reporting.
  • Expose the downloader in the iroh node. Download requests via the RPC API can now set a DownloadMode enum to either Direct or Queued: the former behaves as before (issues an iroh-bytes request directly, without a queue or concurrency limits), while the latter adds the download to the downloader queue.
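
Roughly, the in-order queuing described above could look like the following sketch (hypothetical names and structure, not the PR's actual code):

```rust
use std::collections::VecDeque;

/// Hypothetical identifier for a queued download (e.g. hash + format).
struct DownloadKind(u64);

struct Queue {
    pending: VecDeque<DownloadKind>,
    active: usize,
    max_concurrent: usize,
}

impl Queue {
    /// Try to start the next pending download, strictly in order.
    fn try_start_next(&mut self) {
        if self.active >= self.max_concurrent {
            return; // wait for a running download to free a slot
        }
        let Some(next) = self.pending.front() else { return };
        if !self.providers_available(next) {
            // Strictly in order: do not skip ahead to the second item;
            // wait until a slot or provider frees up instead, so that
            // no download can be "forgotten".
            return;
        }
        let next = self.pending.pop_front().expect("checked above");
        self.active += 1;
        self.start(next);
    }

    fn providers_available(&self, _kind: &DownloadKind) -> bool {
        true // placeholder: check per-node concurrency limits
    }

    fn start(&mut self, _kind: DownloadKind) {
        // spawn the actual transfer; on completion, decrement `active`
        // and call `try_start_next` again
    }
}
```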

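And a sketch of how a client might consume the shared progress together with DownloadMode (DownloadProgress::InitialState, TransferProgress, and DownloadMode are names from this PR; the surrounding client API and the `update` method are assumptions for illustration):

```rust
use futures::StreamExt; // assumed: for `.next()` on the event stream

// Sketch only: `Client`, `Hash`, `download`, and `update` are assumed
// stand-ins, not the actual API surface of this PR.
async fn watch(client: &Client, hash: Hash) -> anyhow::Result<()> {
    // Queued routes the request through the downloader queue; Direct
    // would issue an iroh-bytes request immediately, as before.
    let mut events = client.download(hash, DownloadMode::Queued).await?;

    let mut progress = TransferProgress::default();
    while let Some(event) = events.next().await {
        match event? {
            // A second intent on an already running download first
            // receives the accumulated state instead of starting cold.
            DownloadProgress::InitialState(state) => progress = state,
            // Further events are reduced/merged into the same struct.
            event => progress.update(&event),
        }
    }
    Ok(())
}
```
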
Notes & open questions

  • Still need to fix/improve docs

  • The PR still has an open TODO about retry behavior, which as of now is removed in this PR. Before, if a download request failed, the downloader would retry the request after a timeout. Now, I think we should only do this after all providers have been tried, and only then queue retries for the failed nodes, and only if the failure was an IO failure and not if the node reported NotFound (see the sketch after this list). Does this make sense? I wanted to check what the expected behavior is before writing this out.

  • A few more tests around the queuing behavior would be great

  • I have a partially finished follow-up PR which adds a hook for content discovery to the downloader; I will push it next week.
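
To make the proposed retry policy concrete, a minimal sketch (hypothetical FailureKind and function, not code from this PR):

```rust
/// Hypothetical classification of a failed request to one provider.
enum FailureKind {
    /// The node reported it does not have the content.
    NotFound,
    /// Connection/transfer error; the node might still have the content.
    Io,
}

/// Only queue a retry for a node after all providers have been tried,
/// and only if that node failed with an IO error rather than NotFound.
fn should_retry(all_providers_tried: bool, failure: &FailureKind) -> bool {
    all_providers_tried && matches!(failure, FailureKind::Io)
}
```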

Change checklist

  • [x] Self-review.
  • [ ] Documentation updates if relevant.
  • [x] Tests if relevant.

Frando avatar Mar 14 '24 23:03 Frando

@Frando what do you think about moving the progress reporting out to another PR? That part is mostly self-contained, and as far as I can tell it has been reviewed by everyone and is good to go. The queuing is a different issue that I think still needs more work, since in terms of download reliability the current state of the PR, without retries, is much worse than before.

divagant-martian avatar Apr 08 '24 15:04 divagant-martian

> @Frando what do you think about moving the progress reporting out to another PR? That part is mostly self-contained, and as far as I can tell it has been reviewed by everyone and is good to go. The queuing is a different issue that I think still needs more work, since in terms of download reliability the current state of the PR, without retries, is much worse than before.

I will have a look tomorrow; not sure off the top of my head how much work the split is, but it should be doable.

I'm not too sure, though, that we are worse in practice than before. We always retry for 10 seconds in our configuration (the default quinn timeout). We also continue using a node if it returned NotFound. We do not retry a node that "failed" for more than 10 seconds, or that had an unrecoverable failure (bad protocol, or bad QUIC). I do think we should add retries here, but I'm not sure it is essential. On main, we handle some more retries, but we have another problem, I think, which is fixed by this PR: if a lot of downloads are queued and the resource limits are reached, newly queued downloads end up in the retry queue with increasing timeouts, so they will keep waiting even after slots free up, and they might be retried while the limits are still reached, making them exceed their retry budget (see the sketch below).
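
A minimal sketch of that failure mode, with hypothetical names and delays (not the actual downloader code):

```rust
enum Action {
    RetryAfter(u64), // delay in milliseconds
    Fail,
}

// Illustrative only: on main, a download that merely hit the concurrency
// limit is treated like a failure and re-enters the retry queue with a
// growing delay, burning its retry budget without any real error.
fn on_cannot_start(attempt: u32, max_attempts: u32) -> Action {
    if attempt >= max_attempts {
        return Action::Fail; // budget exhausted without a real failure
    }
    // Each pass through the retry queue increases the delay, so the
    // download keeps waiting even after slots have freed up.
    let delay_ms = 500u64 * 2u64.pow(attempt);
    Action::RetryAfter(delay_ms)
}
```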

Frando avatar Apr 08 '24 22:04 Frando

Here is a PR on top of this one which adds retries: #2165

Frando avatar Apr 09 '24 16:04 Frando