Processing URLs in batches
Basic summary
As a beginner in Rust, I would like to add to this thread with our real-life experience. We are currently facing issues which make me relate to this story (and which are preventing us from switching to Rust):
We are trying to rewrite some of our services from Python to Rust and are looking to achieve the following:
- Read a bunch of URLs (size varies, but about 1000 per batch)
- Do an HTTP GET request for each URL asynchronously
- Log the failures and process the results
What we have not succeeded in doing so far:
- Send the requests in batches. If we send all 1000 requests at the same time, our server closes the connection and the process panics. Ideally we could buffer them to send at most 50 at a time. We could split the batches manually, but we hoped the HTTP client or the `FuturesUnordered` container would handle that for us.
- Handle errors. Failures should be logged and should not crash the process. We plan on using tracing-rs for the logging as it is part of the tokio stack.
- Implement a Fibonacci or exponential retry mechanism on failure.
For reference, here is the Stack Overflow question where I was looking for help.
Originally posted by @rgreinho in https://github.com/rust-lang/wg-async-foundations/issues/95#issuecomment-811397783
Can you say a bit more, @rgreinho? What caused you not to succeed, for example?
We're still investigating, so hopefully we will find solutions soon :)
But here are our two big blockers so far:
- Not sure how to limit the number of requests sent at the same time asynchronously. In Python, the aiohttp web client can be configured with a smaller connection pool size and handles it for us. We hoped that reqwest would do the same. Note that we're not specifically attached to reqwest and could use surf or hreq, but unfortunately we found out that they behave the same way. Our second hope was that the `FuturesUnordered` container would allow us to manage this. A comment in the SO question pointed us to this question, where they create a `Stream` and apply the `buffer_unordered()` method on it. That will be our next attempt. Coming from a Python background, we hoped we could simply use the same pattern, where the `asyncio.gather()` function from the stdlib executes the coroutines and collects the results; that's why we went with the `FuturesUnordered` container first.
- We do not know how to retry failed requests. We did find the backoff crate and the tokio-retry crate, but they don't seem to work well with `FuturesUnordered`, or at least we did not manage to get them to work together. In Python we use tenacity to decorate our functions, and if an exception is caught, it re-runs them for us.
We are also having problems with the error handling. We could not get the `map_err()` and `or_else()` methods to work as expected. But this is probably simply due to the fact that we're new to this and did not use them properly. I'm sure we will figure it out soon. Same thing with the logging; the tracing-rs library looks fantastic.
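On the `map_err()` front, the usual pattern for "log and continue" looks roughly like the sketch below. It uses only the standard library; `parse` stands in for any fallible step, and the `eprintln!` would be a `tracing::warn!` in the real service.

```rust
// Sketch of logging failures without crashing, using only std.
// `map_err` enriches the error with context; `unwrap_or_else` supplies
// a fallback value instead of panicking.
fn main() {
    let inputs = ["12", "oops", "7"];
    let total: i32 = inputs
        .iter()
        .map(|s| {
            s.parse::<i32>()
                .map_err(|e| format!("could not parse {s:?}: {e}")) // add context
                .unwrap_or_else(|msg| {
                    eprintln!("{msg}"); // tracing::warn! in a real service
                    0 // fall back instead of crashing the process
                })
        })
        .sum();
    assert_eq!(total, 19); // 12 + 0 (failed parse) + 7
    println!("total = {total}");
}
```

The key point is that `map_err` only transforms the error variant; it is `unwrap_or_else` (or a plain `match`) that decides whether a failure panics or is absorbed.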
@rgreinho any chance you want to join a "vision doc writing session" and talk about this? What time zone are you in? :)
This week's writing sessions -- I expect we'll schedule more for next week.
Sure thing! Not sure how that works, or what exactly is expected from participants, but I'll be glad to help.
I am in the CDT time zone.
The basic format is that the host asks you (and others) a bunch of questions about your experiences and then we try to collectively write a story about one of the characters. It's fun. :)