
Rewrite casync-http to use curl multi

Open elboulangero opened this issue 5 years ago • 7 comments

This commit brings parallel downloads to casync-http by using the curl multi interface, more precisely its "select()" flavour. For details see https://curl.haxx.se/libcurl/c/libcurl-multi.html.

Libcurl has two ways to achieve parallel downloads:

  • for HTTP/1, it can open parallel connections. The maximum number of parallel connections is user-defined, through MAX_HOST_CONNECTIONS and MAX_TOTAL_CONNECTIONS.
  • for HTTP/2, it can attempt to multiplex on a single connection. The maximum number of parallel downloads in this case is negotiated between the client and the server (in HTTP/2 jargon, the number of streams); see the sketch after this list.
  • (note that libcurl used to do pipelining over HTTP/1.1, but this is no longer supported since 7.62.0, and casync-http doesn't use it anyway)
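
For illustration, here is roughly how these two modes map onto the libcurl multi API. This is a hedged sketch, not casync-http's actual code, and the connection limits are arbitrary example values:

```c
#include <curl/curl.h>

static CURLM *make_multi(void)
{
        CURLM *multi = curl_multi_init();

        /* HTTP/1: cap the number of parallel connections. The values
         * here are arbitrary examples, not casync-http's defaults. */
        curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 4L);
        curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 16L);

        /* HTTP/2: allow multiplexing several transfers onto a single
         * connection; the number of concurrent streams is negotiated
         * with the server, not set by us. */
        curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);

        return multi;
}
```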

Accordingly, this commit also introduces two new command-line arguments to better control the behavior of parallel downloads:

  • --max-active-chunks is the sum of 1. the number of chunks in the curl multi and 2. the number of chunks downloaded and waiting to be sent to the remote. It limits the number of chunks waiting in RAM, in case we download faster than we can send to the remote, and it also caps the maximum number of concurrent downloads.
  • --max-host-connections is for the case where libcurl opens parallel connections to the server. In all likelihood it only matters for HTTP/1.

We probably want a large number for max-active-chunks, to ensure we don't starve the libcurl multi handle, but at the same time we probably don't want to open too many connections in parallel, which is why max-host-connections is a much lower number. These defaults seem sensible, according to my understanding so far. Users might want to adjust these numbers for their specific use-cases.
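To make the semantics of --max-active-chunks concrete, here is a sketch of how new downloads could be gated; the names n_inprogress, n_completed and next_chunk_handle() are hypothetical, for illustration only:

```c
/* Keep adding downloads only while we have room; "active" counts both
 * transfers in the multi and completed chunks not yet sent to the remote. */
while (n_inprogress + n_completed < arg_max_active_chunks) {
        CURL *h = next_chunk_handle();  /* easy handle for the next ready chunk */
        if (!h)
                break;                  /* no more chunks to request */

        curl_multi_add_handle(multi, h);
        n_inprogress++;
}
```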

Note that the command-line argument --rate-limit-bps doesn't make much sense anymore: it applies to each chunk, but chunks are now downloaded in parallel, and we don't really know how many downloads are actually happening in parallel. And from Daniel Stenberg:

We don't have settings that limit the transfer speed of multiple,
combined, transfers.

So we might want to completely remove this option, or rework it somehow.
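
For context, assuming --rate-limit-bps maps onto libcurl's per-transfer receive cap (my reading, not something confirmed in this thread), the limit is inherently per easy handle:

```c
/* CURLOPT_MAX_RECV_SPEED_LARGE limits one transfer, i.e. one chunk.
 * With up to max-active-chunks transfers in flight, the aggregate
 * download rate can reach roughly N times this value. */
curl_easy_setopt(handle, CURLOPT_MAX_RECV_SPEED_LARGE,
                 (curl_off_t) arg_rate_limit_bps);
```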

Note that this commit removes the wrapper robust_curl_easy_perform() introduced in 328f13d. Quick reminder: this wrapper was used to sleep and retry on CURLE_COULDNT_CONNECT, and allowed us to work around what seemed to be a misbehavior of the Python Simple HTTP Server. Now that we do parallel downloads, we can't apply this workaround as is: we can't just sleep. So I removed the wrapper. The issue is still present and reproducible, but I assume it's a server issue, not a casync issue.

Regarding unit tests

This commit also opens interesting questions regarding the unit tests. For now we're using the Python Simple HTTP Server, which can only serve requests sequentially, so it doesn't really exercise parallelism. It's also not representative of real-life scenarios where, I assume, chunks are served by a production server such as Apache or nginx. Additionally, I think it would be best to run the tests against both an HTTP/1 server and an HTTP/2 server.

One possibility is to use nginx; it's easy enough to run. However, nginx can serve HTTP/2 only if TLS is enabled, and out of the box casync-http will fail if it can't recognize the certificate. So we might want to add a command-line argument to trust any random certificate.
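A hypothetical flag of that kind would presumably just disable libcurl's certificate verification on each easy handle, something like:

```c
/* Test-only sketch: accept any certificate by disabling peer and
 * host verification. This defeats TLS authentication, so it would
 * have to be gated behind an explicit command-line flag. */
curl_easy_setopt(handle, CURLOPT_SSL_VERIFYPEER, 0L);
curl_easy_setopt(handle, CURLOPT_SSL_VERIFYHOST, 0L);
```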

Additionally, nginx requires root, which is maybe not very suitable for a test suite; however, there might be workarounds.

Another possibility is to run nghttpd. This option is light on dependencies, in the sense that libcurl already relies on libnghttp2; however, I haven't succeeded in using this server yet.

TODOs

There are a few todos left, mainly:

  • add argument or env variable support in casync-tool, to match the new arguments in casync-http.
  • documentation, readme and such.
  • some details in the code, see the MR on GitHub.

elboulangero avatar Mar 23 '19 11:03 elboulangero

Ping?

elboulangero avatar Apr 09 '19 12:04 elboulangero

hmm, so if I get this right, then in this PR we'll synchronously run the curl logic, then synchronously the ca_remote logic, and then the curl logic again, and so on. Ideally both logics would run asynchronously in the same poll() loop, i.e. so that we get the fds to wait for out of curl and out of CaRemote (the latter is easy since it's only two) and pass them to the same poll() call. Can we make that happen with libcurl? I don't know curl well enough, I must admit, and the reason I did my original http client the way I did (with its synchronous behaviour) was laziness. But if we fix this for multiple getters, maybe we can fix that too?

Not sure I understand you, so let me explain in a bit more detail how things work.

There's actually only one poll for all events (curl and casync together). It happens in curl_multi_wait() at https://github.com/systemd/casync/blob/3c1f551e53d43592813eff93a90f0190a952f902/src/casync-http.c#L999

The advantage of curl_multi_wait() is that we don't have to gather all the fds that are in use by curl and feed them to poll manually, because that's exactly what curl_multi_wait() already does. Additionally, this function allows the user to pass in additional fds to add to the poll, and that's what we do here: we hand it casync's two fds, so that they're part of the poll.
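To illustrate, here's a minimal sketch of that call; remote_input_fd and remote_output_fd are placeholders for the two fds obtained from the CaRemote object:

```c
struct curl_waitfd extra[2] = {
        { .fd = remote_input_fd,  .events = CURL_WAIT_POLLIN  },
        { .fd = remote_output_fd, .events = CURL_WAIT_POLLOUT },
};
int numfds = 0;

/* One poll for everything: curl's own fds plus casync's two. */
if (curl_multi_wait(multi, extra, 2, 1000 /* ms */, &numfds) != CURLM_OK)
        return -EIO;

/* numfds only says how many fds fired, not which ones, so the
 * caller runs both the curl and the casync logic afterwards. */
```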

If you want to peek into curl's implementation of curl_multi_wait, here are some quick links:

  • Calling the poll wrapper in curl_multi_wait: https://github.com/curl/curl/blob/c4e0be44089408d65220ab0174ad3443724092a0/lib/multi.c#L1103
  • Calling poll proper (if poll is supported): https://github.com/curl/curl/blob/c4e0be44089408d65220ab0174ad3443724092a0/lib/select.c#L439

So it's basically an easy way to poll on both the casync and curl fds. The downside, maybe, is that when we come out of this function, we don't know which fds were triggered; all we can know is how many fds were triggered (which doesn't help much). That's why a "loop iteration" involves running both the curl logic and the casync logic, in case something happened on either side (we can't be sure).

That could be done differently: we could do the poll ourselves instead of using curl_multi_wait(), but basically that would mean replicating what curl does in curl_multi_wait(), and I don't see any benefit. But I'm no poll expert, so I might be missing something.

Also, I am not sure why we need a queue for the requests?

So we need to keep track of the active requests somehow. An active request is either an easy handle added to the curl multi, or a chunk that has been downloaded and is waiting to be sent to the remote.

I implemented a queue mechanism because it seemed suitable, and having 3 queues for the 3 possible states (ready, in progress and completed) seemed to make things easy. For example, by keeping a bit of stats inside the queues (n_added, n_removed), it's then super easy to display statistics at the end. And this is valuable to see where time is spent.
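As a rough sketch of the idea (types and field names here are illustrative, not the actual patch):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only; the actual patch's types differ. */
typedef struct ChunkQueue {
        void **items;          /* queued chunks, capacity = max-active-chunks */
        size_t head, tail, capacity;
        uint64_t n_added;      /* total ever enqueued */
        uint64_t n_removed;    /* total ever dequeued */
} ChunkQueue;

/* Occupancy falls out of the counters for free, which is what
 * makes the end-of-run statistics cheap to collect. */
static size_t chunk_queue_size(const ChunkQueue *q)
{
        return (size_t) (q->n_added - q->n_removed);
}
```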

For example, while testing locally (i.e. a super fast localhost network), I can see that the average size of the completed queue is 62 (out of 64 active chunks), which means that most of the time we're waiting for chunks to be sent to the remote. So the casync communication between the two casync processes is the bottleneck.

OTOH, while testing with a remote server, things are obviously different, and this time it's the average size of the in-progress queue that is close to 64, which means that most of the time we're waiting for chunks to be downloaded.

Of course, this is not the only possible implementation; instead of having 3 queues, we could have a "state" enum and, for each chunk, keep track of the state it's in. No more queues, just a static array of chunks. But I'm not sure it would make the code easier to read and maintain.

Performance-wise, I don't think there's any downside to having these 3 queues. It's a small amount of data (max-active-chunks × 3). Right now we use dynamic memory only because at some point I wasn't sure whether the queue would be of fixed size. It turns out that the size is static (fixed by the parameter --max-active-chunks), and if we agree on this implementation, I'll rewrite that part to make it static and remove all the malloc/free.

note that CaRemote already implements one anyway?

This part I don't really understand. Can I make use of that on the casync-http side?

elboulangero avatar Apr 10 '19 04:04 elboulangero

(Note that I fixed all the details you mentioned above. Since it was trivial enough I hit the "Resolve" button myself, but truth be told, I'm not familiar with reviewing stuff on GitHub, and maybe you prefer to hit "Resolve" yourself, so please don't hesitate to tell me how it should be done.)

elboulangero avatar Apr 10 '19 04:04 elboulangero

Hi, and sorry for taking so long; lately I've been underwater due to other projects.

I split out all the work that is not really related to multi http requests, but is more refactoring, into https://github.com/systemd/casync/pull/219, which can be merged to master if it's good enough. It's mostly shuffling code around, nothing controversial I think, but I'm sure it will benefit from a round of review / rework.

Due to other ongoing tasks on my side, I won't be able to work further on this, so let me introduce @gportay, who will keep the good work going. I'm still around and will help him catch up.

Thanks for the collaboration, and I hope to see you again through more casync PRs!

elboulangero avatar Jun 25 '19 09:06 elboulangero

It would be great to have some tests for this. It is far from trivial ;) I agree we should switch to one of the "real" implementations, but I'm not sure which one is appropriate. By nghttpd do you mean https://www.nghttp2.org/? That seems to be dead (website from 2015) and is not packaged for Fedora, which would make things difficult for us.

I think nginx is best, as it's real in the sense that it powers the Internet for real, and it's also easy enough to set up for local tests. The only downside I see is that it requires root, and in Debian we build packages and run test suites without root. I don't know about Fedora.

I also proposed nghttp2 because it's a dependency of curl (in the form of libnghttp2), so it's already around when we build casync. I'm not familiar with the project though. It seems to be maintained: https://github.com/nghttp2/nghttp2/graphs/code-frequency.

I think it'd be easier to review this patchset if some parts were split out. E.g. the part to move arg_protocol enum higher could be separated, and it would just make things easier to review. I suggest some other parts to split out inline.

Done in https://github.com/systemd/casync/pull/219

Like @poettering suggested, we want to pull in the list.h implementation from systemd. This will remove a large chunk of this patch too.

@gportay will take care of that.

As for the general approach, I think it's reasonable. According to the documentation, the curl multi interface supports both doing poll internally and integrating into an external event loop. I think it's reasonable to start with the approach you picked. We have the option to switch to an external loop later on if necessary.

Yes indeed. It's very easy with curl, as you basically just call curl_multi_wait() and curl_multi_perform() one after the other. It's a bit more difficult with casync, though: you have to understand the protocol that the casync remotes speak very well, to be sure that you make the right calls and go to poll at the right moment, after you've done everything you could. Otherwise it's easy to get stuck.
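For reference, a skeleton of such a loop; ca_remote_step() and the CA_REMOTE_* results reflect my reading of casync's caremote.h, and all event handling is elided:

```c
for (;;) {
        int still_running = 0, numfds = 0, r;

        /* Drive all active downloads as far as they will go. */
        curl_multi_perform(multi, &still_running);

        /* Drive the CaRemote state machine until it has nothing
         * left to do but wait; handling of chunk requests and
         * completions would go inside this loop. */
        do {
                r = ca_remote_step(remote);
        } while (r >= 0 && r != CA_REMOTE_POLL && r != CA_REMOTE_FINISHED);

        if (r < 0 || r == CA_REMOTE_FINISHED)
                break;

        /* Only now go to sleep, on curl's fds plus casync's two. */
        curl_multi_wait(multi, extra, 2, 1000, &numfds);
}
```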

elboulangero avatar Jun 25 '19 10:06 elboulangero

I have force-pushed a new version. It is based on #219, with more atomic commits.

gportay avatar Jul 04 '19 19:07 gportay

@poettering, @keszybz is there any chance you have time to review all the PRs? I assume you are very busy these days :) Thanks anyway.

gportay avatar Jul 18 '19 03:07 gportay