Gracefully handle closed keep-alive connections
We naively assumed that we could re-use an existing HTTP connection when using keep-alive. That's often true, but definitely not always: the server is permitted to close the connection. Handle this gracefully - httpclient will return GIT_RETRY if the first read of a response on a kept-alive socket returns 0 (EOF). Retry exactly once in this case.
Note that we're not checking for failures to write the request. We should probably do that in this PR, but usually it's the read that actually fails on a closed socket. I suspect there are plenty of failure modes in the write case, too.
cc @arroz
@ethomson When I was reading your patch, I realized the two functions that do the replay (`http_stream_read` and `http_stream_write`) are exactly the place I was looking for before: the request buffer still exists, a connection is open (if necessary), the request has been sent, and at least part of the response has been read back, all inside the HTTP stack.
I think we can contain the retrying mechanism to these two functions. It would probably be another loop around the sequence of generating the request, sending it, and reading the response. This is enough since the keep-alive problem only happens between requests, so we don't have to worry about it on subsequent reads, when `stream->state != HTTP_STATE_NONE`. My guess is we should probably not use `replay_count` (which AFAIK is used for authentication probing and handling redirects) for this, but another variable that allows us to try only twice, resetting `replay_count` so the request can go through all the same steps again.
The hard part is replacing the socket/TLS transport with a fresh one from there, but I think it's doable (maybe there's already an API for that, didn't check yet).
This would be nice, since we wouldn't need to leak this detail to the upper layers that deal with a generic transport and not http specifically. It would also avoid the need of dealing with retrying all over the place (at least in two different places of the fetch negotiator, and one on the push negotiator).
I can give this a try tomorrow and over the weekend, I'll send you the patch assuming I don't hit any unexpected wall.
Hum, looks like `http_stream_write` can send requests without immediately reading the response back if `request.expect_continue` is 0. Then, `http_stream_read_response` is eventually called from `git_smart__recv_cb` and fails, and we can't retry there because we are already out of our nice two functions that have enough information to replay the request. 😩
Yes, this is disappointing. But I think that a refactoring here would be a deep investment. 😢
I was trying to store the initial buffer/len for a write request in the `http_stream` to see if we could get away with automatically retrying on subsequent operations, but it doesn't work… on push, after the initial payload, the pack writer keeps calling the write function, and this only fails on the second write, not the first.
I guess we really need to handle the retrying loop at the upper layers… I'll try to write a function that takes a closure with the logic that sends a request and processes the response and see if we can at least wrap the retrying loop nicely.
Was there any debate in the past about replacing all the http code with any library that may exist out there (libcurl, being highly portable, being a good candidate)? Seems like this should be a solved problem that we could just leverage.
We used to use libcurl, but it adds another dependency that we have to manage, which is particularly annoying on Windows. Plus it gives us less control over crypto settings.
We wanted to move to a single http provider - having too many different variants (it used to be three) is awful for many reasons. After https://github.com/libgit2/libgit2/pull/6533 the end is in sight for bringing this number to one.
Making that one libcurl would probably have been possible, but vendoring it versus linking against the existing one, and dealing with its crypto stack settings, is very annoying. (Nor, I think, did it solve this problem for us, but I may be misremembering.)
I think inverting the structure of the "smart" transports so that we have an http transport and an ssh transport that get the high level commands (like upload pack) and then use a shared smart handling library would resolve this reasonably well.
I've never really liked the structure. It seems like in the status quo, adding a new "smart sub transport" would be easy, since you're just given the git protocol stream directly. Unfortunately that's not really true, since you're trying to juggle connection management with an imperfect view of what you're actually doing.
I wonder if it's actually not all that hard to actually do this refactoring. 🤔
So, a draft on top of your PR: https://github.com/libgit2/libgit2/pull/6542
I haven't addressed push yet, but for fetch, this kinda works! This is incredibly hard and annoying to test, since it never really fails the way I expect it to when pausing the debugger in various places. But adding `sleep(6)` statements in the fetch negotiator, which is where things will realistically take time, DOES work, and the connection is reset and reconnected.
Some thoughts so far:
- I'm not sure if the reconnect happens by accident or if the code is ready for this. During the second pass of `http_client_connect`, `connected` is 1 and `keepalive` is 0. The reason is the header parser sets `keepalive` to 0 and couldn't parse the response and set it to 1, since the server had already closed on the previous pass. Mayyybe this is how it was designed to work? Or just happens to work? But it works!
- If I pause the debugger at lower levels of the streams/socket etc. I can sometimes make this fail with "broken pipe" or "connection reset by peer". Since this is rather confusing, it's possible I'm actually causing the server to reset due to a timeout mid-request, and not between requests, which I guess would be fine. But we should probably consider interpreting the `errno` after all the socket calls and see if it smells like a dropped keep-alive connection.
- I think you're right in suspecting using `GIT_RETRY` will cause problems. We probably should have another internal error for this situation. For example, in your comment https://github.com/libgit2/libgit2/pull/6541#discussion_r1153330134, if I understand correctly, it works but in a weird way: it will retry `GIT_HTTP_REPLAY_MAX` times but never succeed, because the connection is never actually reset, so it fails every time with the same error.
- Maybe we should have a `t->supports_keep_alive` like we do `t->rpc`, set that to true only for the HTTP transport, and only run the keep-alive retrying if that flag is true? ~~I have no idea if this actually works at all with SSH connections, although I think those don't have the same problem (they're supposed to remain open until closed explicitly).~~ (SSH has `rpc = 0`.)
Anyway, I'll try to address push next, since we can fix the actual dropped connection mechanism later.
Update: push is now working. However, it required bubbling the error up explicitly from the transport. Some transports are already more detailed than others (openssl, for example). Both stransport and socket were just returning -1, so they now return `GIT_EEOF` if the error from below looks like a dropped connection (I'm cheating slightly in socket by treating broken pipe and connection reset by peer as EOF to make things simpler, but it's no worse than the status quo of just returning -1). I also changed `git_stream__write_full` to forward the error instead of converting it to -1.
I'll do the same for the read functions tomorrow, to improve the error detection.
I won't be able to do this for the Windows transport stuff, though; I have no way to test or even compile it. I'll try to do it for openssl as well, by disabling stransport.
OK, this should be ready to merge into yours. I standardized on `GIT_EEOF` for dealing with the keep-alive issue, since `GIT_RETRY` is used for auth and HTTP redirection.
I updated the secure transport, socket, and OpenSSL transports; I didn't touch the others. They will need to return `GIT_EEOF` when the connection is dropped as well.
This seems to be working for both fetch and push: the connection is re-opened if it takes too long to send the next request.