trio icon indicating copy to clipboard operation
trio copied to clipboard

Possible race condition in TLS handshake on windows

Open Badg opened this issue 2 years ago • 3 comments

Hello hello! I've said this before, but it definitely bears repeating: thanks for all the hard work on Trio. It's still part of my go-to async toolbox, even after so many years and kilometers moved :)

The problem

I'm getting two different errors, both of which ultimately result in a broken resource. They have the same source call site -- await stream.do_handshake() (to force the TLS handshake so that an SNI is available). They're very unpredictable:

  • I can't see any pattern as to which of the two errors gets raised
  • So far, the first request always succeeds
  • Sometimes, the second request already encounters an error
  • Sometimes, a few tens of requests get through before finally erroring

As far as I can tell (again, these are transient errors, so I can't be 100% sure):

  • It appears that if I comment out the call to do_handshake and return a static SNI hostname, everything works as expected. Or at least, with this one single change, I was able to do several hundred requests spread out over however long it took me to eat dinner
  • I still saw errors after removing the .sni_callback = sni_callback line, so it doesn't appear that the callback itself is affecting things
  • I still saw errors after adding trio.sleep calls before stream.do_handshake, so that probably doesn't have any effect (not that I'd expect it to)
  • I still saw errors after changing the port mapping (trying to rule out some unknown weird windows thing)
  • It's definitely not somehow reusing socket objects or something truly bizzare
  • (there may be some other things I tried that I'm forgetting)

Could this be a race condition in the TLS handshake? Partial handshakes come to mind, but... TLS is definitely not my forte, so I have no idea if that could be relevant or not.

I've included both the source code (it's a WIP, so don't mind all of the construction mess) and screenshots of the errors (sorry... windows copy/paste from the console is garbage) in this gist. I also left in all of the machete-mode-debugging-print calls, so you can see where I instrumented while troubleshooting.

Background

In my free time I'm playing around with building myself a slick development environment, because I have tons of ideas, and I want to make it as easy as possible to execute on them. There are a lot of moving parts here, and in most cases I'd say this is completely unnecessary. But hey, it's my free time, yknow? The relevant bit is this:

  • When I'm testing stuff in my browser, I want to use TLS, and I want to use normal hostnames. So for example, I want to be able to say https://fooproject.<subdomain.i.own>, https://barproject.<subdomain.i.own>, etc, and have these be routed to the appropriate docker container
  • I might be running multiple projects at the same time, and they might depend on each other
  • I should be able to create a new project purely by cookiecuttering templates; I don't want to be mucking about in nginx config files
  • I want the destination port to be dynamic. In other words, I want each project docker container to just be able to select any open port, and then the terraform-based launching service can just read the port allocation and dynamically update the corresponding port for the project domain
  • I also want to be able to access non-http resources, for example, postgres and redis, also using the same host-based routing

So the solution looks like this:

  1. Run unbound on my local machine. Normally it just forwards to cloudflare's DNS, but for wildcard domains *.local.taev.dev (which I own), it has a hard-coded response for 127.0.0.1
  2. create a CSR locally on my devhost for a DV wildcard cert for *.local.taev.dev
  3. forward that CSR to a control plane server that relays it to let's encrypt and handles the actual domain validation challenge
  4. return the signed cert back to my devhost
  5. within the devhost application, I've written a trio-based reverse proxy. It terminates the TLS connection, forwarding requests via TCP to the destination docker container. (I'll also add path-based routing for HTTP, but haven't quite gotten that far yet). The host name for routing comes from the TLS SNI callback.

Step 5 is where I've run into trouble.

Badg avatar Feb 08 '23 23:02 Badg

Huh, this is mysterious.

The ssl.SSLSyscallError is probably the ssl module being annoying: it tends to raise that whenever openssl returns an error but it can't find a more specific error to return. In this case, openssl isn't doing any syscalls; we use it in sans-I/O mode, where it just does cryptography on data buffers. So....... I guess something is wrong with the handshake data... somehow? I guess one thing that could plausibly be wrong would be if the remote side closed the connection abruptly, so the handshake data was truncated.

The WinError 10053 is weird -- that's ECONNABORTED, which I thought could only happen when the initial TCP handshake breaks somehow, so it should only be reported from accept or connect. But here we're getting it from recv. What does that mean? idk.

I don't think there should be any race conditions in do_handshake... it shouldn't be manipulating any global state, and even if you call it multiple times at once on the same connection, there's some interlocking to make sure that only one connection actually does the handshake, and the others wait. You could get weird results if you're cancelling do_handshake, but I don't think you are?

I think our only really solid clue is that the explicit call to do_handshake fails, but if you skip it then everything works. That's super weird though, because if you don't call do_handshake explicitly, then as soon as you try to send or receive on the connection then trio will call do_handshake for you. (Actually, to double check: in your test with the static SNI over dinner, did you send data back and forth after the connection once it was established, or just make the connection and then immediately close it again?)

If this were happening on my machine my next step would probably be to pull out wireshark to see what exactly is happening on the network on the successful vs failed cases. idk if you're comfortable wireshark though.

njsmith avatar Feb 10 '23 08:02 njsmith

Ah, that reminds me about a particular plot-thickening. Before I reread the Trio docs and saw do_handshake, I was reading a very small amount of data -- single digit bytes -- using receive_some, and then passing around that chunk as an initial_bytes parameter through several different functions. Still correct, and ultimately equivalent, but, well, pretty gross. But here's the thing: I also got errors doing things that way. So calling receive_some with a very small amount of data, followed by async iterating, caused the same error. That's part of why I thought it might be a race condition.

I'm definitely not deliberately cancelling the handshake, and I don't know how else it could get cancelled. I'm not swallowing any BaseExceptions and I'm not ctrl+C'ing, so... as best I can tell, "the error is coming from inside the house".

The way I've been testing the reverse proxy has been to just set up two different python -m http.server <port> as backends. And that's another thing to mention, there's been a full HTTP request/response cycle ahead of this. I did get a little confused that I was seeing exactly 1 TCP connection per HTTP request (this is still the case, even with it now working -- see below); it seems a little weird to me that http.server wouldn't support keepalive, but it's definitely not the strangest thing ever. So at any rate, during the first connection, data is definitely exchanged from both sides. During the second connection, the error happens before the routing, so it never even gets a chance to open up the connection to the backend server.

As a quick side note, I've refactored the code to do the routing lazily after the first all to receive, so I don't have any urgency here. But I have the old code in my git history, and "be kind, rewind" and all that! To that end: my comfort level with wireshark is "welp, I've never used it, but I've been meaning to add it to my toolbox". Unfortunately my time right now is really limited. I'll try to take a stab at it sometime this week, but I'm going to have to timebox myself to a day's worth of free time. I'll update (even if just to say I didn't finish) hopefully sometime next weekend.

Badg avatar Feb 11 '23 23:02 Badg

I didn't get a chance to look at this yet -- I've gotten really sucked into implementing the actual reverse proxy for HTTP, which has been... surprisingly tricky actually, I hadn't anticipated how complicated things get when you try and make the proxy<->client connection independent of the proxy<->upstream connection (so that even if an upstream server doesn't support keepalive, the downstream client can). ~~That being said, I'm still hoping to try to get a wireshark capture. If I do, I'll edit this post; however, if I don't get to it within 2 weeks, I've probably reached the "point of no return" in terms of focus drift.~~

Edit: @njsmith I got a wireshark capture; also, it appears that the issue has returned (after a second refactor). It looks like the browser is sending a RST packet immediately after the Server Hello? Anyways, I added the capture to the original gist, direct link here

Badg avatar Feb 19 '23 12:02 Badg