vpnkit
vpnkit copied to clipboard
disconnect sockets when they're removed
Hiya! I've been plagued by docker/for-mac/issues/1374 for a while now, so I wanted to dig in and see if I could do anything about it. I think I might have found a potential fix for it, but here's the kicker: I can not for the life of me figure out how to compile VPNKit, with or without this patch.
So setting aside the "this hasn't been tested because it hasn't been compiled" thing for a minute, would you be up for checking my thought process?
- It was mentioned here that the problem might be coming from VPNKit rather than docker for mac
- It was mentioned in a few places (including here) that people have noticed a buildup of sockets in a
CLOSE_WAIT
state - It's mentioned towards the end of this section in the documentation that the Mirage stack is shut down for a src/dest pair if that pair hasn't seen traffic within a certain time frame
So I'm wondering if shutting down that Mirage stack is leaving sockets around that need to be formally closed.
I'm in a bit (a lot) over my head jumping in here so I'd love to treat this PR as a discussion point to see if I can't help track down the cause of docker/for-mac/issues/1374, if that's something that would work for y'all?
Back to the compiling thing: The instructions in the README for compiling don't seem to quite cover it. The guide here gets me closer, but I still run in to issues. If anybody has steps I can follow to configure ocaml on my Mac and get vpnkit to compile, I'd love to see them! If not, I'll keep chipping away at it and try to submit a different PR with instructions
Thanks for your contribution!
I am working on refreshing the dependencies of vpnkit to make it easier to compile, but this needs quite few changes to be approved by @djs55 first (who is coming back from PTO next week and who will be able to comment more on your issue).
Brilliant, thank you!
@TylerLubeck thanks for taking a look at this! I hadn't considered a possible interaction between the silent filter rule dropping and the connection leak.
If the remote machine closes the connection then the socket on the Mac host should enter the CLOSE_WAIT state (as observed) and the call to Mirage_flow.proxy will call close
on the half of the connection going to the VM which should transmit a segment with the FIN flag set. If the filter rule was dropped then replies would never be heard (I believe new outgoing packets would be sent to a fresh TCP/IP stack and ignored). The original connection in the original stack would remain in the state Send_fin
for a long time (I suspect).
There's also a (slightly tacky) background keep-alive thread whose job is supposed to keep prodding the host to make sure the connection still exists but this may not be working. I notice the Port.write function does not update the last_active_time
but the packet receive callback does. So if the keep-alive were not provoking the VM to send a packet, the last_active_time
would not be updated making the timeout more likely to hit.
I'll see if I can add some more debugging information, so we can see more easily what's going on. Thanks for your help so far!
So, to be honest with you, a lot of what you said is going over my head. I have no particular reason to suspect that it's a bug in the silent filter rule dropping other than that that's the first area I traced it to that looks kinda funny.
I'd love to get to a point that I can compile things and try them, but until then I would be more than happy to take any binaries you can produce and test them either myself or with a handful of people I work with that are experiencing the same issue if that's something you'd be up for.
@TylerLubeck if you do:
$ opam remote add vpnkit ./repo/darwin
$ opam update
$ opam install --deps-only vpnkit
$ make
that should now work. If not, please report any issue and I will fix it :-)
We are still trying to upgrade all the dependencies to the latest version (so the opam remote add
line would not be needed anymore) but it a bit more painful than expected, see the discussions on #262
EDIT: replaced slirp
by vpnkit
Hey @samoht, two things came up:
opam install --deps-only slirp
fails with [ERROR] No package named slirp found.
I tried running opam install --deps-only vpnkit
like the README says now, and it fails with [ERROR] The compilation of mirage-conduit failed at "ocamlfind query conduit.mirage".
The log file associated with the install contains this:
# opam-version 1.2.2
# os darwin
# command ocamlfind query conduit.mirage
# path /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0
# compiler 4.04.1
# env-file /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0/mirage-conduit-10012-dc3e96.env
# stdout-file /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0/mirage-conduit-10012-dc3e96.out
# stderr-file /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0/mirage-conduit-10012-dc3e96.err
I'm happy to get you more logs, move this to a separate issue, or just sit tight until your work over in #262 is done. Just let me know, and thank you for your help so far!
@TylerLubeck thanks for your patience. It should work better with #266 (and indeed you should be using vpnkit
instead of slirp
).
@samoht I still see the ocamlfind query conduit.mirage
error after #266. It looks like the mirage-conduit.2.2.0 package in repo/darwin is missing an upper bound on conduit compared to https://github.com/ocaml/opam-repository/blob/master/packages/mirage-conduit/mirage-conduit.2.2.0/opam -- my installed version is conduit 1.0.0 and ocamlfind query conduit.mirage
fails (it succeeds with 0.15.0). I tried to update the dependencies in repo
, but the update script fails with:
The following dependencies couldn't be met:
- vpnkit -> mirage-conduit = 2.2.0 -> mirage-dns < 3.0.0 -> dns < 1.0.0
Your request can't be satisfied:
- Conflicting version constraints for dns
- dns.0.19.0 is in conflict with mirage-types-lwt.2.8.999
- dns.0.19.1 is in conflict with mirage-types-lwt.2.8.999
- dns.0.20.0 is in conflict with mirage-types-lwt.2.8.999
- dns.0.20.1 is in conflict with mirage-types-lwt.2.8.999
Ha I think I found the issue: https://github.com/ocaml/opam-repository/commit/dba6c74df4a9f66ce31a383b23e3cd25fcb2365b added an upper bound on dns. Fixing that now.
See https://github.com/moby/vpnkit/pull/267 for a fix. Don't forget to add a local repo as master is still using a bunch of unreleased packages (hopefully not for long).
Success! I can now build everything. I'll keep poking around to see if I can track this thing down. Thanks for your help!
I've started having more of a look too. I added this debug patch: https://github.com/djs55/vpnkit/commit/41cdadc58cf89067039ad0a6c77a8b4da25be775 which
- logs the connection table every 10s
- times out switch ports after 5s
I've let it time out an ssh connection to the host and the socket is still open after 1.5hrs so far (which seems bad). Out of curiosity I'm going to leave it overnight to see if the connection is still open in the morning.
That's great, thank you!
Another question for y'all - how can I get these custom versions running with Docker for Mac so I can reproduce my test cases? I've been digging through documentation and have tried to replace the binaries with my new compiled ones, but no luck with anything so far. Any suggestions?
@TylerLubeck could you try a binary for me? The CI just built this from https://github.com/moby/vpnkit/pull/269. If you download and untar the archive, there should be a vpnkit
binary somewhere inside. If you replace the version in /Applications/Docker.app/Contents/Resources/bin/vpnkit
and restart the Docker app, it should (hopefully) close more connections when timing out switch ports (as you suggested!)
Awesome, thank you. I'll get this distributed out to a few folks I'm working with that are able to reproduce this on a much more regular basis, and we'll monitor for a bit and get you logs as they come up.
Thank you so much for digging in to this!!!
@TylerLubeck have you heard any feedback? I think I'll probably merge the PR in #269 since I think it's good to have anyway, but I'd love to know if it makes a difference to the people suffering from the original problem.
Unfortunately I haven't heard much back yet, but I'll bug some folks to get some information
i'm having this issue as well on rc3
i juste tried the binary
but it didn't helped. I have no problem on rc1
. In my case i have haproxy
and each haproxy check leave connection on CLOSE_WAIT
Hey @djs55, just heard back from some folks and unfortunately this didn't seem to help the problem. That said, we're also not seeing CLOSE_WAIT
anymore so I'm not sure where to go. This also doesn't line up with what @sulphur says above, which is odd. I'm gonna set us up to run this overnight and see if we can get some CLOSE_WAIT
s back.
any news on that ? I'm still hagin the CLOSE_WAIT and FIN_WAIT_2. I've just tested on 17.10.0-ce-mac36. I've moved back to 17.06 for now where lsof -p $(pgrep vpnkit) | wc -l
is around 100 and can stay like that for days on my machine. With the latset version my dev environement (which is quite heavy) jumps to 2000 in like 2 minutes after starting up.
I'm having the same case as @sulphur fwiw. lsof -p $(pgrep vpnkit) | wc -l
simply climbs to whatever limit (quickly).
Some additional info that may or may not help:
-
In my environment, this only seems to happen when Docker containers are running
socket.io
or an HTTP server on Node.js - and only when they run along side a separate HAProxy container. -
The amount of open connections does not reduce even after running
docker stop
anddocker rm
on all of my Docker for Mac containers. Quitting the Docker for Mac daemon is the only way to reduce the number of open connections / finally do away with theCLOSE_WAIT
/FIN_WAIT_2
connections.
I've got some more detail here: https://github.com/docker/for-mac/issues/1374#issuecomment-342229561