dinghy icon indicating copy to clipboard operation
dinghy copied to clipboard

NFS mounting fails when I re-launch dinghy

Open hezhizhen opened this issue 6 years ago • 33 comments

I have been facing the problem these days. I start dinghy and then stop it using dinghy halt, and then when I start it again, the logs will be like as follows:

Starting the dinghy VM...
Starting NFS daemon, this will require sudo
Waiting for NFS daemon...
NFS mounting failed
Starting the FsEvents daemon
Starting DNS and HTTP proxy

The only solution is to restart my laptop. But once I re-launch dinghy, NFS always fails to mount. Is there anything wrong with my settings and workflow? What should I do to fix it?

  • MacOS: 10.13.3
  • dinghy: 4.6.3

hezhizhen avatar Mar 20 '18 03:03 hezhizhen

Same for me, here's some output from /usr/local/Cellar/dinghy/4.6.4/local/var/dinghy-NFS.output

UNFS3 unfsd 0.9.23 (C) 2009, Pascal Schmidt <[email protected]>
bind: Address already in use
Couldn't bind to udp port 19091

I tried through all variants like dinghy halt, dinghy restart or even dinghy destroy. Even after a simple halt, ps auxwww | grep nfs doesn't list a single process. ~~In addition, netstat does not list the port to be in use.~~

dinghy: 4.6.4 OSX: 10.13.4

Edit 2018-06-15: netstat actually lists the port as in use, but the process behind isn't alive anymore. I was unable to find a way to close it, even shutting down the network interface didn't work.

m-bymike avatar Jun 12 '18 21:06 m-bymike

Well, darn. We originally used to select a random port to run NFS on with each start, exactly because of this problem. But I could no longer reproduce the problem back in December and it was causing problems for some people who run stricter firewalls, so I removed that randomization and the port is just hard-coded now. Sounds like it's still a problem for some.

I'm not sure what the best fix is here, it may involve making changes to unfsd to allow address reuse.

codekitchen avatar Jun 22 '18 15:06 codekitchen

@codekitchen Thank you for your update. Do you have an Idea which versions of UNFS3 did close the port successful back in December? It would be a good starting point for further investigations...

m-bymike avatar Jun 25 '18 09:06 m-bymike

I don't believe it has changed in unfs3, it just appears to affect some people more often than others. I never see the problem anymore when doing a restart or halt/start, for example. I don't have any theory as to why, yet.

codekitchen avatar Jun 25 '18 14:06 codekitchen

@codekitchen maybe you need to buy an old macbook...

hezhizhen avatar Jun 26 '18 00:06 hezhizhen

Haha, I don't know what counts as "old" but I use a 2014 Macbook Pro and a 2013 Mac Pro.

Something that just occurred to me, what VM provider are you both using? I've been using xhyve for a couple years now, maybe this is more prevalent with Virtualbox? I'll do some tests.

codekitchen avatar Jun 26 '18 01:06 codekitchen

@codekitchen It seems that my macbook is a newer one (early-2015 MacBook Pro)...

I used docker first and then switched to dinghy, so the VM provider is VirtualBox, whose version is 5.1.6 r110634 (Qt5.5.1).

Maybe you could create a poll to see the effect of VM providers and their versions.

hezhizhen avatar Jun 26 '18 04:06 hezhizhen

Late 2013 13-inch Macbook Pro... But running Virtualbox here...

m-bymike avatar Jun 26 '18 05:06 m-bymike

Funny enough: The problem doesn't show with xhyve. Just tried it, and I'm able to restart dinghy as often as I like...

m-bymike avatar Jun 26 '18 05:06 m-bymike

Dang, I was afraid of that. This may be a Virtualbox bug, then. I'll see if I can dig up any more information.

We could potentially bring back the port randomization as a configurable option, though that kinda sucks having to make people set configuration to work around this.

codekitchen avatar Jun 28 '18 15:06 codekitchen

@codekitchen Short update: out of curiosity I wen't back to Virtualbox. But now I'm able to shutdown/restart the box w/o any nfs port issues. Since it has been a few days, I'm not sure what changed. No update of Virtualbox for sure, but I installed as mentioned xhyve... So I haven't got a Solution or something close, except that now I'm part of the 'works on my machine' group...

m-bymike avatar Jul 02 '18 09:07 m-bymike

I’ve also hit this error on Dinghy 4.6.3, MacOS 10.13.3. Restarting was the only workaround.

alanjc7 avatar Jul 09 '18 12:07 alanjc7

@alanjc7 are you using the Virtualbox backend as well?

codekitchen avatar Jul 09 '18 15:07 codekitchen

This turned out to be an issue with my Firewall config anyway, apologies for the false alarm!

alanjc7 avatar Jul 10 '18 10:07 alanjc7

@alanjc7 I'm encountering this same issue. Can you please say more about what the problem was and how you addressed it?

Ghazgkull avatar Jul 11 '18 19:07 Ghazgkull

I turned on a Firewall which was blocking unfsd connections - removing it allowed me to build without the NFS mounting issues.

alanjc7 avatar Jul 12 '18 10:07 alanjc7

Thanks for the info, @alanjc7. Digging around a bit, this looks like a different issue than the one I and others on my team saw last week when setting up dinghy for the first time. In all our cases, our firewalls were updated to allow unfsd. Using netstat and lsof, we were able to see that port 19091 was being held open by an invalid (dead?) process id. Not knowing how to manually close a port in this state, we each ended up having to reboot our machines to get past the problem.

I wonder if there's something happening during the initial process of creating a new VM (dinghy create --provider virtualbox) which is leaving a dead process holding this port?

Ghazgkull avatar Jul 21 '18 17:07 Ghazgkull

Unfortunately, same issue for me:

bind: Address already in use
Couldn't bind to udp port 19091

Also, my firewall is allowing unfsd. As @Ghazgkull said, the port 19091 is opened by an invalid process id...

k0pernicus avatar Mar 02 '19 12:03 k0pernicus

Same for me!

mvgijssel avatar Mar 28 '19 13:03 mvgijssel

@k0pernicus and @mvgijssel are you both using the Virtualbox back-end as well?

codekitchen avatar Mar 30 '19 16:03 codekitchen

Yes!

mvgijssel avatar Apr 15 '19 14:04 mvgijssel

I used the xhyve back-end. For information, I get back to a GNU/Linux system since a few days... So, unfortunately, I can't test any solution anymore to solve this issue - sorry in advance.

k0pernicus avatar Apr 15 '19 17:04 k0pernicus

@codekitchen I'm using xhyve and getting the same issue as @k0pernicus

Firewall is accepting unfsd

dinghy-NFS.output is telling me

bind: Address already in use
Couldn't bind to udp port 19091

I use $ netstat -vp udp | grep 19091 to get the pid but when I try $ kill 11615 the result is

kill: (11615) - No such process

Haven't found a solution yet.

v-stickykeys avatar Apr 27 '19 19:04 v-stickykeys

Darn. I've tried many times, but I can't reproduce it myself, so I'll need some help debugging the issue.

As I understand it, if the UDP port isn't being freed after the process exits, that means the process is a zombie that hasn't been fully cleaned up, usually either because it has a child process still running, or because the parent process failed to clean up after it.

Because we daemonize unfsd, the parent process is launchd (PID 1), so cleaning up after it should always work. On the other hand, I don't think that unfsd ever spawns child processes, so that rules out both possibilities and it seems likely that I'm missing something crucial. 😬

The next time this happens, can you use that netstat command to find the defunct PID, and then look for any information on the parent or children processes? I think ps -axf | grep <PID> should give relevant info, including parent PID, or you could look in Activity Monitor.

codekitchen avatar Apr 29 '19 16:04 codekitchen

Hi @codekitchen

I got this issue again.

I used this netstat command to get the PID, thanks to this stackoverflow post:

netstat -vanp udp | 
  awk 'BEGIN { l=0 }
       { if (l == 0) { print $0 }
         else if (l == 1) { print $0, "process" }
         else { "ps -o comm " $9 "| tail -1" | getline line;
                print $0, line }
         l++; }'

Unfortunately, I can't retrieve the associated relevant informations using ps... I looked also in Activity Monitor, and still no process :/

k0pernicus avatar Aug 15 '19 17:08 k0pernicus

This might not help but I've been having this issue too, usually happens after waking up my computer from sleeping. Rebooting doesn't fix it, the only thing that does is completely destroying and recreating the image, which as you imagine is not ideal.

My specs: 2017 MacBook Pro running macOS 10.14.3 (18D42)

jordanranson avatar Sep 12 '19 19:09 jordanranson

@jordanranson Can you confirm any errors that appear in your /usr/local/Cellar/dinghy/*/local/var/dinghy-NFS.output? I can't imagine how a bind: Address already in use error could persist across a reboot and require actually destroying and re-creating the image, so I'm wondering if you are seeing the same symptoms but a different underlying cause.

codekitchen avatar Sep 12 '19 22:09 codekitchen

I'm seeing this issue as well, with Mojave 10.4.6 on a 2018 15inch Macbook pro, running dinghy with xhyve driver. This is the first time I'm seeing it, occurred after the macbook crashed with a kernel panic (but that's a separate story). Here is a step by step of what happened, in the hopes this will be helpful for debugging.

Trying to restart the VM:

$ dinghy restart                                                                                                                                             
The VM 'dinghy' does not exist in docker-machine.
Run `dinghy create` to create the VM, `dinghy help create` to see available options.

Okay, try to create it again:

$ dinghy create                                                                                                                                              
Creating the dinghy VM...
Docker machine "dinghy" already exists
/usr/local/Cellar/dinghy/4.6.5/cli/dinghy/system.rb:8:in `system_print': Failure calling `docker-machine create -d xhyve --xhyve-memory-size 2048 --xhyve-cpu-count 4 --xhyve-disk-size 20000 dinghy` (System::Failure)
	from /usr/local/Cellar/dinghy/4.6.5/cli/dinghy/machine.rb:130:in `system_print'
	from /usr/local/Cellar/dinghy/4.6.5/cli/dinghy/machine.rb:16:in `create'
	from /usr/local/Cellar/dinghy/4.6.5/cli/cli.rb:65:in `create'
	from /usr/local/Cellar/dinghy/4.6.5/cli/thor/lib/thor/command.rb:27:in `run'
	from /usr/local/Cellar/dinghy/4.6.5/cli/thor/lib/thor/invocation.rb:126:in `invoke_command'
	from /usr/local/Cellar/dinghy/4.6.5/cli/thor/lib/thor.rb:359:in `dispatch'
	from /usr/local/Cellar/dinghy/4.6.5/cli/thor/lib/thor/base.rb:440:in `start'
	from /usr/local/bin/_dinghy_command:12:in `<main>'

Okay I'll try to remove it then:

$ docker-machine rm dinghy
About to remove dinghy
WARNING: This action will delete both local reference and remote instance.
Are you sure? (y/n): y
Error removing host "dinghy": Unable to find 'xhyve' process by PID: 2729

Here s what I see in my dinghy-NFS.output:

=== Starting NFS at 2019-10-02T16:01:24-04:00 ===

UNFS3 unfsd 0.9.23 (C) 2009, Pascal Schmidt <[email protected]>
/Users/masha: ip 192.168.64.3 mask 255.255.255.255 options 6
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
unable to send RPC reply
...

So I cannot even get a udp port number.

Let me know if this is helpful and if you have ideas. I'll be trying to unstick this but will probably need to re-create the image soon. I'll update if I learn more.

mashalifshin avatar Oct 09 '19 20:10 mashalifshin

Actually that wasn't as bad as I thought it would be. After a restart I was able to successfully run a docker-machine rm dinghy. Hope the debugging info above can help.

mashalifshin avatar Oct 09 '19 20:10 mashalifshin

@mashalifshin thanks for the logs. The unable to send RPC reply lines actually imply that UNFS was running, that's an unrelated warning that it likes to spit out sometimes while syncing files but doesn't seem to affect its working. It's possible that was from an earlier run, but I don't see any bind errors like the previous reports.

The fact that dinghy restart was saying the VM didn't exist when it still did is interesting, the output of docker-machine status dinghy might've indicated a bug in dinghy, we just check if that command succeeds to decide if the machine exists. Overall it sounds like something went wrong with docker-machine-driver-xhyve, they might be interested in those logs especially Error removing host "dinghy": Unable to find 'xhyve' process by PID: 2729. I'm glad a restart cleared it up.

codekitchen avatar Oct 10 '19 16:10 codekitchen