v0.56.0 [Err: 4] secret key mismatch on sup restart in studio
I can't replicate this 100% of the time; however, I have run into it on a couple of occasions.
Ubuntu 16.04, Linux 4.4.0-124, chroot studio
I experienced this in the builder dev environment. I entered a fresh chroot-based studio. On loading in, I ran hab pkg install results/<origin>-<packagename>.hart, which installed successfully. However, for some reason this caused the supervisor to exit. I restarted the supervisor and attempted to get its status before continuing, but was met with the referenced error.
* Install of habitat/builder-minio/2018-05-11T00-2924Z/20180605190642 complete with 2 new packages installed.
[1]+ Done hab sup run "$@" > /hab/sup/default/sup.log 2>&1 (wd: /)
(wd now: /src)
[2][default:/src:0]# sup-run
[3][default:/src:0]# hab sup status
✗✗✗
✗✗✗ [Err: 4] secret key mismatch
✗✗✗
[4][default:/src:0]# hab svc status
✗✗✗
✗✗✗ [Err: 4] secret key mismatch
✗✗✗
At this point I attempted to term the supervisor and was met with another error:
[5][default:/src:0]# hab sup term
hab-sup(MR)[components/sup/src/manager/mod.rs:314:64]: Failed to send a signal to the child process
I haven't had a chance to dig into this yet, but when I experienced this yesterday there were definitely child processes that stayed running and had to be kill -9'd from the host.
When attempting to exit the studio, I see further errors:
[6][default:/src:0]# exit
logout
kill: can't kill pid 6171: No such process
Warning: '/hab/pkgs/core/hab-studio/0.56.0/20180530235913/libexec/busybox kill 6171' failed with status 1
Checking ps auxf from the host afterwards shows the launcher, the supervisor, and a process I had attempted to start inside the studio still running on the host. These also had to be kill -9'd.
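For anyone cleaning up after this, here is a minimal sketch for finding and force-killing the leftover processes from the host (it assumes the process names hab-launch and hab-sup, which match the output in this thread):
# List leftover Habitat processes still running on the host
sudo pgrep -af 'hab-launch|hab-sup'
# Force-kill them if they won't terminate normally
sudo pkill -9 -f 'hab-launch|hab-sup'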
It looks like the issue here is that communication with the supervisor now requires a shared secret. By default, this lives in /hab/sup/default/CTL_SECRET (see https://www.habitat.sh/blog/2018/05/changes-in-the-0.56.0-supervisor/).
However, what can be confusing is that when run inside a studio, there can be a separate CTL_SECRET for each studio instance. For example, in the repro I was looking at there was an instance here: /hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/CTL_SECRET.
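To see every secret present on a host (the default plus one per studio), something like the following should enumerate them; this is a sketch along the same lines as the LOCK search below, and the exact studio paths will vary:
# Find the default secret plus one per studio instance
sudo find /hab -path '*/sup/default/CTL_SECRET'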
I was able to determine which supervisor was running by looking for the lock file:
hab$ sudo find /hab/ -path "*/sup/default/LOCK"
/hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/LOCK
We can confirm it's still running:
hab$ ps -p $(cat /hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/LOCK)
PID TTY TIME CMD
20284 pts/0 00:00:02 hab-launch
This tells me that the running supervisor was started from a studio rooted at ~/habitat_jumpstart/national-parks/habitat, which is probably an error, as the studio root should have been ~/habitat_jumpstart/national-parks.
If we run any supervisor commands anywhere other than inside the ~/habitat_jumpstart/national-parks/habitat studio, we'll get the error:
~
hab$ hab sup status
✗✗✗
✗✗✗ [Err: 4] secret key mismatch
✗✗✗
But if we supply the supervisor secret from the ~/habitat_jumpstart/national-parks/habitat studio, it works:
hab$ HAB_CTL_SECRET=$(sudo cat /hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/CTL_SECRET) hab sup status
No services loaded.
I'll look at improving the error message and documentation.
Unless I'm missing something, which is totally possible, I don't think this is a documentation bug. Studios on Linux have always guaranteed that you can run multiple studios rooted in any directories, including nested directory structures only a single level apart. E.g., this behavior worked before:
- cd /foo/bar
- hab studio enter
- hab svc status
- exit
- cd /foo
- hab studio enter
- hab svc status
Without hab studio rm-ing anything, you should be able to enter and manipulate both studios. In fact, you should be able to exit one and jump back up to the other.
Is what you're seeing other studios getting mounted nested inside each other? Or is it that when you exit one of those studios, the supervisor process doesn't get terminated correctly?
I can confirm @eeyun's comment that the steps he listed definitely worked before 0.56. From a user's point of view, no one should care about shared secrets or anything else. It should Just Work™.
We probably want some sort of escape mechanism for the Supervisor while in the Studio that disables the secret comparison in the connection handshake. It's really not necessary to authenticate you, since it's a contained dev environment. I think that'd resolve this one.
This works for me on 0.56.0:
08:08:13 AM jbauman@ubuntu:~
➤ cd foo/bar/
08:08:16 AM jbauman@ubuntu:~/f/bar
➤ hab studio enter
…
[1][default:/src:0]# hab svc status
No services loaded.
[2][default:/src:0]# exit
logout
08:08:31 AM jbauman@ubuntu:~/f/bar
hab studio enter ran for 6723 ms
➤ cd ~/foo
08:08:45 AM jbauman@ubuntu:~/foo
…
[1][default:/src:0]# hab svc status
No services loaded.
[3][default:/src:0]# exit
logout
08:10:36 AM jbauman@ubuntu:~/foo
➤ hab -V
hab 0.56.0/20180530234036
@eeyun: is there anything more to do with this issue? Please re-open if so.
If the original bug of studios getting confused about having the appropriate CTL_SECRET still exists, then this is still an issue. There shouldn't be a circumstance where our users can end up in a state where the studio is unusable. If the supervisor in the studio dies and can't be restarted because a different studio's secret exists in the filesystem, then that is a bug.
Ah, ok. I think I've got it now. I had the repro wrong. Here's how I can get the behavior that I think you're referring to:
➤ mkdir /tmp/{foo,bar}
➤ hab studio -r /tmp/foo/ enter
…
[1][default:/src:0]# hab sup status
No services loaded.
Then, in another shell (without exiting the studio rooted at /tmp/foo/):
➤ hab studio -r /tmp/bar/ enter
…
[1][default:/src:0]# hab sup status
✗✗✗
✗✗✗ [Err: 4] secret key mismatch
✗✗✗
[1]+ Done hab sup run "$@" > /hab/sup/default/sup.log 2>&1 (wd: /)
(wd now: /src)
Is that right? Thanks for following up, @eeyun.
I think I've gotten to the bottom of this one, finally.
If you enter a studio when there's already a supervisor running outside that studio but on the same host, the second supervisor crashes (note the [1]+ Done hab sup run "$@" > /hab/sup/default/sup.log 2>&1 (wd: /) output above) with a bind failure:
hab-sup(ER)[components/sup/src/error.rs:450:9]: Butterfly error: Cannot bind to port: Os { code: 98, kind: AddrInUse, message: "Address already in use" }
Then, with no supervisor running inside the studio, any supervisor commands such as hab sup status will attempt to connect to whatever supervisor is running: hence the key mismatch.
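A quick way to check for this condition before entering a studio is to look for a launcher already running on the host, or a supervisor port that's already bound. This is only a sketch; the 9638 gossip port is an assumption based on the Supervisor's defaults, so adjust it if you've overridden the listen addresses:
# Look for a launcher/supervisor already running on the host
pgrep -af 'hab-launch|hab-sup'
# Check whether the gossip port is already bound (assumes the default, 9638)
sudo ss -lntup | grep 9638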
We need to do two things here:
- Communicate to the user when the supervisor fails to start
- Find a way to allow multiple supervisors to run on the same host conveniently or at least make it clear why the supervisor can't run and what to do about it
(1) would just be a matter of checking the exit code of hab sup run (called by the sup-run script helper, which the studio executes by default on enter), but that exit code is determined by the launcher, which exits 0 even if the supervisor exits ERR_NO_RETRY_EXCODE (86). I have a fix for that working and should have a PR up shortly.
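Until that fix lands, a workaround sketch for spotting the failed start yourself (not the actual sup-run implementation; the log path and error string are taken from the output earlier in this thread) is to inspect the supervisor log right after launching it:
# Start the supervisor in the background, roughly as sup-run does
hab sup run > /hab/sup/default/sup.log 2>&1 &
sleep 2
# The launcher exits 0 even when the supervisor dies, so inspect the log instead
if grep -q 'Address already in use' /hab/sup/default/sup.log; then
  echo 'Supervisor failed to start: its ports are already bound by another supervisor' >&2
fi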
(2) Requires a bit of discussion about how we want the user experience to work. I'll file a separate issue for that and link back here.
Yeah, we need to do a bit of UX thinking about this. @ryankeairns and @fnichol will likely have some good input.
Just let me know the time and place :)
Hello, I am getting stuck because of the above issue and even logged a ticket here: https://forums.habitat.sh/t/err-4-secret-key-mismatch/1210
Can someone please help me figure out how to get rid of this? I am using hab 0.83.0/20190712231714.
I'm also seeing this. Could the use of the --reuse flag when building be a possible culprit?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.