playground
playground copied to clipboard
Can't complete `docker_compose` quickstart guide
Greetings,
I am trying to "play" with Tinkerbell to provision bare metal servers with OSes (e.g with Ubuntu Focal). I am following https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md and I haven't been able to finish the provisioning steps -I've re-tried the steps 5 times in a row, and saw 2 different outcomes-. On my first 3 tries, Boots can't recognize the DHCP request and logs the info written below. On my 4th try, it picked it up, but the workflow's action state stayed/stuck in STATE_PENDING
state -I've left it like that for 2 hours, and I think that is long enough time for it to at least start working-. Then I tried it one more time, and it didn't get pick up by Boots like the first 3 tries.
Any recommendation/tips are welcome. If you have more known ways of making Tinkerbell work -a bare metal server provisioning abother one with an OS-, I am also willing to give a try to them.
Expected Behaviour
I am expecting to see similar outcome for the steps described in https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md
Current Behaviour
Either Boots doesn't pick up the machine, or it gets picked up but the workflow stays in 0%, STATE_ENDING
stage.
- Boots doesn't recognize the DHCP request. For
# echo $TINKERBELL_CLIENT_MAC e4:43:4b:3d:75:b8
, I encountered the following output in Boots logs and machine doesn't picked up by Tinkerbell stack;
boots_1 | {"level":"info","ts":1649949865.6316814,"caller":"[email protected]/handler.go:105","msg":"","service":"github.com/ti
nkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"e4:43:4b:3d:75:b8","via":"0.0.0.0","iface":"eno1","xid":"\"4b:3d:75:b8\"","type":"DHCPDISCOVER","secs":28} boots_1 | {"level":"info","ts":1649949865.6318014,"caller":"boots/dhcp.go:78","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","
pkg":"main","mac":"e4:43:4b:3d:75:b8","circuitID":""} boots_1 | {"level":"info","ts":1649949865.6336043,"caller":"boots/dhcp.go:91","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg
":"main","type":"DHCPDISCOVER","mac":"e4:43:4b:3d:75:b8","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = SELECT: sql: no rows in result set","errVerbose":"rpc error: code = Unknown desc = SELECT: sql: no rows in result set\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*client).DiscoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/packet/endpoints.go:108\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:17\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/github/go/pkg/mod/github.com/golang/[email protected]/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/fetch.go:19\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/opt/actions-runner/_work/boots/boots/job/job.go:106\nmain.dhcpHandler.serveDHCP\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:89\nmain.dhcpHandler.ServeDHCP.func1\n\t/opt/actions-runner/_work/boots/boots/cmd/boots/dhcp.go:50\ngithub.com/gammazero/workerpool.startWorker\n\t/home/github/go/pkg/mod/github.com/gammazero/[email protected]/workerpool.go:218\nruntime.goexit\n\t/opt/actions-runner/_work/_tool/go/1.16.3/x64/src/runtime/asm_amd64.s:1371\ndiscover from dhcp message"}
- Workflow doesn't progress. On my 4th try, the machine got picked up by Boots, but then the workflow got "stuck", and it was visible during
Step 6
of the linked guide above;
Every 1.0s: tink workflow events c263defc-c0b1-11ec-9ab9-0242ac130006; tink workflow state c263defc-c0b1-11ec-9ab9-0242ac130006
+-----------+-----------+-------------+----------------+---------+---------------+
| WORKER ID | TASK NAME | ACTION NAME | EXECUTION TIME | MESSAGE | ACTION STATUS |
+-----------+-----------+-------------+----------------+---------+---------------+
+-----------+-----------+-------------+----------------+---------+---------------+
+----------------------+--------------------------------------+
| FIELD NAME | VALUES |
+----------------------+--------------------------------------+
| Workflow ID | c263defc-c0b1-11ec-9ab9-0242ac130006 |
| Workflow Progress | 0% |
| Current Task | |
| Current Action | |
| Current Worker | |
| Current Action State | STATE_PENDING |
+----------------------+--------------------------------------+
On the KVM screen, this screenshot was visible -on the 4th run when Boots was able to pick up the server's request-, and during that 2 hours it didn't change;
Possible Solution
N/A
Steps to Reproduce (for bugs)
- Follow the https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md guide.
- Process gets stuck on Step 6, either Boots can't recognize the DHCP request being sent by the defined MAC address in Step 3 or it gets picked up, but the workflow doesn't progress and provision the machine.
Context
Your Environment
- Operating System and version (e.g. Linux, Windows, MacOS):
Ubuntu 20.04.4 LTS
- How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
Bare metal provisioner, trying to provision another bare metal server with docker-compose
method outlined in https://github.com/tinkerbell/sandbox/blob/main/docs/quickstarts/COMPOSE.md
- Link to your project or a code example to reproduce issue:
N/A
I can confirm that I found similar behavior, I'm running on Ubuntu 21.10 , docker 20.10.14 with docker-compose 2.4.1. In my testing Boots picked up each request though but the workflow never starts after booting in the LinuxKit environment. I verified that the hardware config, template and workflow are in place with the correct details. I found the following error in the docker registry log:
compose-web-assets-server-1 | 10.180.5.155 - - [21/Apr/2022:09:47:29 +0000] "GET /workflow/ca.pem HTTP/1.1" 200 623 "-" "Go-http-client/1.1" "-"
compose-registry-1 | 10.180.5.87 - - [21/Apr/2022:09:47:30 +0000] "GET / HTTP/1.1" 200 0 "" "Wget"
compose-registry-1 | 2022/04/21 09:47:32 http: TLS handshake error from 10.180.5.155:41030: remote error: tls: bad certificate
Hi @digi-talo and @varilci, did you tried the Vagrand and VirtualBox Tutorials too? I have the exact same issues there :D And could not figure it out why... Regarding to https://github.com/tinkerbell/boots/issues/125 this should be fixed.
Edited: This behaviour could be replicated with the Vagrand and libvirt Quickstart too.
@varilci at that kvm console, can you check the tink-worker logs please:
(press enter, there IS a prompt "hidden" with the kernel message along IPVS)
# ctr -n services.linuxkit task exec -t --exec-id testing docker sh
# cd /var/lib/docker/containers/*
# less *json
Is there an X509 unable to verify message?
@double-p I took the liberty to check myself but found a different error:
yeah, that's the one. What I found so far, that this is likely related to https://github.com/tinkerbell/tink/pull/584
basically the whole thing wont work with "unofficial"/selfsigned certs any longer.
In sandbox
I did some dial backs, but I think using tink-worker:sha-453f0fda should already do the trick.
I gave it a shot with tink-worker:sha-453f0fda
but that doesn't seem to take me any further :disappointed:
Ok, the "full" change to make my sandbox-vagrant-vbox setup going is with these hashes and a change on the Boots cmdline, since they changed somewhen around that. Might be more than needed, but "it works for me". https://gist.github.com/double-p/ea7597da76956fac6f90251ad2e2f175
Hi @double-p, your setup for VBox and Vagrant works fine 👍 Only provisioning with libvirt does not work... Trying libvirt the same error as @varilci mentioned occur. Very strange behavior... But thanks a lot for you help.
Edit:
The only problem is now, if you try to reload the machine via vagrant reload machine1
it stucks at Initializing XFRM netlink socket
again. And ssh via ssh [email protected]
does not work too.
There's a second reason this can hang (just when hook tries to download tink-worker) and that's along "ca.pem". Remove that after failed tries before starting the next up/provision of the "provisioner".
ssh from the provisioner will not work before you also see a normal getty/login screen on the machine1 GUI, too
@double-p thanks for all the help. I managed the provision a VM once with the docker-compose setup but I can't seem to get this to work reliably.
It just hangs at:
While I notice the following in the docker logs:
web-assets-server_1 | 10.180.5.125 - - [02/May/2022:09:13:19 +0000] "GET /workflow/ca.pem HTTP/1.1" 200 623 "-" "Go-http-client/1.1" "-"
registry_1 | time="2022-05-02T09:13:22.311102921Z" level=warning msg="error authorizing context: basic authentication challenge for realm "Registry Realm": invalid authorization credential" go.version=go1.11.2 http.request.host=10.180.5.87 http.request.id=8a452cdb-edfe-455c-bd6f-4acff898c547 http.request.method=GET http.request.remoteaddr="10.180.5.125:48606" http.request.uri="/v2/" http.request.useragent="docker/19.03.8 go/go1.12.17 git-commit/afacb8b7f0 kernel/5.10.57-linuxkit os/linux arch/amd64 UpstreamClient(Go-http-client/1.1)
You mention:
There's a second reason this can hang (just when hook tries to download tink-worker) and that's along "ca.pem". Remove that after failed tries before starting the next up/provision of the "provisioner".
Where do you exactly remove the ca.pem
?
Hmm, can't tell why/where the 2nd attempt hangs at. Check logs of tink-docker again.
ca.pem is in deploy/compose/state/webroot/workflow
after the initial provisioning - only delete it for a fresh start, not a subsequent deployment
Not sure what config file this refers to?
That's an OPTIONAL config (go viper) of tink-worker, since all that (as I understand) via ENV and cmdline-ops, it's not needed and not really an error. Basically the worker is started (you should see something about that in tink-server logs) and waits to get a workflow assigned and work it.
Basically, no more certificate errors and also no networking issues (there would be a connection error or timeout after "starting" in that log).
so.. check tink-server and 'tink workflow get' and so on
No idea what changed but I tried it again today on a clean environment and I'm stuck again with the bad certificate. The changes of @double-p no longer seem to do the trick :disappointed:
the sha-references are immutable so that's very weird. cannot try before weekend since @ kubecon
Yeah you are right, I found that you'll still get the bad cert error if the systems clock deviates too much of the actual time.
Ah.. that was "fixed" in a newer version -- which just introduced other problems.
You can also adapt that in deploy/compose/generate-tls-certs/ca-{config,csr}.json
if a clock fix is difficult