Failure to report reinstall by metal-hammer results in fresh install with wiped disks
Apparently it can sometimes happen that metal-hammer's attempt to report a successful reinstall of a machine fails with an error code 500. In one such instance the machine rebooted, and when metal-hammer next contacted the api it was told to do a fresh install, which wiped all local disks.
This is obviously undesired behaviour; how can this issue be fixed?
Error messages from the metal-hammer IPMI console log when the report failed:
POST /machine/report/eff49e00-6ff4-11e9-8000-efbeaddeefbe HTTP/1.1
Host: 10.255.255.5:4242
User-Agent: Go-http-client/1.1
Content-Length: 416
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip
{"bootloaderid":"metal-debian","cmdline":"console=ttyS1,115200n8 root=UUID=3cb7a6b6-65e5-4ff7-8dfd-e00bb5abcf1b init=/bin/systemd net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 systemd.unified_cgroup_hierarchy=0","console_password":"hbPMo74i1dSQXv/F","initrd":"/tmp/initrd.img-5.10.0-6-amd64","kernel":"/tmp/vmlinuz-5.10.0-6-amd64","message":null,"os_partition":null,"primary_disk":null,"success":true}
DBUG[08-04|12:48:46] lldp detectedNeighbor="Name:nbg-w8101-r01leaf01 Desc:Cumulus Linux version 3.7.13 running on Accton AS7712-32X Chassis:Mac:80:a2:35:e1:39:9c Port:Mac:80:a2:35:e1:39:a0" caller=lldpclient.go:71
DBUG[08-04|12:48:46] lldp detectedNeighbor="Name:nbg-w8101-r01leaf02 Desc:Cumulus Linux version 3.7.13 running on Accton AS7712-32X Chassis:Mac:80:a2:35:32:9d:bb Port:Mac:80:a2:35:32:9d:bf" caller=lldpclient.go:71
HTTP/1.1 500 Internal Server Error
Content-Length: 0
Date: Wed, 04 Aug 2021 12:48:48 GMT
EROR[08-04|12:48:48] report error="[POST /machine/report/{id}][500] reportInternalServerError " caller=report.go:42
EROR[08-04|12:48:48] reinstall failed error="unable to report image installation: [POST /machine/report/{id}][500] reportInternalServerError " caller=root.go:140
EROR[08-04|12:48:48] reinstall cancelled => boot into existing OS... reason="unable to report image installation: [POST /machine/report/{id}][500] reportInternalServerError " caller=reinstall.go:25
POST /machine/abort-reinstall/eff49e00-6ff4-11e9-8000-efbeaddeefbe HTTP/1.1
Host: 10.255.255.5:4242
User-Agent: Go-http-client/1.1
Content-Length: 28
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip
{"primary_disk_wiped":true}
HTTP/1.1 500 Internal Server Error
Content-Length: 27
Content-Type: application/json
Complete log files of the failed reinstall followed by the fresh install, and for comparison a successful reinstall, are attached:
successful-reinstall.txt failed-reinstall-and-subsequent-install-with-wipe.txt
Here are related metal-api error logs:
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","status":403,"content-length":55,"duration":0.000260831}
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:100","message":"cannot get user from request","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","error":"Wrong HMAC found","got":"667c82a5692d7e1013a88b022af0f367e3d349860818ff8714b685bed3ae91e3","want":"4a2e464619017e291fd0113c5a884b89d825a5e8ea8fbd6b1a83eb5f1eab86c8"}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"[email protected]","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","status":422,"content-length":137,"duration":0.087437639}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"service/service.go:96","message":"service error","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"[email protected]","operation":"finalizeAllocation","status":422,"error":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731","service-caller":"machine-service.go:1635","resp":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
And these are from metal-core:
2021-08-04T12:48:48.301245+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011endpoint/abortReinstall.go:38#011Failed to abort reinstall#011{"app": "metal-core", "statusCode": 500, "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "boot information": null}
2021-08-04T12:48:48.299985+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011api/abortReinstall.go:22#011Failed to abort reinstall#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "error": "[POST /v1/machine/{id}/abort-reinstall][403] abortReinstallMachine default Wrong HMAC found (403)"}
2021-08-04T12:48:48.279567+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.276Z#011error#011endpoint/report.go:57#011Unable to report machine back to api.#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
2021-08-04T12:48:48.278363+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.275Z#011error#011api/finalizeAllocation.go:27#011Finalize failed#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
I tried to reconstruct what happened, here is what I found:
- Everything went according to plan: the primary disk was wiped and the OS was reinstalled
- Then metal-hammer calls metal-core on `/report`, which returns an error
- metal-core sets the boot order to HD (but it did not work?)
- metal-core calls the metal-api on `/finalize-allocation`
- The metal-api sets the `reinstall` field in the machine to `false`
- This function then fails when updating the switch entity (probably "was changed by another", the original error message was swallowed)
- metal-hammer calls `/abort-reinstall`, which returns an error as well
- This time the request was declined immediately by the metal-api because the HMAC was incorrect (could this be caused by system time lags? see the sketch below)
- When this fails, metal-hammer runs `kexec.Reboot`, restarting the machine
- metal-hammer reboots, now finds the `reinstall` flag on `false`, and thinks it's a usual released machine...
This certainly is a cluster of very unhappy events.
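The HMAC guess can at least be illustrated. Assuming the shared-secret auth between metal-core and metal-api salts the HMAC with a timestamp and only accepts a small clock-skew window (an assumption about the mechanism, not a quote of the actual metal-lib code), a minimal Go sketch shows how a lagging clock turns an otherwise correct request into exactly the "Wrong HMAC found" 403 from the logs:

```go
// Hypothetical sketch of a timestamp-salted HMAC check; not the actual
// metal-lib/metal-api implementation, just the failure mode it could exhibit.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// sign computes an HMAC over the request body, salted with a unix timestamp.
func sign(key, body []byte, ts time.Time) string {
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%d", ts.Unix())
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify accepts the signature only if it matches for some timestamp
// within +/- skew of the server clock.
func verify(key, body []byte, got string, now time.Time, skew time.Duration) bool {
	for d := -skew; d <= skew; d += time.Second {
		if hmac.Equal([]byte(got), []byte(sign(key, body, now.Add(d)))) {
			return true
		}
	}
	return false
}

func main() {
	key := []byte("shared-edit-key") // hypothetical shared secret
	body := []byte(`{"primary_disk_wiped":true}`)

	// Client clock lags 30s behind the server; with a 15s window the request is
	// rejected even though key and body are correct.
	sig := sign(key, body, time.Now().Add(-30*time.Second))
	fmt.Println("accepted:", verify(key, body, sig, time.Now(), 15*time.Second)) // accepted: false
}
```

If clock skew really is the cause here, keeping time on the leaf switches in sync (or retrying the call, which produces a fresh timestamp) would already reduce the blast radius, but it would not fix the chain itself.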
My question would be: why does metal-hammer wipe disks when there is an allocation on the machine? I think this should be the first thing to check before starting to wipe disks. Maybe this condition is too specific: https://github.com/metal-stack/metal-hammer/blob/v0.9.1/cmd/root.go#L129.
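To make that concrete, this is the kind of guard I have in mind before any disk is touched; the reduced Machine/Allocation types and the wipeAllowed helper are illustrations only, not the existing check in root.go:

```go
// Hypothetical guard sketch for metal-hammer; the types are reduced stand-ins
// for the metal-api machine model, not the real structs.
package main

import (
	"errors"
	"fmt"
)

// Allocation is a stand-in for the machine allocation returned by the metal-api.
type Allocation struct {
	Reinstall bool // true only when an operator explicitly requested a reinstall
}

// Machine is a stand-in for the machine entity metal-hammer works with.
type Machine struct {
	ID         string
	Allocation *Allocation
}

// wipeAllowed returns nil only when wiping cannot destroy a live allocation:
// either the machine is unallocated, or a reinstall was explicitly requested.
func wipeAllowed(m *Machine) error {
	if m.Allocation == nil {
		return nil // free machine, a fresh install is expected
	}
	if m.Allocation.Reinstall {
		return nil // operator asked for a reinstall, wiping is expected
	}
	return errors.New("machine is allocated and no reinstall was requested, refusing to wipe disks")
}

func main() {
	// The state after the failed report: still allocated, but reinstall already reset to false.
	m := &Machine{
		ID:         "eff49e00-6ff4-11e9-8000-efbeaddeefbe",
		Allocation: &Allocation{Reinstall: false},
	}
	if err := wipeAllowed(m); err != nil {
		fmt.Println("boot into existing OS instead:", err)
		return
	}
	fmt.Println("ok to wipe", m.ID)
}
```

With such a default-deny check, the worst case of the chain above would be a machine that boots back into its existing OS instead of one that loses its disks.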
I'll investigate a little more to see what we can do.