balena-supervisor
Improve logging when device bootstrap fails
In the following journal logs the Supervisor doesn't provide any info about why the device bootstrap process failed:
Sep 06 12:14:28 09e71b1 resin-supervisor[2254]: [info] New device detected. Provisioning...
Sep 06 12:13:58 09e71b1 resin-supervisor[2254]: [event] Event: Device bootstrap failed, retrying {"delay":30000,"error":{"message":""}}
One case that can be tested is bootstrapping a device with a deviceType that the API does not know. This should cause an error to be thrown, and we should test whether it is logged correctly (from experience, it isn't).
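A minimal sketch of such a test follows. `fakeRegister`, `bootstrapOnce`, and `LogEvent` are hypothetical stand-ins written for illustration and are not the Supervisor's actual code. The list of known device types is also an assumption. The point is only that the logged event should carry the API's error message rather than an empty string.

```typescript
// Shape of a logged event, modeled on the journal line quoted above (assumed).
type LogEvent = {
  name: string;
  payload: { delay: number; error: { message: string } };
};

// Hypothetical stand-in for the API registration call:
// rejects any device type that is not in the `known` list.
async function fakeRegister(
  deviceType: string,
  known: string[],
): Promise<{ id: number }> {
  if (known.indexOf(deviceType) === -1) {
    throw new Error(`Unknown device type ${deviceType}`);
  }
  return { id: 1 };
}

// Hypothetical single bootstrap attempt: on failure, record an event
// that includes the error message so the cause is visible in the logs.
async function bootstrapOnce(
  deviceType: string,
  events: LogEvent[],
): Promise<boolean> {
  try {
    await fakeRegister(deviceType, ['fincm3', 'raspberrypi4-64']);
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    events.push({
      name: 'Device bootstrap failed, retrying',
      payload: { delay: 30000, error: { message } },
    });
    return false;
  }
}
```

A test would call `bootstrapOnce` with an unrecognized device type and assert that exactly one event was recorded and that its `error.message` is non-empty.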
[cywang117] This issue has attached support thread https://jel.ly.fish/569def9b-1dc7-4cdc-9bda-bc8ec5bed3f3
These logs should also surface in the dashboard, not just in the journal.
From the linked JF ticket above, the user has a custom board similar to the fincm3 that they're testing out balena OS with. They are doing the following:
- Build the image with:
docker run --rm -it --privileged -v <repo-path>:/home/build aggurio/docker-balena ./balena-yocto-scripts/build/barys -d --machine CUSTOM_DEVICE_TYPE --rm-work
- Configure the image with:
balena os configure <image> --app <balena application> --type fincm3 --debug --version 2.72.0 --config-network ethernet
- Flash the image with:
dd if=<image> of=/dev/sdb bs=4M conv=fsync, balena os initialize <image> --type fincm3 --drive /dev/sdb, or Etcher.
The Supervisor will then log the error. The possible cause is as mentioned above: the API does not recognize the custom DT that is reported during provisioning, even though the deviceType set in the OS is fincm3.
The user investigated further and found that config.json is set to the custom DT after the failed provisioning attempt; however, if they set device-type.json to fincm3, the device provisions successfully.
Update: I've tested the following fields related to device provisioning:
try {
	device = await Bluebird.resolve(
		deviceRegister.register({
			applicationId: opts.applicationId,
			uuid: opts.uuid,
			deviceType: opts.deviceType,
			deviceApiKey: opts.deviceApiKey,
			provisioningApiKey: opts.provisioningApiKey,
			apiEndpoint: opts.apiEndpoint,
			supervisorVersion: opts.supervisorVersion,
			osVersion: opts.osVersion,
			osVariant: opts.osVariant,
			macAddress: opts.macAddress,
		}),
	).timeout(opts.apiTimeout);
} catch (err) {
to determine which kind of incorrect bootstrap data could produce the unclear error message (Event: Device bootstrap failed, retrying {"delay":30000,"error":{"message":""}}). The only field that results in this error message is the apiTimeout field.
The error message is not related to incorrect deviceType reporting, since an invalid or custom device type that the API doesn't recognize produces a descriptive error. The cases tested:
- Unrecognized device type:
"error":{"message":"Unknown device type <DEVICE_TYPE>"}
- Undefined device type:
"error":{"message":"Options must contain a 'deviceType' entry."}
- And finally, a device type different from what the device actually is will provision successfully. Tested by provisioning an RPi4 as a balenaFin device type.
Therefore, the behavior observed by the user while testing a custom device type might actually be unrelated to editing device-type.json. There could be a number of reasons for a device bootstrap to time out, with an unstable network being the most likely. From a Supervisor point of view, we can log more verbosely that an apiTimeout has occurred, but more investigation on an erroring device is required to pinpoint why the timeout occurred.
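One possible shape for more verbose timeout logging is sketched below. It uses plain promises rather than Bluebird, and `ApiTimeoutError`, `withTimeout`, and `describeError` are hypothetical names introduced for illustration, not existing Supervisor code. The idea is that a timeout rejects with a named error carrying the configured duration, and that the logged payload never ends up with an empty message.

```typescript
// Hypothetical error type so a timeout is distinguishable from other failures.
class ApiTimeoutError extends Error {
  timeoutMs: number;
  constructor(timeoutMs: number) {
    super(`API request timed out after ${timeoutMs}ms`);
    this.name = 'ApiTimeoutError';
    this.timeoutMs = timeoutMs;
  }
}

// Reject with ApiTimeoutError if `promise` has not settled within `ms`.
// The timer is cleared as soon as the wrapped promise settles.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new ApiTimeoutError(ms)), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Build a log payload that is never empty, even when err.message is ''.
function describeError(err: unknown): { name: string; message: string } {
  if (err instanceof Error) {
    return {
      name: err.name,
      message: err.message || `${err.name} (no message provided)`,
    };
  }
  return { name: 'UnknownError', message: String(err) };
}

// Usage: a registration call that never settles is reported as a timeout.
withTimeout(new Promise<void>(() => {}), 50).catch((err) => {
  console.log(
    'Device bootstrap failed, retrying',
    JSON.stringify({ delay: 30000, error: describeError(err) }),
  );
});
```

Including the error name alongside the message means that even an error constructed with an empty message still identifies itself in the event payload.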
EDIT: Also, it's strange that device registration would lag for 15 minutes, since that is the default apiTimeout on the Supervisor as of v12.10.x.
A case of Event: Device bootstrap failed, retrying {"delay":30000,"error":{"message":""}} appearing repeatedly during device provisioning is related to #1787. In this instance, the Supervisor fails to provision because it attempts calls over IPv6, which do not succeed (see the issue for details). We are looking into fixing this issue.
I have been getting that nondescript error message while using a .local address for open-balena.
The .local address itself resolves fine and can be curl'ed from both the host OS and the balena_supervisor. This is on balena/open-balena-api:v0.209.2
Switching from .local to e.g. .lan and setting up a DNS server fixes the problem.
Are there plans to support .local in the balena_supervisor or is there simply something broken?