eve
[Draft] Change logging level to error from fatal
Collecting hwinfo and pushing it to the controller is a periodic task. A single error during data collection is not a fatal issue and should not crash the entire node. Moreover, we need the system to be up and running to debug issues in such scenarios. This change therefore lowers the logging level from fatal to error.
Signed-off-by: Pramodh Pallapothu [email protected]
@eriknordmark the problem in this case is that proto.Marshal is failing. If that happens at upgrade time, sure, the system will fall back. But say the system has been up and running for some time and suddenly starts hitting the issue, maybe because bad hardware makes it unable to fetch details. For some reason, in this case the customer's system suddenly started going into this crash loop.
I changed the PR to Draft so that we can give this to the customer and keep the system up and running for debugging purposes. I will amend this commit to add more tracing information.
@zedi-pramodh are we proceeding with this work? I don't know what the result was for the testing on the particular device.
@zedi-pramodh can we close this?
The customer did not see this issue anymore. But after discussing with Erik, we decided to put this patch in to catch any future issues. It's a safe patch that mostly logs when an error occurs. I will refresh my branch, retest, and update the PR.
@zedi-pramodh by any chance, did you retest this PR? Do you plan to work on it? Otherwise we can close it and reopen it later if there is a need.
I did not get a chance to work on it, and the customer is not seeing any issues. I will close this PR and reopen it when I get to it.