SslConnector thread fails with NullPointerException
LINSTOR version
linstor controller 1.30.4; GIT-hash: bef74a44609cb592c5efad2e707b50e696623c61
Description
Hello folks.
We've identified a scenario in which the linstor-controller service gets stuck in a state where it cannot communicate with its satellite nodes. When listing storage pools with linstor storage-pool list, I noticed that all storage pools were in the "Warning" state and that the command output contained many reconnect messages. I then checked the state of the nodes with linstor node list and, to my surprise, all nodes were reported as "Online".
The logs showed that the controller was attempting to reconnect to the satellite nodes, but every attempt failed with the message Connect request failed - Connector service 'SslConnector' is stopped (see attached error report). Looking for the root cause, I found that the SslConnector had stopped because of a NullPointerException. After I restarted the linstor-controller service, all nodes connected normally.
ErrorReport-67D210F4-00000-000000.log ErrorReport-67D210F4-00000-014804.log
I don't think the NullPointerException is the main problem in this scenario. The main problem seems to be that the SslConnector thread failed for some reason, but the linstor-controller process kept running without either exiting or restarting the failed thread. As a result, the controller was stuck in a state in which it looked healthy from the point of view of external services and supervisors (such as systemd), but it was not operational because it could not communicate with its satellites.
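To illustrate the behavior I would expect, here is a minimal sketch of a supervised worker thread. It is not LINSTOR code and I have not checked how the controller actually starts its connector services. The class name ConnectorSupervisor, the restart budget, and the onGiveUp callback are all hypothetical; only Thread.setUncaughtExceptionHandler is a real JDK API. The idea is that a thread killed by an uncaught exception is either restarted or causes the process to escalate, rather than silently staying dead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical supervisor, not LINSTOR code: restarts a worker thread that
// dies from an uncaught exception and escalates once the budget is used up.
final class ConnectorSupervisor {
    private final String name;
    private final Runnable service;
    private final int maxRestarts;
    private final Runnable onGiveUp;
    private final AtomicInteger failures = new AtomicInteger();
    private final CountDownLatch finished = new CountDownLatch(1);

    ConnectorSupervisor(String name, Runnable service, int maxRestarts, Runnable onGiveUp) {
        this.name = name;
        this.service = service;
        this.maxRestarts = maxRestarts;
        this.onGiveUp = onGiveUp;
    }

    void start() {
        spawn();
    }

    private void spawn() {
        Thread worker = new Thread(() -> {
            service.run();
            // Reached only when the service returns without throwing.
            finished.countDown();
        }, name);
        // The handler runs in the dying thread when run() throws.
        worker.setUncaughtExceptionHandler((thread, error) -> {
            int count = failures.incrementAndGet();
            System.err.println(thread.getName() + " died (failure " + count + "): " + error);
            if (count <= maxRestarts) {
                spawn(); // replace the dead thread with a fresh one
            } else {
                try {
                    onGiveUp.run(); // escalate, e.g. terminate the process
                } finally {
                    finished.countDown();
                }
            }
        });
        worker.start();
    }

    int failureCount() {
        return failures.get();
    }

    // Waits until the service has returned normally or the supervisor gave up.
    boolean awaitFinished(long timeout, TimeUnit unit) throws InterruptedException {
        return finished.await(timeout, unit);
    }
}
```

With something like this, passing () -> System.exit(1) as onGiveUp would let a supervisor such as systemd notice the failure and restart the service (for example with Restart=on-failure in the unit file), instead of the process staying up while the connector is stopped.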