Secondaries don't always restart when certificate is refreshed
Describe the bug
- Secondary server
@kryz_9850was refusing new connections for 2 days 23 hours approximately
Upon investigation - it's up and running, and it's serving the (single) client that is connected to it, which is syncing data every now and again - e.g. like this
d8f88b04-3047-5ec3-ba81-3c65f62764dd_secondary.1.z4dbs5zv8u29@canary-03.us-central1-c.c.secondaries.internal | INFO|2022-10-08 17:47:07.330780|InboundListener|RCVD: [_61423b20-adbf-4843-a3aa-c1e54c32e5aa] batch:[{"id":1,"command":"update:ttr:-1:dataSignature:HtTviV0vsmRCX9ymjHgCJyTGGFft6DNXfzhmEAVRYJqw50uzLZ+G/bb0gKmgbV10EOx01eSRq+ioRnT16hRRxFml4Y/Gn/ku1wBwc5GkSAB1HL8xArqBMVnS0a74smTnsQo6pyJfbkfww6NISktACYm43QSImSxy1bvXXCNndbAsarOg7TAtOA887wYjZaLEUY4z8KlpRxJTjhm0GUU7WzS3mOEo2mIO4zzJAJDZ9pYjLRF/a1VfEt8qzRKJs6bu4Fq5NxqEeSPoj1gxojGMlhJ3hIsLa8zQZx9G/tL96KL0GCHko22DdMFDRkrpPnMQJAk13vls9g0F1yuYDSCW2Q==:isBinary:false:isEncrypted:false:public:kryz.kryz_9850@kryz_9850 {\"stationName\":\"KRYZ\",\"frequency\":\"98.5 Mhz\",\"ip\":\"192.168.10.12\",\"fanspeed\":\"7307.0\",\"heatsinktemp\":\"35.44\",\"peakmodulation\":\"99.969\",\"poweroutput\":\"55.893\",\"powerreflected\":\"3.232\",\"date\":\"10/8 10:46:13 AM\",\"swr\":\"1.625\"}"}]
The server is not accepting new connections, and the logs look like this:
INFO|2022-10-08 17:45:02.754738|AtSecondaryServer|Server cannot accept connections now.
INFO|2022-10-08 17:45:06.073043|AtCertificationValidation|Waiting for total number of active connections to 0 or equal to number of monitor connections
INFO|2022-10-08 17:45:06.073125|AtCertificationValidation|Total number of monitor connections are 0
INFO|2022-10-08 17:45:06.073150|AtCertificationValidation|Total number of active connections are 1
Looking at the code in _initializeRestartProcess in at_certificate_validation.dart, this will continue forever until either (1) number of connections == 0 or (2) number of monitor connections == number of connections
[ (2) didn't initially make sense to me but basically it's saying, monitor connections don't matter, we can shut down if all we've got is monitor connections ]
Also - there is a server parameter 'forceRestart' which, when true, ignores all of this logic, and will just restart the server when the certificate expiry job runs and detects that the cert has expired. However, by default this server parameter is false.
On the one hand, it's great that both the client and the secondary have been stable for such a long time (many days, looking at the Production Issues history). On the other hand, it's not great that this results in the server becoming uncontactable, indefinitely. So we should change the behaviour.
Expected behavior Server should restart gracefully, but promptly - for example
- Stop accepting new connections
- Stop accepting new commands / verbs on existing connections
- Finish processing whatever verbs / commands are being processed, and send the responses
- Restart
Additional problem: A manual restart of the secondary does not prevent the main issue from recurring as the 'restart' file, which is created by the cert renewal job and indicates to the secondary server that it needs to restart, is only deleted by the _initializeRestartProcess function. i.e. it is not removed upon server startup.
@gkc I will work with you to brin storm potential solutions to this problem.
Looks like this is what I was seeing with #899 and didn't find recurrence of.
No progress in PR48
@cconstab FYI I looked into server triggering clients to renegotiate TLS connection ... decided against it because (1) it seems to be generally considered unsafe to do it via the socket itself, and (2) adding in support into the protocol itself would have been a lot of work for something we already have a much easier solution for (restart!)
#1131 will be ready for review soon
Burned 4SP in PR52. Moving to PR53 and re-estimating as 2SP to complete the tests and get PR reviewed
Very little progress in PR53 due to other priorities, most notably APKAM draft detailed design
Implemented and added unit tests a long time ago. Added functional tests during PR54. PR is ready for review