at_server icon indicating copy to clipboard operation
at_server copied to clipboard

Secondaries don't always restart when certificate is refreshed

Open gkc opened this issue 3 years ago • 4 comments

Describe the bug

  • Secondary server @kryz_9850 was refusing new connections for 2 days 23 hours approximately

Upon investigation - it's up and running, and it's serving the (single) client that is connected to it, which is syncing data every now and again - e.g. like this

d8f88b04-3047-5ec3-ba81-3c65f62764dd_secondary.1.z4dbs5zv8u29@canary-03.us-central1-c.c.secondaries.internal    | INFO|2022-10-08 17:47:07.330780|InboundListener|RCVD: [_61423b20-adbf-4843-a3aa-c1e54c32e5aa] batch:[{"id":1,"command":"update:ttr:-1:dataSignature:HtTviV0vsmRCX9ymjHgCJyTGGFft6DNXfzhmEAVRYJqw50uzLZ+G/bb0gKmgbV10EOx01eSRq+ioRnT16hRRxFml4Y/Gn/ku1wBwc5GkSAB1HL8xArqBMVnS0a74smTnsQo6pyJfbkfww6NISktACYm43QSImSxy1bvXXCNndbAsarOg7TAtOA887wYjZaLEUY4z8KlpRxJTjhm0GUU7WzS3mOEo2mIO4zzJAJDZ9pYjLRF/a1VfEt8qzRKJs6bu4Fq5NxqEeSPoj1gxojGMlhJ3hIsLa8zQZx9G/tL96KL0GCHko22DdMFDRkrpPnMQJAk13vls9g0F1yuYDSCW2Q==:isBinary:false:isEncrypted:false:public:kryz.kryz_9850@kryz_9850 {\"stationName\":\"KRYZ\",\"frequency\":\"98.5 Mhz\",\"ip\":\"192.168.10.12\",\"fanspeed\":\"7307.0\",\"heatsinktemp\":\"35.44\",\"peakmodulation\":\"99.969\",\"poweroutput\":\"55.893\",\"powerreflected\":\"3.232\",\"date\":\"10/8 10:46:13 AM\",\"swr\":\"1.625\"}"}] 

The server is not accepting new connections, and the logs look like this:

INFO|2022-10-08 17:45:02.754738|AtSecondaryServer|Server cannot accept connections now. 
INFO|2022-10-08 17:45:06.073043|AtCertificationValidation|Waiting for total number of active connections to 0 or equal to number of monitor connections 
INFO|2022-10-08 17:45:06.073125|AtCertificationValidation|Total number of monitor connections are 0 
INFO|2022-10-08 17:45:06.073150|AtCertificationValidation|Total number of active connections are 1 

Looking at the code in _initializeRestartProcess in at_certificate_validation.dart, this will continue forever until either (1) number of connections == 0 or (2) number of monitor connections == number of connections

[ (2) didn't initially make sense to me but basically it's saying, monitor connections don't matter, we can shut down if all we've got is monitor connections ]

Also - there is a server parameter 'forceRestart' which, when true, ignores all of this logic, and will just restart the server when the certificate expiry job runs and detects that the cert has expired. However, by default this server parameter is false.

On the one hand, it's great that both the client and the secondary have been stable for such a long time (many days, looking at the Production Issues history). On the other hand, it's not great that this results in the server becoming uncontactable, indefinitely. So we should change the behaviour.

Expected behavior Server should restart gracefully, but promptly - for example

  • Stop accepting new connections
  • Stop accepting new commands / verbs on existing connections
  • Finish processing whatever verbs / commands are being processed, and send the responses
  • Restart

gkc avatar Oct 08 '22 18:10 gkc

Additional problem: A manual restart of the secondary does not prevent the main issue from recurring as the 'restart' file, which is created by the cert renewal job and indicates to the secondary server that it needs to restart, is only deleted by the _initializeRestartProcess function. i.e. it is not removed upon server startup.

gkc avatar Oct 09 '22 13:10 gkc

@gkc I will work with you to brin storm potential solutions to this problem.

VJag avatar Oct 10 '22 03:10 VJag

Looks like this is what I was seeing with #899 and didn't find recurrence of.

cpswan avatar Oct 10 '22 09:10 cpswan

No progress in PR48

gkc avatar Oct 30 '22 16:10 gkc

@cconstab FYI I looked into server triggering clients to renegotiate TLS connection ... decided against it because (1) it seems to be generally considered unsafe to do it via the socket itself, and (2) adding in support into the protocol itself would have been a lot of work for something we already have a much easier solution for (restart!)

#1131 will be ready for review soon

gkc avatar Jan 09 '23 01:01 gkc

Burned 4SP in PR52. Moving to PR53 and re-estimating as 2SP to complete the tests and get PR reviewed

gkc avatar Jan 09 '23 11:01 gkc

Very little progress in PR53 due to other priorities, most notably APKAM draft detailed design

gkc avatar Jan 23 '23 15:01 gkc

Implemented and added unit tests a long time ago. Added functional tests during PR54. PR is ready for review

gkc avatar Feb 06 '23 15:02 gkc