aedes icon indicating copy to clipboard operation
aedes copied to clipboard

fix: send LWT on broker restart

Open friesendrywall opened this issue 2 years ago • 25 comments

This will require companion fixes on persistences, see aedes-persistence-redis #104 for example.

The core issue is that clients state is not stored in any way, so these wills get orphaned in the db whenever the server restarts.

friesendrywall avatar Jul 16 '22 18:07 friesendrywall

Pull Request Test Coverage Report for Build 2682867428

  • -5 of 7 (28.57%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.7%) to 99.128%

Changes Missing Coverage Covered Lines Changed/Added Lines %
aedes.js 2 7 28.57%
<!-- Total: 2 7
Totals Coverage Status
Change from base Build 2337078913: -0.7%
Covered Lines: 810
Relevant Lines: 815

💛 - Coveralls

coveralls avatar Jul 20 '22 08:07 coveralls

Pull Request Test Coverage Report for Build 2724226001

  • 6 of 7 (85.71%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 99.564%

Changes Missing Coverage Covered Lines Changed/Added Lines %
aedes.js 6 7 85.71%
<!-- Total: 6 7
Totals Coverage Status
Change from base Build 2337078913: -0.3%
Covered Lines: 814
Relevant Lines: 815

💛 - Coveralls

coveralls avatar Jul 20 '22 08:07 coveralls

I'm not sure if this will meet the standard, perhaps it is somewhat of a kludge, but see the new "delivers old will in case of a restart on the same brokerId" test, I don't know right off how to get this to pass without using the sendWillsOnStartup flag. Some other way to address this issue would be ok, but I'm at my limit of time on this.

Other notes, am I missing something, or would the present method of checkAndPublish not bypass packet authorization?

friesendrywall avatar Jul 23 '22 16:07 friesendrywall

@mcollina What do you think about this? I cannot find anything in specs about how to handle broker restarts properly, but IMO it makes sense to send will on startup, otherwise we would have lot of orphans in persistence (the alternative would be to just delete them)

robertsLando avatar Jul 25 '22 07:07 robertsLando

Other notes, am I missing something, or would the present method of checkAndPublish not bypass packet authorization?

@friesendrywall correct, checkAndPublish should also call authorize publish, that's another error, could you add that please? Also that one should be covered with a test

robertsLando avatar Jul 25 '22 07:07 robertsLando

About the checkAndPublish authorize, what is the correct way? There is no complete client object to pass, it would only be able to pass the clientId from the packet. Should this rather be authorized elsewhere? Maybe this should be a separate pull request.

friesendrywall avatar Jul 25 '22 13:07 friesendrywall

You right, without client id it wouldn't work.. IMO the best solution would be to autohrize will on connect and if auth fails just remove it and don't store it at all

robertsLando avatar Jul 25 '22 15:07 robertsLando

@robertsLando, @friesendrywall

  1. from MQTT 3.11 spec Section 3.1.2.5

Situations in which the Will Message is published include, but are not limited to: An I/O error or network failure detected by the Server. The Client fails to communicate within the Keep Alive time. The Client closes the Network Connection without first sending a DISCONNECT Packet. The Server closes the Network Connection because of a protocol error.

The Server SHOULD publish Will Messages promptly. In the case of a Server shutdown or failure the server MAY defer publication of Will Messages until a subsequent restart. If this happens there might be a delay between the time the server experienced failure and a Will Message being published.

  1. From HiveMQ doc,

The Last Will and Testament feature provides a way for clients to respond to ungraceful disconnects in an appropriate way.

  1. From a local vernemq dev env, restarting a server will not trigger LWT. Instead a client keeps sending CONNECT to resume its connection in best effort within its keep alive period.
  2. If the persistence is a memory storage layer, server can't send LWT if sendWillsOnStartup is set to TRUE. It will make confusion in different outcomes in a persistence point of view.

IMO, sending LWT is on a client state perspective while sendWillsOnStartup is in server perspective that violates what ungraceful disconnection stated in above. What I suggested is to restore all client previous state after restart, and let aedes sends LWT itself if a client does not respond within a keep alive period.

gnought avatar Jul 26 '22 17:07 gnought

What I suggested is to restore all client previous state after restart, and let aedes sends LWT itself if a client does not respond within a keep alive period.

What do you mean with this in practise? By reading docs they just say that in case of server shutdown or failure the LWT should be sent on restart so IMO the sendWillsOnStartup should be considered true by default, the behaviour introduced by it it's spec compliance

robertsLando avatar Jul 27 '22 06:07 robertsLando

@robertsLando will the LWT be suppressed in cluster or HA mode when sendWillsOnStartup is TRUE?

gnought avatar Jul 27 '22 09:07 gnought

will the LWT be suppressed in cluster or HA mode when sendWillsOnStartup is TRUE

Yeah you right, in case of a cluster this would not work, so what you mean with 'restore all client previous state after restart,' ? How should we do that? Recreate all clients for each LWT we found associated with our broker id?

robertsLando avatar Jul 27 '22 09:07 robertsLando

I'm a little non hopeful anyone will like this commit, its close to the point of over and out and perhaps just patching this by publishing to a wildcard to override this behaviour.

I don't really have a sufficient in depth understanding of the persistence layer to modify it at this point, but it really needs it to correctly make this work because it needs to have most of the original client data to manage this smoothly. For example, it seems like it really should track the keepAlive interval and honor that. The simplest option that would require no persistence layer change would be to add extra data to the packet, but that probably has unintended consequences.

Also, I created a new bug report for the non auth, it really is a different issue, although it too will probably require a persistence layer addition.

friesendrywall avatar Jul 30 '22 19:07 friesendrywall

Ideally, the persistence layer would track the last client keepAlive, but that would probably affect the overall performance.

friesendrywall avatar Jul 30 '22 19:07 friesendrywall

Aedes checks and sends LWT periodically. If you restart the broker with same broker id, the logic of LWT delivery resumes in its first heartbeat for non-memory persistence. https://github.com/moscajs/aedes/blob/main/aedes.js#L99-L111 @robertsLando @friesendrywall

gnought avatar Jul 31 '22 13:07 gnought

@gnought Lines 119 to 121 filter any active brokers so previous wills don’t ever go out because they don’t “needsPublishing”. This all works ok with auto generated id’s.

friesendrywall avatar Jul 31 '22 16:07 friesendrywall

Couldn't we fix this by simply using a flag that will force send existing wills on startup like @friesendrywall did in first PR? If we wait the broker heartbeat interval it could be the client has been already reconnected to broker

robertsLando avatar Aug 01 '22 09:08 robertsLando

IMO a server should not send a client LWT if a server restarts within a client keepalive period.

gnought avatar Aug 01 '22 12:08 gnought

@gnought Yeah but what if the keep alive period is already elapsed and, in the meanwhile, client connects? That Will would never be published. I think it would be more acceptable to publish it before keep alive interval instead of not publish it at all, also a server failure should be considered as an unexpected disconnection so it should be always notified

robertsLando avatar Aug 01 '22 12:08 robertsLando

@mcollina If you have other suggestions them would be much appreciated :)

robertsLando avatar Aug 02 '22 06:08 robertsLando

What I understand the MAY defer is

the server MAY defer publication of Will Messages until a subsequent restart

  • if a server restarts within a client keep alive period, no LWT should be sent. This could be regarded as a network hiccups in client perspective.
  • If the restart is over client keep alive period, LWT should be sent.

@robertsLando yes, if a broker id is same in each subsequent restarts:

  • In a cluster mode, the restarted broker brokers array will be re-populated during the heartbeat. Aedes will send old LWT.
  • In a non-cluster mode with non-memory persistence, Aedes will also send old LWT after restart.

gnought avatar Aug 02 '22 10:08 gnought

@gnought Ok I agree with you, makes sense.

Just to try I did a test with mosquitto and this is what happped (version 2.0.14):

  • If mosquitto is closed gracefully it immediatly sends LWT messages before closing, this could be correct
  • If mosquitto is force closed (sudo kill -9 $(pidof mosquitto)) even if the broker remains offline for a period greather then the keepalive period it never sends LWT (that's definetly wrong)

robertsLando avatar Aug 02 '22 12:08 robertsLando

Do you have a external persistence session storage layer in mosquito server? @robertsLando

gnought avatar Aug 02 '22 12:08 gnought

Do you have a external persistence session storage layer in mosquito server? @robertsLando

Yes, I have enabled persistence in mosquitto.conf. Also the db file exists so it means he is using it

robertsLando avatar Aug 02 '22 13:08 robertsLando

Hmm, this is same as the current Aedes behaviour. The obsolete LWT is left in the persistent storage of mosquito.

gnought avatar Aug 02 '22 13:08 gnought

Hmm, this is same as the current Aedes behaviour

Aedes is not sending LWT on close BTW

robertsLando avatar Aug 02 '22 16:08 robertsLando

I'm closing this pull request because I don't have the resources to complete, also I'm not sure how to bring this together.

friesendrywall avatar Aug 29 '22 14:08 friesendrywall