ubxlib MQTT Broker Disconnects MQTT +UUMQTTC 0,100

Using the SARA-R5 and connecting to AWS IoT Core im running into a weird issue where if my connection.inactivityTimeoutSeconds is over 300 it disconnects with an error that equates to: "keep alive time expired, the MQTT broker released the connection." I would like to use an inactivityTimeoutSeconds of 1200 (which is the max that aws IoT Core supports but something seems to be preventing that? Or maybe I'm doing something wrong here?

//setup the connection
connection.pBrokerNameStr = AWS_BROKER;
connection.inactivityTimeoutSeconds = 60;
connection.keepAlive = true;
connection.pClientIdStr = get_device_mac();

Sep 23 '23 06:09 michaelboeding

That looks correct to me: if you have an at AT log you should see:

AT+UMQTT=10,60

...being sent to the module around the time the client is being configured before connection.

@philwareublox: I know you've looked into these 'ere timeouts a lot, are there any things I might have misunderstood about them?

Sep 23 '23 19:09 RobMeades

I did more testing on this and looked into the AWS IoT Core broker that Im using. Apparently you can Ping the broker as quickly as 30 seconds without being billed. So this isn't a critical item for my application but still somewhat weird. Only thing I am still kind of worried about is if I try to create a power saving mode that should only check in say every 1200 seconds, Id be worried that the connection would be lost.

This may be a stupid question but I was also looking for some type of callback when this event happened (error or disconnects so I could react to them) but couldn't seem to find any.

Sep 23 '23 19:09 michaelboeding

Would a callback set with uMqtttClientSetDisconnectCallback() not fire?

Sep 23 '23 19:09 RobMeades

I completely missed that...thanks! I'll add it now.

Sep 23 '23 19:09 michaelboeding

Just some more context for this I am seeing this come up after the cellular module has been running for awhile. Not sure if its related to the above issue or not.

Creating CellularLayer...
I (164749) uart: queue free spaces: 20
U_CELL: initialising with enable power pin 2 (0x02) (where 1 is on), PWR_ON pin 25 (0x19) (and is toggled from 1 to 0) and VInt pin 36 (0x24) (and is 1 when module is on).
I (164779) gpio: GPIO[25]| InputEn: 0| OutputEn: 1| OpenDrain: 1| Pullup: 1| Pulldown: 0| Intr:0 
I (164789) gpio: GPIO[2]| InputEn: 1| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (164799) gpio: GPIO[36]| InputEn: 1| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0

Sep 24 '23 20:09 michaelboeding

Which part of the print above is the bit that comes up after it has been running for a while? I guess Creating CellularLayer comes from the application code (at least, I don't recognise it), while the print U_CELL: initialising with... is emitted from a call to uCellAdd(), which uDeviceOpen() calls, and I doubt that you would be calling that part way through your application's life. The other prints are all from ESP-IDF of course, when the UART to talk to the cellular module is opened and the GPIOs to control it are configured, all again startup things.

Sep 24 '23 20:09 RobMeades

Yes - that's my code. Im switching between BLE, Wi-Fi, and Cellular. So I would assume that the onDissconect got called for the mqtt. And then it tried to reactivate the cellular layer. It does look like I reset the whole layer on a disconnect and then retry....Im going to mess with changing that around...it definitely doesn't need to rebuild the layer just reconnect.

Sep 24 '23 20:09 michaelboeding

Complicated stuff. Unless you are resource constrained it would indeed probably be best to do the uDevice stuff once at start of day/end of day and only do network up/down, or call the layer-specific functions as required, between.

If you do call uDeviceClose(), to free resources, then, for cellular, consider whether you want the module to power down or not; usually for any intermediate stuff it is best to uDeviceClose() it but not switch it off, otherwise you need to allow for a 10ish second power-on time. Only power it off when you're done (unless you are power constrained and/or don't care about the time).

Sep 24 '23 21:09 RobMeades

Yeah its starting to drive me crazy....getting closer finally and stuff is starting to come together but write the payloads and servicing them via 3 separate transports is a lot. Not to mention all the app and server related stuff.

Thanks I'll look into adding that, I am more power constrained than anything. I actually don't think this has anything to do with the original keepAlive problem now that I'm debugging a little more.

Sep 24 '23 21:09 michaelboeding

Just a quick update:

It looks like the PING message being sent is small (2 bytes or so) but the TLS overhead attached to each PING message can still be around 38 bytes per send. So being able to have a long keep alive time is actually pretty critical for constrained devices in both power and data usage.

I'm going to do some more testing with longer keep alive times, 60 seconds seems to work all the time though just an fyi.

Sep 24 '23 23:09 michaelboeding

Heres a quick calculation

Example 1: Keep-Alive Time = 60 seconds (With TLS Overhead)

Variables:

Payload Size per Message: 33 bytes
TLS Overhead: 40 bytes
Keep-Alive Time: 60 seconds
Ping Message Size: 2 bytes + 40 bytes (TLS) = 42 bytes
Ping Frequency: Every 60 seconds
Data Payload Frequency: Every 1800 seconds (30 minutes)

Calculation:

Number of data payloads in a day: 86400 seconds/day / 1800 seconds = 48
Number of ping messages in a day: 86400 seconds/day / 60 seconds = 1440
Total bytes for data payload in a day: 48 payloads * (33 bytes + 40 bytes) = 3504 bytes
Total bytes for ping messages in a day: 1440 pings * 42 bytes = 60480 bytes
Overall total bytes in a day: 3504 bytes (data) + 60480 bytes (ping) = 63984 bytes/day
Total in MB: 63984 bytes = 0.06098 MB/day
Days to reach 1 MB: 1 MB / 0.06098 MB/day = 16.40 days

Example 2: Keep-Alive Time = 1200 seconds (With TLS Overhead)

Variables:

Payload Size per Message: 33 bytes
TLS Overhead: 40 bytes
Keep-Alive Time: 1200 seconds
Ping Message Size: 2 bytes + 40 bytes (TLS) = 42 bytes
Ping Frequency: Every 1200 seconds
Data Payload Frequency: Every 1800 seconds (30 minutes)

Calculation:

Number of data payloads in a day: 86400 seconds/day / 1800 seconds = 48
Number of ping messages in a day: 86400 seconds/day / 1200 seconds = 72
Total bytes for data payload in a day: 48 payloads * (33 bytes + 40 bytes) = 3504 bytes
Total bytes for ping messages in a day: 72 pings * 42 bytes = 3024 bytes
Overall total bytes in a day: 3504 bytes (data) + 3024 bytes (ping) = 6528 bytes/day
Total in MB: 6528 bytes = 0.006216 MB/day
Days to reach 1 MB: 1 MB / 0.006216 MB/day = 160.85 days

Sep 24 '23 23:09 michaelboeding

Thanks for the numbers: plainly a huge difference.

Do you have a log of what the AT interface is sending around the point that the unexpected 300-second-inactivity disconnect occurs? This would show the disconnect cause on the end of the +UUMQTTC: 0,x URC, and the ubxlib code will also have asked the module for the last error code, so in addition it should show something like:

AT+UMQTTER
+UMQTTER: y,z
OK

...where z is of particular interest, a module-type-specific error code, in your case listed in section A.8.4.1 of the SARA-R5 AT manual.

Sep 25 '23 07:09 RobMeades

Yes heres the log Im getting when it disconnects. Not sure if this is very useful

+UUMQTTC: 0,100
AT
AT

OK
AT+UMQTTER

+UMQTTER: 13,0

Sep 25 '23 16:09 michaelboeding

Hmph: yes, you're right, not very useful. How sure are you that this is not the broker? I guess when you're connected over Wifi via ESP-IDF you don't get an inactivity timeout from the broker at 300 seconds?

Sep 25 '23 16:09 RobMeades

Correct - very sure this isn't the broker. That and its AWS IoT Core so that max keep alive is 1200. Seems to be right around the 300 mark - I can try to test and see if there's an exact time.

Sep 25 '23 16:09 michaelboeding

It's just that I can't see how the module can be doing it; there is no reason for the module to run an inactivity timer, it should have no concept of inactivity. Unless the cellular network's firewall is closing the TCP connection at 5 minutes to save on resources?

Don't suppose you are using a roaming SIM are you? If so it might be persuaded to try a different network.

Sep 25 '23 16:09 RobMeades

Is there anyway to get more debug logs from it to see exactly what's going on? They definitely could be, I have 3 separate SIM cards im using right now but they are all from different providers. Currently using the Super SIM from Twilio, but have also seen this on Symetry and Hologram SIM cards.

Sep 25 '23 16:09 michaelboeding

Is there anyway to get more debug logs from it to see exactly what's going on?

The logging tools might need an NDA (and would certainly need access to the USB port of the module, don't know if that's a problem), I'll check.

The other experiment you could try is opening a TLS socket to somewhere else, i.e. not the broker, and see if that gets spontaneously closed by the far end at the 300 second-mark: see our example of how to open a TLS socket - you could do this to ubxlib.com:5065, which is our test server (you'll see this is what the example does).

Sep 25 '23 16:09 RobMeades

Looks like from my logs.

First attempt closed the mqtt connection after 2 minutes and 30 seconds Second attempt closed the mqtt connection after 2 minutes and 30 seconds

So it seems to be happening at the same time even with different inactivityTimeoutSeconds set.

Sep 25 '23 16:09 michaelboeding

Is there anyway to get more debug logs from it to see exactly what's going on?

The logging tools might need an NDA (and would certainly need access to the USB port of the module, don't know if that's a problem), I'll check.

The other experiment you could try is opening a TLS socket to somewhere else, i.e. not the broker, and see if that gets spontaneously closed by the far end at the 300 second-mark: see our example of how to open a TLS socket - you could do this to ubxlib.com:5065, which is our test server (you'll see this is what the example does).

I will try this - if required an NDA is fine.

Sep 25 '23 16:09 michaelboeding

All of my SIMs are roaming SIMS btw.

Sep 25 '23 16:09 michaelboeding

I have been experimenting with some other stuff (I still need to test with your example). But while running on a non-secure self hosted mqtt broker I got this error message after about maybe 5 minutes? This is with a keepAlive of 30 seconds.

+UUMQTTC: 8,0

+UUMQTTC: 0,100 AT

OK AT+UMQTTER

+UMQTTER: 13,33

Sep 25 '23 23:09 michaelboeding

I was finally able to get your example running and it looks like at 2 minutes and 30 seconds I run into a message that is

+UUSOCL: 0

Looks like this is the socket closed....definitely seems like something in TCP/Firewall/Proxy settings are maybe causing the timeout which is not ideal for MQTT. Is there anything else I can do to help debug this?

Sep 26 '23 00:09 michaelboeding

Hi Michael. Different networks will behave differently, and differences with Roaming vs Home SIMs too. If you are finding the TCP socket is being closed after the same 2mins 30 seocnds, I'd suggest that's just how the cellular's network's firewall is configured. The cellular network only has a limited number of resources and if the module is not chatting, it will be closed after x time. Unfortunately this is much less time on a cellular network compared to a wired network.

Sep 26 '23 07:09 philwareublox

The AT+UMQTTC=8,x command sets the MQTT ping, which is sent to the MQTT broker at the moment the keep-alive timer expires, to on or off. I've personally never seen the "8" emitted as a URC; my guess would be that +UUMQTTC: 8,0 might mean that the ping has failed, i.e. for whatever reason the module did not get an ack back from the broker for the ping and that has led to a disconnect. The last error code, +UMQTTER: 13,33, the 33 means timeout (see section A.8.4.1 of the SARA-R5 AT manual), which would point in that direction. I will enquire internally as to whether this interpretation is reasonable.

On the original point though, your experiment with running a TLS socket to our echo server shows that there is something between your application and any TCP peer entity, when running over cellular, which has a 2.5 minute inactivity timeout, so the culprit widens beyond MQTT. As @philwareublox says, this is likely the cellular network: in this case logging the module behavour is likely not worth it as all we will see is the network waking up the connection to close the TCP socket.

Some possible avenues of investigation:

repeat the last experiment with a non-secure connection (use ubxlib.com:5055 as your test peer and omit the call to uSockSecurity()) just to see if this timeout is somehow related to secure connections only.
try a different cellular network: for this we'd need to determine which networks you can see from your location: the for() loop in this example shows how to do a scan and read the networks out. You will see something like:

AT+COPS=?
+COPS: (2,"3 UK","3 UK","23420",7),(1,"001 01","001 01","00101",7),(1,"234 32","234 32","23432",7),(1,"O2 - UK","O2 - UK","23410",7),(1,"EE","EE","23430",7),(1,"vodafone UK","voda UK","23415",7),(1,"262 95","262 95","26295",7),,(0,1,2,3,4),(0,1,2)
OK
### 7: network: 23420
### 5: network: 00101
### 4: network: 23432
### 3: network: 23410
### 2: network: 23430
### 1: network: 23415
### 0: network: 26295

You could then try issuing uCellNetConnect() with the pMccMnc field set to one of these networks, e.g. for the network 23420 you would use the string "23420"; pick a network that is NOT the one you are currently registered to and see what happens. Depending on the roaming agreements your SIM provider has in place this may or may not succeeded but, if it does succeed, you could retry the "does my socket close" experiment to see if that network is configured differently.

If you find one that does work better for you we could try configuring the network preference list in the SIM to put that at the top.

Of course, I don't know whether such an approach would help with your final application.

Another thought is whether there's a less-costly thing you could do which would somehow tickle the cellular network into keeping the TLS connection you are concerned with alive. Just as an experiment, it might be worth opening a socket to ubxlib:5055 [i.e. non-secure] AT THE SAME TIME AS YOUR MQTT CONNECTION and sending a byte or so over that non-secure connection, say, every two minutes. If the thing in the cellular network that is getting in our way is not so much an IP-level firewall as a gateway service which knows which mobile device it is serving, it may see this as an indication that the mobile is alive and therefore not shut down any connections associated with it. Clutching at straws, I know.
Finally, of course, there's talking to your network provider; at a pinch maybe this is something that could be configured for your account. Or maybe there are other SIM providers who specifically offer such a feature as a USP for their APNs?

Sep 26 '23 07:09 RobMeades

FYI, on the +UUMQTTC: 8,0 thing, that interpretation is likely correct: the MQTT client in the module sent a PINGREQ packet to lower layers and waited for a PINGRESP packet; for some reason transmission of the outgoing PINGREQ packet failed and the broker, not receiving the PINGREQ in time (timeout) closed the MQTT session, hence the +UUMQTTC: 0,100.

Sep 26 '23 08:09 RobMeades

@eeflis: if you're listening, I know that your application is a very low power one, I'm not sure if you've seen issues with the cellular network always shutting down any TCP connection you might have open at around 2.5 minutes of inactivity? If so, have you found a way to persuade it to extend such timers to more like 20 minutes?

Sep 26 '23 08:09 RobMeades

Hi Rob Yes I have followed the conversation. We use a protocol other than MQTT. However, we generally use application-layer security rather than transport-layer security so that a security handshake does not have to be performed after each connection establishment. This way, there are also no problems with NAT and invalid security context and the device can be in deep sleep for as long as desired. Whenever possible we also use UDP instead of TCP especially for cellular connections which are not always stable. This additionally minimizes the protocol overhead and reduces the amount of data and power consumption.

Sep 26 '23 09:09 eeFLis

All great info and I'll look into some of this.

AT&T seems to be a 1800 connection timeout for TCP. https://developer.att.com/technical-library/network-technologies/network-timers

SuperSIM TCP seems to be fairly long also at 6 hours. https://www.twilio.com/docs/iot/supersim/super-sim-network-timeouts

Sep 26 '23 13:09 michaelboeding

I guess the next question while I try to figure out provider constraints directly from my sales contacts is does anyone have any recommendations besides MQTT for transport to reduce the data payloads over cellular. I technically can change my protocol based on my transport layer in my application, not ideal and may require a gateway to connect to AWS IoT Core but the current usage seems too high.

Sep 26 '23 15:09 michaelboeding

ubxlib ubxlib copied to clipboard

MQTT Broker Disconnects MQTT +UUMQTTC 0,100

Example 1: Keep-Alive Time = 60 seconds (With TLS Overhead)

Variables:

Calculation:

Example 2: Keep-Alive Time = 1200 seconds (With TLS Overhead)

Variables:

Calculation:

ubxlib
ubxlib copied to clipboard