do not "Taking down network..." after Unable to connect to server?

Open alexmaron81 opened this issue 3 years ago • 11 comments

Hi Rob,

image

and here the code getNTP.txt

alexmaron81 avatar Aug 09 '22 09:08 alexmaron81

Hi Alex: well it looks as though the module is not responding to anything, ever: not a single response to any AT command in that picture you have pasted. So whatever went wrong, it was before this. Do you have more information?

RobMeades avatar Aug 11 '22 20:08 RobMeades

Hi Rob,

what would be the general correct procedure to respond to any "unable to..."?

Our application:
- collects data
- connects to the MQTT service and sends the data.

There are some cases that can occur:
- Unable to publish our message
- Unable to subscribe to topic
- Unable to connect to MQTT broker
- Unable to create MQTT instance!
- Unable to bring up the network!

alexmaron81 avatar Aug 16 '22 07:08 alexmaron81

What I've tried so far is to reboot the module if no connection could be established.

image

alexmaron81 avatar Aug 16 '22 07:08 alexmaron81

And what exactly does the function uCellPwrRebootIsRequired(devHandle) do?

alexmaron81 avatar Aug 16 '22 08:08 alexmaron81

Hi Alex: for an "unable to" related to MQTT I would suggest simply retrying a few times; the ubxlib code already retries a few times if the reported error could be due to radio conditions. For an "unable to" related to the network, taking the network down and up again might suffice, but there is no harm in performing a soft reboot, i.e. calling uCellPwrReboot(), provided you don't call it too often. The advantage of calling uCellPwrReboot() is that, if the module is unresponsive, it will try successively harder reboot mechanisms until the module comes back. The disadvantage is that pulling the power while the module is on is not advised: in extreme cases it might brick the module. On the other hand, if the module is not responsive then it is not a whole lot of use to you either.
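As a rough illustration of the "retry a few times, then escalate gently" idea above, here is a minimal sketch in plain C. It does not call ubxlib directly: operation_t, retryOperation(), chooseRecovery(), flakyOperation() and the recovery_t values are hypothetical names invented for this example, and the ubxlib functions are only mentioned in comments. The reboot budget is an assumption added to capture "provided you don't call it too often".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical operation type: returns 0 on success, negative on failure.
// In a real application this would wrap e.g. an MQTT connect or publish.
typedef int (*operation_t)(void *pContext);

// Try an operation up to maxAttempts times, stopping at the first success.
// Returns the result of the last attempt made (negative if none was made).
static int retryOperation(operation_t op, void *pContext, size_t maxAttempts)
{
    int result = -1;
    assert(op != NULL);
    for (size_t x = 0; (x < maxAttempts) && (result < 0); x++) {
        result = op(pContext);
    }
    return result;
}

typedef enum {
    RECOVERY_NONE,          // nothing wrong at the network level
    RECOVERY_CYCLE_NETWORK, // take the network down and bring it up again
    RECOVERY_SOFT_REBOOT,   // e.g. where the application would call uCellPwrReboot()
    RECOVERY_DEFER          // reboot budget used up: stop and try again later
} recovery_t;

// Pick the mildest recovery step that has not yet been tried.
static recovery_t chooseRecovery(bool networkFailed, size_t networkCyclesTried,
                                 size_t rebootsUsed, size_t rebootBudget)
{
    if (!networkFailed) {
        return RECOVERY_NONE;
    }
    if (networkCyclesTried == 0) {
        return RECOVERY_CYCLE_NETWORK;
    }
    if (rebootsUsed < rebootBudget) {
        return RECOVERY_SOFT_REBOOT;
    }
    return RECOVERY_DEFER;
}

// Stand-in for a flaky operation: fails until the counter reaches zero.
static int flakyOperation(void *pContext)
{
    int *pFailuresRemaining = (int *) pContext;
    if (*pFailuresRemaining > 0) {
        (*pFailuresRemaining)--;
        return -1;
    }
    return 0;
}
```

With a counter of 2 and maxAttempts of 3, retryOperation(flakyOperation, &counter, 3) succeeds on the third attempt; with a counter of 5 it gives up, and the application would then consult something like chooseRecovery().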

The function uCellPwrRebootIsRequired() checks an internal flag that the ubxlib code will set if you have called a function such as changing the MNO profile, or the band-mask, etc., one which a given module type might require you to reboot afterwards for the setting to take effect. You can call it as a kind of "did I forget to reboot" reminder, if you see what I mean.

RobMeades avatar Aug 16 '22 17:08 RobMeades

Hi Rob, I still have one thing.

The application wants to do two things, "get the current time" and then "connect to MQTT", but I don't want to start and shut down the module for every task. What is the best approach if, e.g., it does not get the correct time (I would have to shut down the module to try again) while the MQTT connection is still pending?

What is the best way to build the logic?

alexmaron81 avatar Aug 17 '22 13:08 alexmaron81

OK, I think I recall from code you've posted here before that your application has several asynchronous tasks, any of which might be calling into the ubxlib APIs. If that is the case then I guess you need to have one task which is the main controller/supervisor of the module and have your other tasks respect that task's control somehow.

You could maybe wrap all access to the ubxlib APIs from non-supervisor tasks in accessor functions so that they could either return a specific "NOT NOW, SUPERVISOR IS ACTING" failure, or could block on a semaphore (though you'd need to make sure you don't get stuck with that approach) if the supervisor was doing something. Then the supervisor task could poke the module with uCellPwrIsAlive(), say every 10 seconds, and, if it didn't respond for, say, 6 attempts, the supervisor could block everyone else and reboot the module.

If you wanted the recovery process to be more reactive than that, you could have any task that suspects the module is not behaving signal the supervisor task to get it to take some sort of coordinated action, e.g. do an immediate uCellPwrIsAlive() check or some such.

The danger with all of these is that the supervisor task will have no context, no "system state" knowledge, so it could blunder in and reboot when actually you'd asked the module to do something that just takes a lot of time. You could end up making the problem worse; this is simply down to testing, testing, testing, testing of course.

EDIT: to be clear, ubxlib is thread-safe, so there is no coordination issue at that level; I believe what you are asking is how to implement some form of coordinated module-monitoring/control in your multiple-task application.
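The supervisor idea above (poll every 10 seconds or so, act after 6 missed checks) could be sketched roughly as follows. This is only the counting logic, in plain C: supervisor_t, supervisorPoll() and the action values are hypothetical names, the moduleAlive argument stands for the result of a liveness check such as uCellPwrIsAlive(), and the busy flag is an assumption added here as one possible way to give the supervisor a little "system state" so that it does not reboot during a known-slow operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUPERVISOR_MAX_MISSED_CHECKS 6  // "say, 6 attempts"

typedef struct {
    size_t missedChecks; // consecutive failed liveness checks
    bool busy;           // set by the application during a known-slow operation
} supervisor_t;

typedef enum {
    SUPERVISOR_OK,     // module responded
    SUPERVISOR_WAIT,   // no response, but not yet time to act
    SUPERVISOR_REBOOT  // block the other tasks and reboot the module
} supervisorAction_t;

// Call once per polling interval (e.g. every 10 seconds) with the result
// of the liveness check; returns what the supervisor task should do next.
static supervisorAction_t supervisorPoll(supervisor_t *pSupervisor,
                                         bool moduleAlive)
{
    assert(pSupervisor != NULL);
    if (moduleAlive) {
        pSupervisor->missedChecks = 0;
        return SUPERVISOR_OK;
    }
    if (pSupervisor->busy) {
        // A slow operation is in progress: don't count this miss
        return SUPERVISOR_WAIT;
    }
    pSupervisor->missedChecks++;
    if (pSupervisor->missedChecks >= SUPERVISOR_MAX_MISSED_CHECKS) {
        pSupervisor->missedChecks = 0;
        return SUPERVISOR_REBOOT;
    }
    return SUPERVISOR_WAIT;
}
```

Any task that suspects trouble could signal the supervisor to run supervisorPoll() immediately rather than waiting for the next interval; the counter still guards against a reboot on a single missed response.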

RobMeades avatar Aug 17 '22 14:08 RobMeades

I get this stuff as it rolls past, but you need tiers, where you determine different levels of connectivity and viability. You understand what tiers are working, which are not. It's like an onion, build the logic with some realistic expectations it's not going to work some of the time. If the network is down no amount of endless retries and restarts at the user/application level is going to fix it. You need to be able to stop and defer to a later time. If the data's critical, store it. Come back in an hour for the time. Use the SMS channel if there's no viable data path. If your primary server isn't working you fall back to alternates, or ones hosted by different third parties. You have redundancy, you have awareness. The carriers really hate devices that constantly restart and misbehave on the network. It's a good way to get blacklisted, or your device certification audited or revoked.
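One possible way to express the tiers and the "stop and defer to a later time" advice in code is sketched below, in plain C. The tier names, highestWorkingTier() and nextRetryDelaySeconds() are all hypothetical, invented for this illustration; the doubling back-off with a cap is an assumption, not something the thread prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Layers of the "onion", innermost first
typedef enum {
    TIER_NONE = 0, // module itself is not responding
    TIER_MODULE,   // module responds but is not registered on a network
    TIER_NETWORK,  // registered, but no data path
    TIER_DATA,     // data path up, but the server is unreachable
    TIER_SERVER    // everything works end to end
} tier_t;

// Report the outermost tier that is currently working, so the application
// knows which layer to fix (or wait on) rather than restarting everything.
static tier_t highestWorkingTier(bool moduleResponds, bool registered,
                                 bool dataPathUp, bool serverReachable)
{
    if (!moduleResponds) {
        return TIER_NONE;
    }
    if (!registered) {
        return TIER_MODULE;
    }
    if (!dataPathUp) {
        return TIER_NETWORK;
    }
    if (!serverReachable) {
        return TIER_DATA;
    }
    return TIER_SERVER;
}

// Delay before the next attempt: doubles with each consecutive failure,
// never exceeding maxSeconds.
static uint32_t nextRetryDelaySeconds(size_t consecutiveFailures,
                                      uint32_t baseSeconds,
                                      uint32_t maxSeconds)
{
    uint32_t delay = baseSeconds;
    for (size_t x = 0; (x < consecutiveFailures) && (delay < maxSeconds); x++) {
        // Compare against half the cap first to avoid overflow when doubling
        delay = (delay > maxSeconds / 2) ? maxSeconds : delay * 2;
    }
    return (delay > maxSeconds) ? maxSeconds : delay;
}
```

Starting from a base of 60 seconds with a cap of one hour, the delays grow 60, 120, 240, ... and then stay at 3600; while waiting, the application would store its data and could try an alternate server or channel for the tier that failed.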

cturvey avatar Aug 17 '22 17:08 cturvey

Hey Rob, here is an example of the problem: if the NTP time cannot be fetched, the module no longer reacts correctly.

What could it be? log2.txt

alexmaron81 avatar Aug 19 '22 09:08 alexmaron81

Hi there: I guess you are talking about this line:

1660827761.343956,No reply received!

The UDP packet to get the NTP time was sent here:

1660827751.307029,@[1b][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00]

...so I guess from the time-stamp that you waited 10 seconds for the reply, which may be fine: this is UDP after all, the response either arrives in a reasonable time-frame or it has been lost, which seems to be the case here.

On your next attempt, you reboot SARA-R4 before trying again, but the trace following that looks a bit strange:

1660827852.912474,"U_SOCK: socket created, descriptor 1, network handle 0x3ffbb2b8, socket handle 1."
1660827852.916488,"U_SOCK: connecting socket to ""82.219.4.30:123""..."
1660827852.995228,"AT+USOCO=0,""82.219.4.30"",123"
1660827853.0313306,
1660827853.03301,OK
1660827853.0487957,"U_SOCK: socket with descriptor 1, network handle 0x3ffbb2b8, socket handle 1, is  connected to address ""82.219.4.30:123""."
1660827853.0508997,Sending data...
1660827853.0536003,Sent 48 byte(s) to echo server.
1660827853.0648353,
1660827853.0665655,No reply received!
1660827853.0682511,Closing socket...
1660827853.1326637,AT+USOCL=0
1660827853.1712759,
1660827853.1720967,OK

There's a debug print that says Sending data... but no actual data is sent: there seems to be no call to uSockSendTo(), or maybe there is but with zero-length data, and that pattern repeats for the rest of the log. Worth taking a look at the logic around that.
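For reference, the request seen in the first trace is the usual 48-byte NTP client packet whose first byte is 0x1b. A minimal sketch of building it, and of extracting the time from a reply, is given below in plain C; ntpBuildRequest() and ntpParseUnixTime() are hypothetical helper names, not ubxlib functions and not taken from getNTP.txt. Returning the length from the builder makes it easy for the caller to check that it is non-zero before handing the buffer to the send function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NTP_PACKET_SIZE 48
// Seconds between the NTP epoch (1900) and the Unix epoch (1970)
#define NTP_UNIX_EPOCH_OFFSET 2208988800UL

// Fill pBuffer with an NTP client request; returns the number of bytes to
// send, or 0 if the buffer is unusable (in which case nothing should be sent).
static size_t ntpBuildRequest(uint8_t *pBuffer, size_t size)
{
    if ((pBuffer == NULL) || (size < NTP_PACKET_SIZE)) {
        return 0;
    }
    memset(pBuffer, 0, NTP_PACKET_SIZE);
    pBuffer[0] = 0x1b; // LI = 0, version = 3, mode = 3 (client)
    return NTP_PACKET_SIZE;
}

// Extract the whole seconds of the transmit timestamp (bytes 40 to 43,
// big-endian) from a reply and convert to Unix time.
// Returns 0 on success, negative if the reply is too short or implausible.
static int ntpParseUnixTime(const uint8_t *pReply, size_t length,
                            uint32_t *pUnixSeconds)
{
    uint32_t ntpSeconds;
    assert(pUnixSeconds != NULL);
    if ((pReply == NULL) || (length < NTP_PACKET_SIZE)) {
        return -1;
    }
    ntpSeconds = ((uint32_t) pReply[40] << 24) | ((uint32_t) pReply[41] << 16) |
                 ((uint32_t) pReply[42] << 8) | (uint32_t) pReply[43];
    if (ntpSeconds < NTP_UNIX_EPOCH_OFFSET) {
        return -1; // before 1970: not a believable server time
    }
    *pUnixSeconds = (uint32_t) (ntpSeconds - NTP_UNIX_EPOCH_OFFSET);
    return 0;
}
```

The application would only print "Sending data..." and call its send function when ntpBuildRequest() has returned a non-zero length, and would treat a reply shorter than 48 bytes the same as no reply.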

More generally, I would try re-sending the UDP packet before doing a reboot: as @cturvey says, some networks, especially NB1/Cat-M1 networks, have policies in place which will bar your device if it powers up and down too often. Similarly, your application's behaviour seems, from the log, to be quite "uppy-downy", if you see what I mean: discrete blocks of "do this please SARA-R4 and then switch off again". While this will save power, it may fall foul of the network's radio policy; networks prefer modules to go into a power saving mode of some form rather than incur the signalling cost.

RobMeades avatar Aug 19 '22 10:08 RobMeades

Having a bit of a clean-out of the issues list: did we resolve this one @alexmaron81?

RobMeades avatar Oct 05 '22 09:10 RobMeades

Closing this one: please re-open if necessary.

RobMeades avatar Mar 02 '23 22:03 RobMeades