lofence RN2483 send timeout?

What firmware of the RN2483 are you running, 1.0.5 or an earlier version? And what settings did you use on TTN side for the device?

I'm having some issues where the RN2483 sometimes stops responding to a TX command. It runs multiple rounds OK, and then suddenly it acknowledges the command to send an unconfirmed message, but ~1 hour later there is still no further response on the UART, seemingly from the RN side (I haven't tapped the UART directly at the RN in parallel yet, will try that). The RN2483 docs mention a possible timeout response, so does it perhaps need to be set explicitly?

example:

Sleeping for 4320 seconds

Round: 3
Measuring
Measuring battery: 4684 mV
Measuring fence positive: 0 V
Measuring fence negative: 0 V

Transmitting
RN2483 wakeup and baud change
RN2483 RX clearing
RN2483 TX: mac join abp
RN2483 RX: ok
RN2483 RX: accepted
RN2483 TX: mac get upctr
RN2483 RX: 25
RN2483 TX: mac set pwridx 1
RN2483 RX: ok
RN2483 TX: mac tx uncnf 1 0003124C00000000
RN2483 RX: ok
RN2483 RX: mac_tx_ok
RN2483 TX: mac save
RN2483 RX: ok
RN2483 TX: sys sleep 86400000

Sleeping for 4320 seconds

Round: 4
Measuring
Measuring battery: 4658 mV
Measuring fence positive: 0 V
Measuring fence negative: 0 V

Transmitting
RN2483 wakeup and baud change
RN2483 RX clearing
RN2483 TX: mac join abp
RN2483 RX: ok
RN2483 RX: accepted
RN2483 TX: mac get upctr
RN2483 RX: 26
RN2483 TX: mac set pwridx 1
RN2483 RX: ok
RN2483 TX: mac tx uncnf 1 0004123200000000
RN2483 RX: ok

/edit: found some other references to this https://www.thethingsnetwork.org/forum/t/rn2483-timing/36278 https://www.microchip.com/forums/m1101599.aspx

So for now I've put a (naive) 5 second timeout in the rn2483_rx() function to cope with this. Will let it run again overnight and see how it behaves

May 31 '21 14:05 thomasdupas

Hmmm. Unfortunately I haven't seen this problem before. I have 10 devices running and I didn't observe a device being stuck. Or, to be more precise, I didn't see it while testing. The devices are in the field and maintained by a friend. I wouldn't notice if he reboots devices sometimes.

I am pretty sure that all my devices run RN2483 firmware 1.0.5.

Adding the timeout is a good idea anyways. Please provide feedback on your experience.

May 31 '21 16:05 kiu

Actually I can see resets as I am monitoring them. But I don't know if they were intended (device was moved / battery changed) or because of a lock up.

Each spike is a reset on a device.

May 31 '21 16:05 kiu

For now it's going steady again, perhaps too soon to tell. But I guess the timeout doesn't hurt anyhow.

I saw a prior ticket with some potential changes ( https://github.com/kiu/lofence/issues/7 ) which didn't go through yet since kicad 6 isn't released yet. I'll probably hop on that wagon as well, since updating the keys and creating a custom hex file gets error-prone very quickly.

I'll probably create some pyserial wrapper around a json/config file which has the hweui's and TTN keys. Then it can do the RN<->TTN config mapping by itself and every AVR can have the same firmware (PS, why the EEPROM save every time, it's only altering those settings once, not concerned about EEPROM wear?).

Didn't use lorawan before, so have to read up on OTAA vs ABP and see what that entails possibly

Jun 01 '21 14:06 thomasdupas

(PS, why the EEPROM save every time, it's only altering those settings once, not concerned about EEPROM wear?).

I believe - it's been a while - that the frame counter is only persisted on a "mac save". As I have enabled frame counter checks for the devices in the TTN console, I need to ensure that the value is not lost. Otherwise the RN2483 will start with frame counter 0 after a reboot and the messages are discarded. There is probably no need to enable frame counters (replay attacks aren't a strong vector in my scenario).

Didn't use lorawan before, so have to read up on OTAA vs ABP and see what that entails possibly

I didn't use OTAA for my devices, as there is really bad coverage in the area I am setting them up. OTAA - not on every message though - requires additional packets being transmitted to join the network and there would be a higher chance of packet loss.

Jun 01 '21 14:06 kiu

I've been following this issue, and since you mentioned #6, here is my comment. I wanted to switch to a ready-to-go lib for the RN module, but as it goes I didn't have the time, so I just changed the code to do what I need and what I thought was right: burning the keys to the RN module with an extra script before programming the ATmega, and then using soft ABP joining. I recently switched to an OTAA rejoin whenever the device gets reset/restarted, because my LoRa gateway sometimes had trouble with the fCnt of packets; re-joining resets that counter and starts over... This is how my other LoRa sensors do it, and as the devices have a very long uptime I don't see an issue there anymore. Initially I thought re-joining should not occur on every restart, but what I really meant was not too often; if a device is up for a year and then the battery has to be changed, that is not too often for a re-join...

One or two times since I use the devices I had a hanging RN module, the device didn't post updates, I opened it and the transmit LED lit continually but I wasn't able to debug and get the cause of it. A restart solved it... I thought of implementing a watchdog, but I have no idea if that is even possible with the ATmega chip...

Jun 01 '21 15:06 Alex9779

@Alex9779 "One or two times since I use the devices I had a hanging RN module, the device didn't post updates, I opened it and the transmit LED lit continually but I wasn't able to debug and get the cause of it. A restart solved it..." That's exactly what I had yesterday / see the start of this issue ticket.

What I changed (apologies for the ugly code, this is outside my comfort zone ;-)) to have a ~5 second timeout

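void rn2483_rx() {
    char nc = 0x00;
    uint8_t len = 0;

    // erase buffer_rn to make sure it doesn't contain data from a prior read
    memset(buffer_rn, 0, 255);

    // do at most 5000 loops, each idle one waiting 1 ms: a naive ~5 second timeout.
    // stop at 254 bytes so the terminator still fits in the 255 bytes cleared above
    // (assumes buffer_rn holds at least 255 bytes)
    for (int i = 0; i < 5000 && len < 254; i++) {
        if (USART_0_is_rx_ready()) {
            nc = USART_0_read();
            buffer_rn[len] = nc;
            len++;
            if (nc == '\n') {
                break;
            }
        } else {
            _delay_ms(1);
        }
    }
    // len already points one past the last byte read, so terminate right there
    buffer_rn[len] = '\0';

    #ifdef DEBUG
    debug("RN2483 RX: ");
    debug(buffer_rn);
    #endif
}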

Jun 01 '21 16:06 thomasdupas

Well, it is hard to test. Maybe have a look at that lib and see if they have implemented some kind of watchdog: https://github.com/jpmeijers/RN2483-Arduino-Library

Jun 01 '21 17:06 Alex9779

No watchdog at first glance, but there are timeouts on the serial reads in many places, like

        _serial.setTimeout(30000);
        receivedData = _serial.readStringUntil('\n');
        _serial.setTimeout(2000);

Jun 01 '21 17:06 thomasdupas

Yeah well, as already said I have it running in a way it suits me... If I wanted to sell those devices then yes, the firmware would need a rework, but for now I don't need it. A hang rarely happens, most devices didn't have a single hang until the battery ran out so...

Jun 01 '21 17:06 Alex9779

It is my understanding that once the module hangs, it won't recover later on. Can you confirm that? If that is the case, you would need to reset the module in case your timeout is triggered: rn2483_init();, e.g. in the error routine: https://github.com/kiu/lofence/blob/master/rev_b/firmware/LoFence/main.c#L346
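A minimal sketch of how the receive path could report a timeout to the caller, so that the caller can decide to reset the module. All names here (rn_rx_t, rn_rx_begin, rn_rx_feed, rn_rx_tick, RN_BUF_SIZE) are hypothetical and not part of the LoFence firmware; the logic is kept free of hardware calls so it can be tried on a PC.

```c
#include <assert.h>
#include <stdint.h>

#define RN_BUF_SIZE 255  /* assumption: same size as the memset in rn2483_rx() */

typedef enum { RN_RX_PENDING, RN_RX_LINE, RN_RX_TIMEOUT } rn_rx_status_t;

typedef struct {
    char     buf[RN_BUF_SIZE];
    uint8_t  len;         /* bytes stored so far, always < RN_BUF_SIZE */
    uint16_t idle_ms;     /* milliseconds waited without receiving a byte */
    uint16_t timeout_ms;  /* give up once idle_ms reaches this value */
} rn_rx_t;

/* Start waiting for a new response line. */
void rn_rx_begin(rn_rx_t *rx, uint16_t timeout_ms)
{
    assert(rx != 0);
    rx->len = 0;
    rx->buf[0] = '\0';
    rx->idle_ms = 0;
    rx->timeout_ms = timeout_ms;
}

/* Call when the UART delivered byte c. */
rn_rx_status_t rn_rx_feed(rn_rx_t *rx, char c)
{
    if (rx->len < RN_BUF_SIZE - 1) {   /* keep room for the terminator */
        rx->buf[rx->len++] = c;
        rx->buf[rx->len] = '\0';
    }
    return (c == '\n') ? RN_RX_LINE : RN_RX_PENDING;
}

/* Call after waiting 1 ms with no byte available. */
rn_rx_status_t rn_rx_tick(rn_rx_t *rx)
{
    if (rx->idle_ms < rx->timeout_ms) {
        rx->idle_ms++;
    }
    return (rx->idle_ms >= rx->timeout_ms) ? RN_RX_TIMEOUT : RN_RX_PENDING;
}
```

In the firmware this would mean calling rn_rx_feed() whenever USART_0_is_rx_ready() is true, otherwise _delay_ms(1) followed by rn_rx_tick(), and calling rn2483_init() once RN_RX_TIMEOUT comes back.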

It is also possible to integrate the hardware watchdog into the system, but be aware that the current code immediately starts to send a packet through TTN after startup. A device with problems that keeps restarting through the watchdog could violate the fair use policy of TTN. The RN2483 itself covers any regulatory violations, but who knows what the module will actually do if it hangs/crashes. Probably the best solution for a watchdog would be to go to sleep after startup instead of starting with a measurement, to ensure proper delays in case the device goes rogue.
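The "sleep first" idea could be sketched roughly like this. It is a variation that only delays after a watchdog reset; always returning the full interval would be the simpler version of the same idea. RESET_CAUSE_WATCHDOG is a hypothetical bit mask (the real reset-cause register and flag depend on the AVR variant), and startup_sleep_seconds is an invented helper name.

```c
#include <stdint.h>

/* Hypothetical bit mask meaning "the last reset came from the watchdog".
 * The real register and flag name depend on the AVR variant in use. */
#define RESET_CAUSE_WATCHDOG 0x08u

/* Regular sleep interval, taken from the "Sleeping for 4320 seconds" log. */
#define SLEEP_INTERVAL_S 4320u

/* Seconds to sleep before the first measurement/transmission after a reset.
 * A watchdog reset always waits one full interval, so a device that keeps
 * crashing cannot transmit more often than a healthy one. */
uint32_t startup_sleep_seconds(uint8_t reset_flags)
{
    if (reset_flags & RESET_CAUSE_WATCHDOG) {
        return SLEEP_INTERVAL_S;
    }
    return 0;  /* power-on or manual reset: start measuring right away */
}
```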

Jun 01 '21 17:06 kiu

As far as I can tell from other LoRa devices, TTN fair use is one thing, local regulations are another, and what a device actually does is yet another. I don't use TTN, because for my fence monitoring the fair use allowance is not enough for me to get alerts in time. I have my own gateway on Chirpstack. TTN just ignores packets above the fair use limit, you won't get them in your app. What happens if your device constantly breaks local restrictions I can't tell... I think it depends: if you really interfere with something important, then they come and analyse. My approved devices, for example, try to join every 30 seconds after a reboot, I think, but I have no idea for how long if they don't get an answer. That is well above what TTN allows. I agree that if the module hangs and you have to reset it, you also reset the regulation restrictions, and that way you can break them and send more packets than allowed. And as far as I know the approved devices have some kind of timeout: they try, but stop after some time and require a manual reset to try again, so as not to overwhelm the band...

Jun 01 '21 17:06 Alex9779

Just documenting it here for future reference. I tried leaving out the EEPROM save command, but it is indeed needed to keep track of the frame counter. It confused me that the docs said it was needed after mac set <xyz> commands (and there is an explicit downlink/uplink frame counter set command), but I guess the module implicitly bumps the counter on every transmission, hence the need for the save. Tested below: the last 2 cycles were run with firmware without the mac save command (and a reset in between).

As for the RN reset / hang, very possible indeed, but hard to provoke to be certain. For testing purposes I'll leave it out for now and see what happens.

Jun 01 '21 17:06 thomasdupas

I used that mac save in my firmware, but only every 100 uplinks to save the EEPROM from wearing out. On a fresh start I increased the counter by 100, so that it ends up past any count the gateway might have seen in between... But I changed to a rejoin on a reset, so no need to save for now...
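The save-every-100 scheme can be sketched like this. The function names and SAVE_INTERVAL are illustrative, not taken from anyone's firmware, and the sketch ignores counter wrap-around.

```c
#include <stdbool.h>
#include <stdint.h>

#define SAVE_INTERVAL 100u  /* persist the counter only every 100 uplinks */

/* True when the current uplink counter should be written with "mac save". */
bool should_save_counter(uint32_t upctr)
{
    return (upctr % SAVE_INTERVAL) == 0;
}

/* Counter to continue from after a restart, given the last persisted value.
 * At most SAVE_INTERVAL - 1 uplinks can have gone out unsaved, so jumping
 * ahead by SAVE_INTERVAL lands past anything the network has already seen. */
uint32_t resume_counter(uint32_t saved_upctr)
{
    return saved_upctr + SAVE_INTERVAL;
}
```

For example, with 200 persisted and a reset after uplink 257, the device would continue at 300. The resumed value would need to be persisted right away; otherwise a second reset before the next save point would resume from 300 again and reuse counters.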

Jun 01 '21 17:06 Alex9779

I want to share some recent impressions on this topic. I do not use TTN but have my own gateway running ChirpstackOS. I stumbled over some special handling in the Chirpstack code regarding the RN2483 and the LoRa power index sent to the device in ADR mode. The issue seems to be that the RN2483 does not handle a power index of 0 properly. Though this seems to be taken care of in the base code of Chirpstack, I created my own ADR algorithm plugin and changed it so that power index 0 is never used, besides some other changes I wanted for my installation. Since I started using this ADR algorithm for my LoFence devices (about 1 month ago at the time of writing), I have never had a hanging device again.
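The rule "never request power index 0" boils down to a clamp. The actual plugin runs inside Chirpstack and is not shown in this thread, so the following C fragment only illustrates the rule; clamp_tx_power_index and MIN_TX_POWER_INDEX are made-up names.

```c
#include <stdint.h>

/* Lowest TX power index the network is allowed to request (assumption,
 * based on the observation that index 0 caused trouble). */
#define MIN_TX_POWER_INDEX 1u

/* Returns the power index to actually send in the ADR request. */
uint8_t clamp_tx_power_index(uint8_t requested)
{
    return (requested < MIN_TX_POWER_INDEX) ? MIN_TX_POWER_INDEX : requested;
}
```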

Aug 25 '21 11:08 Alex9779

Here is the code in Chirpstack I referred to: https://github.com/brocaar/chirpstack-network-server/blob/3971570b77c79c1cfd184b6f06a4f1770b5a0db0/internal/maccommand/link_adr.go#L119

Aug 25 '21 11:08 Alex9779

Oh, and yes, I know firmware 1.0.3 is mentioned there; I checked one device and it has 1.0.5 on it. But since there was a problem in 1.0.3, I have no idea whether the fix in 1.0.5 works properly or has some side effects (maybe it accepts the command but can crash the device under certain circumstances). I will monitor and see how long the system is stable with the new ADR for the RN2483 and report if I see a hanging device again...

Aug 25 '21 11:08 Alex9779

About a month passed and system still stable, no hanging devices after changing to my ADR algorithm. I had one device having an issue but it turned out the crystal had a connection issue and the clock didn't get a signal so the device didn't wake up after initial join and first measurement. I changed batteries of all six devices yesterday, all joined without issues and sent ~200 updates...

Sep 20 '21 07:09 Alex9779
