LoRaMac-node icon indicating copy to clipboard operation
LoRaMac-node copied to clipboard

Stack might get stuck when dutycycle restricted and nbtrans > 1

Open msdade opened this issue 4 years ago • 11 comments

Hello, We have noticed that when frame NbTrans is greater than 1 the stack might get stuck if dutycycle restriction kicks in. When OnTxDelayedTimerEvent calls ScheduleTx it assumes that if LORAMAC_STATUS_OK or LORAMAC_STATUS_DUTYCYCLE_RESTRICTED is returned that the function succeeded.

On ScheduleTx there is if that if dutycycle is restricted and MacCtx.DutyCycleWaitTime is greater than 0 then start wait timer. If this time is 0 then it just returns LORAMAC_STATUS_OK even the frame is not transmitted or wait timer is not started. I think that the return LORAMAC_STATUS_OK should be inside the timer start if and not after it to overcome this issue.

if( status != LORAMAC_STATUS_OK )
{
  if( ( status == LORAMAC_STATUS_DUTYCYCLE_RESTRICTED ) &&
        ( allowDelayedTx == true ) )
    {
        // Allow delayed transmissions. We have to allow it in case
        // the MAC must retransmit a frame with the frame repetitions
        if( MacCtx.DutyCycleWaitTime != 0 )
        {// Send later - prepare timer
            MacCtx.MacState |= LORAMAC_TX_DELAYED;
            TimerSetValue( &MacCtx.TxDelayedTimer, MacCtx.DutyCycleWaitTime );
            TimerStart( &MacCtx.TxDelayedTimer );
            return LORAMAC_STATUS_OK;
        }
    }
    else
    {// State where the MAC cannot send a frame
        return status;
    }
}

This situation happens when RegionCommonUpdateBandTimeOff in RegionCommon.c checks band off time. There is check bands[i].TimeCredits > creditCosts which doesn't set band ready for transmission if the current time credits equals of credit cost. Should this be greater or equal and not just greater?

// Check if the band is ready for transmission. Its ready,
      // when the duty cycle is off, or the TimeCredits of the band
      // is higher than the credit costs for the transmission.
      if( ( bands[i].TimeCredits >= creditCosts ) ||
          ( ( dutyCycleEnabled == false ) && ( joined == true ) ) )
      {
          bands[i].ReadyForTransmission = true;
          // This band is a potential candidate for an
          // upcoming transmission, so increase the counter.
          validBands++;
      }

msdade avatar Jan 24 '22 07:01 msdade

We are experiencing exactly this on the field! ~5% of the devices out of ~2k installed devices get silent every day. The stack is able to receive further downlinks but not send any uplinks. I also checked that every time it gets stuck it has received NbTrans: 2 or 3 beforehand. If any further information is needed, please let me know, so that it gets solved sooner!

lff5 avatar Jan 24 '22 17:01 lff5

Hi all. That was a false alert. We couldn't lock up stack with just NbTrans and duty-cycle. Our issue was on application side. I will still comment on it here to save others some time.

We noticed that every time the stack got stuck, it had received two consequent mac commands beforehand. Sending 2x 8-bytes mac command got the transmission blocked, because max 15 bytes is allowed. So a flush was needed. Nevertheless the correlation between Thingpark backend sending double commands and NbTrans>1 is very high.

Flushing of MAC commands: https://github.com/Lora-net/LoRaMac-node/blob/master/src/apps/LoRaMac/common/LmHandler/LmHandler.c#L483

lff5 avatar Feb 01 '22 17:02 lff5

To reproduce error i unplugged gw after join and sent confirmed messages one after other. When delay kicked in it was just a matter of minutes till DutyCycleWaitTime was 0 and whole thing just got stuck.

I kind of "fixed" by doing this...

        // Allow delayed transmissions. We have to allow it in case
        // the MAC must retransmit a frame with the frame repetitions
        if( MacCtx.DutyCycleWaitTime != 0 )
        {// Send later - prepare timer
            MacCtx.MacState |= LORAMAC_TX_DELAYED;
            TimerSetValue( &MacCtx.TxDelayedTimer, MacCtx.DutyCycleWaitTime + 1 );
            TimerStart( &MacCtx.TxDelayedTimer );
        }
        else
        {
            MacCtx.MacState |= LORAMAC_TX_DELAYED;
            TimerSetValue( &MacCtx.TxDelayedTimer, 1000 );
            TimerStart( &MacCtx.TxDelayedTimer );
        }

Seems that now it does not go to else branch. But i left it anyway to be on the safe side.

vecais-dumais-laacis avatar Feb 18 '22 11:02 vecais-dumais-laacis

I have spent the last few days tracking this same issue. My own conclusion is that it must return an error status distinct from LORAMAC_STATUS_DUTYCYCLE_RESTRICTED in case it hits the else case of if( MacCtx.DutyCycleWaitTime != 0 ) in order for ScheduleTx to stop retransmissions. That said, I also don't think this case should be hit at all since this happens after having already waited for the duty cycle restrictions to be lifted.

I will experiment to see if I am also hitting the potential off-by one comparison of credits found by msdade. Thank you vecais-dumais-laacis, for the tip on reproduction. I will try if it works for me. Until now I have been waiting between 7 and 60 minutes for the issue to strike.

abxy avatar Feb 23 '22 09:02 abxy

In RegionCommon.c before time credits are checked in RegionCommonUpdateBandTimeOff function there is called UpdateTimeCredits function. This function updates time credits. If there is over 1ms delay between firing the OnTxDelayedTimerEvent interrupt and arriving to UpdateTimeCredits function then this problem doesn't appear because there is always at least 1 extra credit and the band is set ready for transmission. If delay between interrupt and UpdateTimeCredits is 0 then this problem should appear.

msdade avatar Feb 23 '22 09:02 msdade

I figured a different way of recreating the issue which doesn't require unplugging the gateway:

  • Set DR = DR0 (to get most time on air per telegram)
  • Set ADR = off (so we stay in DR0)
  • Set NbTrans = 2
  • Send unconfirmed messages as often as the stack permits.

Modifying the credits check to use >= instead of > seems to completely resolve the issue. Good catch @msdade!

abxy avatar Feb 23 '22 16:02 abxy

I also tried to make the stack cleanly abort the transmission and signal McpsConfirm, instead of getting stuck by making it land in the default case of the switch in OnTxDelayedTimerEvent when this error occurs but that turned out to not work as expected because it just leaves MacState == LORAMAC_IDLE without calling McpsConfirm so my application got stuck instead.

An observation I made along the way is that the handling of delayed TX seems to break the pattern of setting a flag in the timer event handler and doing the processing in LoraMacProcess. Instead it does the processing directly in OnTxDelayedTimerEvent.

abxy avatar Feb 23 '22 16:02 abxy

I have recently debugged what appears to be the same issue. If you are still struggling with the problem, check #1288 and the pull request #1287.

janakj avatar Mar 25 '22 22:03 janakj

I figured a different way of recreating the issue which doesn't require unplugging the gateway:

  • Set DR = DR0 (to get most time on air per telegram)
  • Set ADR = off (so we stay in DR0)
  • Set NbTrans = 2
  • Send unconfirmed messages as often as the stack permits.

Modifying the credits check to use >= instead of > seems to completely resolve the issue. Good catch @msdade!

Hi @abxy ,

could you please provide details about which part you modified exactly (>= instead of >)?

djaeckle avatar May 11 '22 14:05 djaeckle

Hi @abxy ,

could you please provide details about which part you modified exactly (>= instead of >)?

@djaeckle This is how i patched it:

diff --git a/src/mac/region/RegionCommon.c b/src/mac/region/RegionCommon.c
index 36c3242a..8e71d6e3 100644
--- a/src/mac/region/RegionCommon.c
+++ b/src/mac/region/RegionCommon.c
@@ -336,7 +336,7 @@ TimerTime_t RegionCommonUpdateBandTimeOff( bool joined, Band_t* bands,
         // Check if the band is ready for transmission. Its ready,
         // when the duty cycle is off, or the TimeCredits of the band
         // is higher than the credit costs for the transmission.
-        if( ( bands[i].TimeCredits > creditCosts ) ||
+        if( ( bands[i].TimeCredits >= creditCosts ) ||
             ( ( dutyCycleEnabled == false ) && ( joined == true ) ) )
         {
             bands[i].ReadyForTransmission = true;
@@ -350,7 +350,7 @@ TimerTime_t RegionCommonUpdateBandTimeOff( bool joined, Band_t* bands,
             // for the next transmission.
             bands[i].ReadyForTransmission = false;
 
-            if( bands[i].MaxTimeCredits > creditCosts )
+            if( bands[i].MaxTimeCredits >= creditCosts )
             {
                 // The band can only be taken into account, if the maximum credits
                 // of the band are higher than the credit costs.

abxy avatar May 11 '22 15:05 abxy

Thanks for the patch. I will take a look at it.

djaeckle avatar May 12 '22 07:05 djaeckle