arduino-lorawan icon indicating copy to clipboard operation
arduino-lorawan copied to clipboard

Kernel panic on Arduino v2.0

Open olicooper opened this issue 4 years ago • 21 comments

The library worked fine on Arduino 1.0.6, but after upgrading to v2.0.0 the device reboots on the next loop cycle after calling Arduino_LoRaWAN::SendBuffer(). I have a Lora wrapper class which simply calls Arduino_LoRaWAN::loop() - it is failing at this point.

environment:

  • Version 0.9.1
  • Ardunio v2.0.0 (I also tried the mem-optimized branch)
  • PlatformIO on Windows 10
  • Heltec Wireless Stick v1.2

code

void Lora::NetBeginRegionInit() {
    this->Super::NetBeginRegionInit();
}

void Lora::NetJoin() {
    this->Super::NetJoin();
    SetLinkCheckMode(true);
    LMIC_setPingable(4);
}

void Lora::NetTxComplete() {
    this->Super::NetTxComplete();
}

bool Lora::GetOtaaProvisioningInfo(OtaaProvisioningInfo *pInfo) {
    if (pInfo != nullptr) {
        memcpy_P(pInfo->AppKey, APPKEY, sizeof(pInfo->AppKey));
        memcpy_P(pInfo->DevEUI, DEVEUI, sizeof(pInfo->DevEUI));
        memcpy_P(pInfo->AppEUI, APPEUI, sizeof(pInfo->AppEUI));
    }
    return true;
}

void Lora::loop() {
    if (!enabled) return;
    Super::loop(); // <-- Failing here
}
  • My application flow is:
// these are called in setup
begin(&pinMap);
os_setCallback(&basicStatePublishJob, processLMICCallback);

// ... then 'basicStatePublishJob' calls SendBuffer
SendBuffer(dataBuffer, dataBuffer_s, pDoneFn, (void *)this, /*confirmed*/ 0, /*port*/ 1);

core dump

Guru Meditation Error: Core  1 panic'ed (LoadStoreError). Exception was unhandled.

Core  1 register dump:
PC      : 0x400f4ed7  PS      : 0x00060030  A0      : 0x800f4f55  A1      : 0x3ffce000
A2      : 0x4002f56e  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x0013a16a  
A6      : 0xfffffffc  A7      : 0x00000003  A8      : 0x00000000  A9      : 0x3ffc7c18
A10     : 0x00000010  A11     : 0x00000016  A12     : 0x00000000  A13     : 0x00040df7  
A14     : 0x63ffffff  A15     : 0x00000003  SAR     : 0x0000001b  EXCCAUSE: 0x00000003
EXCVADDR: 0x4002f56e  LBEG    : 0x400f4e88  LEND    : 0x400f4ee6  LCOUNT  : 0x00000004  

0x400f4ed4: Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369
 (inlined by) ?? at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:359
 (inlined by) Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:114
0x400f4f52: Arduino_LoRaWAN::SaveSessionState() at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:190
0x400f4b0b: Arduino_LoRaWAN::StandardEventProcessor(unsigned int) at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:925
 (inlined by) Arduino_LoRaWAN::StandardEventProcessor(unsigned int) at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_begin.cpp:214
0x400f4be2: Arduino_LoRaWAN::DispatchEvent(unsigned int) at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_begin.cpp:123
0x400f4c24: onEvent at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_begin.cpp:109
0x400f07fd: reportEventNoUpdate at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/lmic.c:481
0x400f1d3d: reportEventAndUpdate at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/lmic.c:466
0x400f203a: startJoining at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/lmic.c:2142
0x400f308c: os_runloop_once at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/oslmic.c:157
0x400f4d6b: Arduino_LoRaWAN::loop() at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_loop.cpp:42
0x400d5c03: Lora::loop() at F:\myproject/src/lora.cpp:156
0x400d680e: loop() at F:\myproject/src/main.cpp:196
0x400fb8c4: loopTask(void*) at C:/Users/Oliver/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:46

Related source code

https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/lib/arduino_lorawan_sessionstate.cpp#L190

https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/lib/arduino_lorawan_sessionstate.cpp#L114

https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/Arduino_LoRaWAN.h#L359-L371

olicooper avatar Oct 15 '21 19:10 olicooper

@olicooper sorry for the trouble. Thanks for the detailed report. Must be an portability problem revealed by the new compiler. The code looks correct, so you might have found a compiler bug. If you can get the assembly-language instructions preceding the fault, it will help with diagnosis. I don't have a test platform, unfortunately.

terrillmoore avatar Oct 15 '21 20:10 terrillmoore

Thanks for the quick response. I will have to look in to how I can the the ASL for you because I don't have a debugging facility for the device and I haven't done it before. I created a simple platformio project earlier today which shows how I got Arduino v2.0.0 working with PlatformIO which might help you. The readme has detail on the compiler versions etc.

olicooper avatar Oct 15 '21 20:10 olicooper

We haven't tested on Arduino v2.0.0, and I unfortunately don't even use PlatformIO, though of course we support it -- it doesn't have support for our BSPs. And.. this is the first I've heard of Arduino 2.0. So I'm really out in the weeds and not able to invest much time short term.

You can possibly use objdump on the file arduino_lorawan_sessionstate.o to get the disassembly. But it depends on the tool chain. You'll have to research the command line switches on your version. objdump will be with gcc. There is an option to disassmble, and to include line numbers, and possibly even to include source code. The .o file will be somewhere in your build directory.

terrillmoore avatar Oct 15 '21 20:10 terrillmoore

I see this is Arduino v2 -- which is still beta. That's good, because if you can identify a compiler issue, they can back up to a working release or try to get a fix. Since freq is a uint8_t &, the compiler should probably not be making assumptions about alignment; but from the fault is clearly is. (It's doing a store-halfword to an odd addresss, almost certainly.)

This happens from time to time; the compiler writers find a new trick that ought to work, is arguably portable; and then we find out where that breaks existing code. There is in fact a union involved in this part of the code, and it's possible that this is confusing the compiler or me into thinking that an operation is safe. No pointers involved so I think the compiler has dropped a stitch, but it's possible I've once again been too clever by half.

Were there no templates involved, I would say that you should try splitting out the inline (since you're on an ESP32, there's no real need). But... there's a template... no wonder the compiler is doing the wrong thing.

But the template and channel numbers were only to do type checking, so we can refactor.

In the header file:

// outside the template, e.g. at line 275:
static bool setFrequencyRaw(uint8_t *pFreq, unsigned iCh, uint32t_t frequency);

Then change setFrequency:

bool setFrequency(uint8_t (&freq)[nCh * 3], unsigned iCh, uint32_t frequency)
    {
    // check the template parameter while it's in scope.
    if (iCh > nCh)
           return false;

    // call an external function to make it harder for optimizer to do the wrong thing (for test)
    return Arduino_LoRaWAN::setFrequencyHelper(freq, ich, frequency);
    }

And then in a separate C++ file:

// this is the former body of setFrequency(), with &freq changed to pFreq
/* static */
bool Arduino_LoRaWAN::setFrequencyRaw(uint8_t *pFreq, unsigned iCh, uint32_t frequency)
                        {
                        const uint32_t reducedFreq = frequency / 100;
                        if (reducedFreq > 0xFFFFFFu)
                                return false;

                        auto const chPtr = pFreq + iCh * 3;
                        chPtr[0] = uint8_t(reducedFreq >> 16);
                        chPtr[1] = uint8_t(reducedFreq >> 8);
                        chPtr[2] = uint8_t(reducedFreq);
                        }

If this fixes the problem, it may make it easier for the compiler people to either tell me what I've gotten wrong, or for them to figure out what has gone wrong.

terrillmoore avatar Oct 15 '21 21:10 terrillmoore

Sorry found an error in review; moved an if from the external function to the inline.

terrillmoore avatar Oct 15 '21 21:10 terrillmoore

I've managed to generate some assembly code for you but there are too many lines to strip sensitive info so I have emailed them to you instead. I hope that is okay?

How to get assembly code

For community reference, I was able to use Espressif's SDK to get the assembly code. It looks like it outputs the source code too. There are multiple ways of doing it but I used the following methods.

PlatformIO (the one I used)

  1. Add build flags to your [env:<env_name>] in platformio.ini
    build_flags = 
           -save-temps
           -fverbose-asm
    
  2. Clean and build as usual! You'll see a tonne of .s/.i/*.ii files in the project root directory.. I moved them with a PS command like mv *.i,*.ii,*.s .\disassembly\

Command line

Either run the command on the firmware.elf or on an individual object file (but this won't include all code after linking?)..

  1. <espressif_sdk_dir>\toolchain-xtensa-esp32\bin\xtensa-esp32-elf-objdump.exe -S -d "<project_build_dir>\MCCI Arduino LoRaWAN Library\lib\arduino_lorawan_sessionstate.cpp.o" > disassembly.txt

olicooper avatar Oct 17 '21 14:10 olicooper

@olicooper email is fine, but it hasn't come yet. Checked spam filters also, nothing there.

terrillmoore avatar Oct 17 '21 16:10 terrillmoore

Sorry about that, I think gmail is causing the issue. I've resent it.

olicooper avatar Oct 17 '21 17:10 olicooper

got it.

terrillmoore avatar Oct 17 '21 17:10 terrillmoore

can you send the corresponding link map?

terrillmoore avatar Oct 17 '21 17:10 terrillmoore

The problem is not what I thought. The code for doing the store is correct. Removing the debug .loc stuff:

# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:368:                         chPtr[0] = uint8_t(reducedFreq >> 16);
	s8i	a4, a2, 0	# *chPtr_113, tmp200
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369:                         chPtr[1] = uint8_t(reducedFreq >> 8);
	s8i	a12, a2, 1	# MEM[(unsigned char *)chPtr_113 + 1B], tmp201
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:370:                         chPtr[2] = uint8_t(reducedFreq);
	s8i	a3, a2, 2	#, reducedFreq

Those are three byte writes to the pointer. So I need the link map to find exactly where things are failing.

But... could this be related to initialization and switching back and forth (described in other report)? The LMIC does not do much checking for invalid state, and since the compiled code is correct, my guess is that some preconditions we are depending on are not true.

terrillmoore avatar Oct 17 '21 17:10 terrillmoore

In my code I read from NVS storage during boot to decide if I should start the lora 'module' or not. If I have enabled the module it will start normally by calling the required Arduino_LoRaWAN::begin(&pinMap); function, then calling loop() as normal. The only other thing I do is test if there is internet connection and then start the lora module if there isn't - the idea being that it can act as a backup communication method. However, there is no runtime switching happening - I have to reboot the device to switch it on/off. I should note that the code works fine with Arduino 1.0.6. I have sent you an updated email with all output. Let me know if there is anything else you need me to do. Thank you!

olicooper avatar Oct 17 '21 17:10 olicooper

To simplify things I have created a simple reproducible example for you which can be found here on the lorawan-test branch: https://github.com/olicooper/arduino-v2-esp32-platformio/tree/lorawan-test I will also email you the elf and assembly code.

========= BOOT INFO =========

MAC: FF:FF:FF:FF:FF:FF
BootPart: offset=0x10000 size=0x140000
IDF version: v4.4-dev-2313-gc69f0ec32
Free memory: 241704 bytes

=============================

NetBeginRegionInit
Publishing state
Guru Meditation Error: Core  1 panic'ed (LoadStoreError). Exception was unhandled.

Core  1 register dump:
PC      : 0x400d5e97  PS      : 0x00060330  A0      : 0x800d5f17  A1      : 0x3ffb25b0
A2      : 0x40013b1e  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x000028d8
A6      : 0xfffffffc  A7      : 0x00000003  A8      : 0x00000000  A9      : 0x3ffc0f94
A10     : 0x00000010  A11     : 0x00000016  A12     : 0x00000000  A13     : 0x00040df7
A14     : 0x63ffffff  A15     : 0x00000003  SAR     : 0x0000001b  EXCCAUSE: 0x00000003
EXCVADDR: 0x40013b1e  LBEG    : 0x400d5e48  LEND    : 0x400d5ea6  LCOUNT  : 0x00000004


Backtrace:0x400d5e94:0x3ffb25b00x400d5f14:0x3ffb25e0 0x400d5aeb:0x3ffb26e0 0x400d5bbe:0x3ffb2700 0x400d5c00:0x3ffb2720 0x400d1855:0x3ffb2740 0x400d2d89:0x3ffb2760 0x400d3086:0x3ffb2780 0x400d40bc:0x3ffb27a0 0x400d5d3b:0x3ffb27c0 0x400d14a9:0x3ffb27e0 0x400d15fe:0x3ffb2800 0x400d7900:0x3ffb2820
  #0  0x400d5e94:0x3ffb25b00 in Arduino_LoRaWAN::SessionChannelMask_EU_like<16u, 4u>::setFrequency(unsigned char (&) [48], unsigned int, unsigned int) at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369
      (inlined by) Arduino_LoRaWAN::SessionChannelMask_EU_like<16u, 4u>::setFrequency(unsigned char (&) [48], unsigned int, unsigned int) at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:359
      (inlined by) Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:114

olicooper avatar Oct 18 '21 13:10 olicooper

I know you said that the problem wasn't what you originally thought, but I tried the modifications you suggested above https://github.com/mcci-catena/arduino-lorawan/issues/181#issuecomment-944688542 and it fixed the issue!.. so I have updated the lmic code here https://github.com/olicooper/arduino-lorawan/tree/arduino-v2-fix and I will try the changes in my main project as I haven't fully tested it yet. Great work! What's the next step? Is it possible for a patch be written to your library to help in the interim if this is the culprit?

olicooper avatar Oct 19 '21 11:10 olicooper

Well, there's no real problem with a patch, but... it looks like a compiler issue to me, and should be brought to their attention, as it seems unlikely that this would only affect my code (out of the whole world). First, however, I can believe, especially as this is a beta project. Since the projects are all open source, it should not be necessary to do the usual heroic efforts needed to demo a compiler issue to the maintainers. (With global optimization, anyway, tomorrow they could apply the same change globally and crash the library again.)

terrillmoore avatar Oct 19 '21 16:10 terrillmoore

Hi @olicooper and @terrillmoore, I'd like to point out one thing from the original crash report:

Guru Meditation Error: Core  1 panic'ed (LoadStoreError). Exception was unhandled.

Core  1 register dump:
PC      : 0x400d5e97  PS      : 0x00060330  A0      : 0x800d5f17  A1      : 0x3ffb25b0
A2      : 0x40013b1e  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x000028d8
A6      : 0xfffffffc  A7      : 0x00000003  A8      : 0x00000000  A9      : 0x3ffc0f94
A10     : 0x00000010  A11     : 0x00000016  A12     : 0x00000000  A13     : 0x00040df7
A14     : 0x63ffffff  A15     : 0x00000003  SAR     : 0x0000001b  EXCCAUSE: 0x00000003
EXCVADDR: 0x40013b1e  LBEG    : 0x400d5e48  LEND    : 0x400d5ea6  LCOUNT  : 0x00000004

Here EXCVADDR points to the memory address which the load or store instruction has tried to access. On the ESP32, 0x40013b1e is an address inside ROM, in the middle of .text section. It looks very odd that the code of this library would be performing a load or especially store from this address. It doesn't look like an alignment issue, because even if the alignment was correct, it wouldn't make sense to access (read or write) this ROM location. To me it looks like memory corruption, where some pointer gets overwritten by a garbage value, and then the value gets dereferenced.

@olicooper since you have a reproducer for this issue, could you please upload the ELF file without the workaround and the matching crash dump? I am sorry that we aren't able to reproduce based on your example, because it requires PlatformIO and seemingly a LoRa module...

igrr avatar Oct 20 '21 08:10 igrr

@igrr I mentioned on the other ticket that this should be the correct ELF, but I will recompile the elf again against the Arduino main branch to be sure. You are correct that you'd need a LoRa module to execute this code which does make it more challenging. The odd thing about it is that I can compile the exact same code on Arduino v1.0.6 and it works perfectly. Switching to v2.0.0 (and subsequently IDF 4.3) cases the crash.

olicooper avatar Oct 20 '21 09:10 olicooper

Thanks, I have checked the ELF you posted, and like @terrillmoore has observed here I don't see anything wrong with code generation, at least as far as alignment is concerned.

The PC at the point of the exception points to:

   0x400d5e97 <+319>:	s8i	a4, a2, 0

which is a store instruction, which should most definitely not be trying to write to an area in ROM. I have tried tracing where the corrupted value could come from by looking at the assembly code, but didn't get too far unfortunately.

A compiler bug is a rare occasion but if there is a strong hint that one is present we will definitely investigate this. However, at the moment it looks more like heap corruption. Upgrading from one Arduino-esp32 and IDF release to another can change many things, including the layout of objects in heap. This might cause, for example, an out-of-bounds write to show up in such a way, while previously it could be corrupting some padding space or a mostly-unused variable.

igrr avatar Oct 20 '21 11:10 igrr

Thanks for posting the two ELF files, I will definitely look more into the differences between them.

igrr avatar Oct 20 '21 11:10 igrr

@terrillmoore Quick update, I have tested against an the v4.0.2 release of PlatformIO today. It still seems to be broken, but I get an assert issue instead of a kernel panic now.. https://github.com/espressif/arduino-esp32/issues/5783#issuecomment-1116121168

The assert that is failing is this: https://github.com/mcci-catena/arduino-lmic/blob/8d378ea410887d3fb08ea2c9acce95cc4c047788/src/lmic/radio.c#L1164

olicooper avatar May 03 '22 14:05 olicooper

The assert issue is due to calling radio_rand1() before the buffer is initialized. Not sure why that might be happening.

terrillmoore avatar May 03 '22 18:05 terrillmoore