
Sayma AMC: Exception(LoadFault) caught when writing JESD-related config

Open HarryMakes opened this issue 3 years ago • 4 comments

(Note: this is a preliminary bug report, information is incomplete)

Bug Report

One-Line Summary

With commit c940f104f16286ae643ef59f38a20d59bde9a239 plus a WIP fix for DRTIO (to be disclosed later), the CPU on Sayma AMC always panics, apparently in board_misoc::config::append_at(), when no data is stored for the config key sysref_ddmtd_phase_fpga.

Issue Details

Steps to Reproduce

(to be elaborated)

  1. Make sure to erase the entire flash on Sayma AMC.
  2. Reflash the ARTIQ-7 dev gateware and firmware to Sayma AMC, and reboot.

Expected Behavior

The calibration config values should be written properly to the SPI flash without panic.

Actual (undesired) Behavior

The firmware panics with the following serial log (NB the exception and the PC value where it is caught are always the same for the same piece of firmware):

Loading slave FPGA gateware...
  magic: 0x5352544d, length: 0x0016e760
  ...done
Booting from flash...
Starting firmware.
[     0.000008s]  INFO(satman): ARTIQ satellite manager starting...
[     0.005192s]  INFO(satman): software ident 7.unknown.beta;satellite.pattern
[     0.012148s]  INFO(satman): gateware ident 7.unknown.beta;satellite.pattern
[     0.271707s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     2.176172s]  INFO(board_artiq::si5324):   ...locked
[     2.252095s]  INFO(satman::repeater): [REP#0] link RX became up, pinging
[     2.294086s]  INFO(satman): uplink is up, switching to recovered clock
[     2.327013s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     3.934913s]  INFO(board_artiq::si5324):   ...locked
[     6.324969s]  INFO(board_artiq::si5324::siphaser): calibration successful, lead: 29, width: 432 (347deg)
[     8.158089s]  INFO(satman::repeater): [REP#0] remote replied after 19 packets
[     8.188237s]  INFO(satman::jdcg::jdac): DAC-0 initializing...
[     8.745984s]  INFO(satman::jdcg::jdac):   ...done initializing
[     8.750420s]  INFO(satman::jdcg::jdac): DAC-1 initializing...
[     9.309450s]  INFO(satman::jdcg::jdac):   ...done initializing
[     9.313887s]  INFO(satman::jdcg::jesd204sync): testing DDMTD stability (raw=true, tolerance=4)...
[     9.643520s]  INFO(satman::jdcg::jesd204sync):   ...passed, peak-peak jitter: 4
[     9.649433s]  INFO(satman::jdcg::jesd204sync): testing DDMTD stability (raw=false, tolerance=1)...
[    10.138147s]  INFO(satman::jdcg::jesd204sync):   ...passed, peak-peak jitter: 1
[    10.144063s]  INFO(satman::jdcg::jesd204sync): testing HMC7043 SYSREF slip against DDMTD...
[    11.280594s]  INFO(satman::jdcg::jesd204sync):   ...passed
[    11.284685s]  INFO(satman::jdcg::jesd204sync): determining SYSREF S/H limits...
[    11.857149s]  INFO(satman::jdcg::jesd204sync):   SYSREF S/H average limits (DDMTD phases): 57 121
[    11.864633s]  INFO(satman::jdcg::jesd204sync):   SYSREF S/H maximum limit deviation: 3 3
[    11.872706s]  INFO(satman::jdcg::jesd204sync):   ...done
[    11.878016s]  INFO(satman::jdcg::jesd204sync): calibrating SYSREF DDMTD target phase...
[    11.885992s]  INFO(satman::jdcg::jesd204sync):   SYSREF calibration coarse target: 89
[    11.907735s]  INFO(satman::jdcg::jesd204sync):   ...done, target=97
@ 0x40009700
+0000: 00052583 fff40413 00450513 fe041ae3
+0010: 00000613 013485b3 00baa223 00caa023
+0020: 00412c83 00812c03 00c12b83 01012b03
+0030: 01412a83 01812a03 01c12983 02012903
panic at satman/main.rs:674:5: exception Exception(LoadFault) at PC 0x40009700, trap value 0x40029000

Output from llvm-objdump:

40009600 <_ZN11board_misoc6config3imp9append_at17h077770fffe9a2d09E>:
40009600: 13 01 01 fd   addi    sp, sp, -48
40009604: 23 26 11 02   sw      ra, 44(sp)
40009608: 23 24 81 02   sw      s0, 40(sp)
4000960c: 23 22 91 02   sw      s1, 36(sp)
40009610: 23 20 21 03   sw      s2, 32(sp)
40009614: 23 2e 31 01   sw      s3, 28(sp)
40009618: 23 2c 41 01   sw      s4, 24(sp)
4000961c: 23 2a 51 01   sw      s5, 20(sp)
40009620: 23 28 61 01   sw      s6, 16(sp)
40009624: 23 26 71 01   sw      s7, 12(sp)
40009628: 23 24 81 01   sw      s8, 8(sp)
4000962c: 23 22 91 01   sw      s9, 4(sp)
40009630: 13 8b 06 00   mv      s6, a3
40009634: 13 0a 06 00   mv      s4, a2
40009638: 93 84 05 00   mv      s1, a1
4000963c: 93 0a 05 00   mv      s5, a0
40009640: 33 85 f6 00   add     a0, a3, a5
40009644: 13 05 55 00   addi    a0, a0, 5
40009648: b3 06 b5 00   add     a3, a0, a1
4000964c: 93 05 10 00   addi    a1, zero, 1
40009650: 37 04 01 00   lui     s0, 16
40009654: 13 06 10 00   addi    a2, zero, 1
40009658: 63 60 d4 0c   bltu    s0, a3, 192 <_ZN11board_misoc6config3imp9append_at17h077770fffe9a2d09E+0x118>
4000965c: 93 89 07 00   mv      s3, a5
40009660: 13 09 07 00   mv      s2, a4
40009664: 93 55 85 00   srli    a1, a0, 8
40009668: 13 06 04 f0   addi    a2, s0, -256
4000966c: b3 f5 c5 00   and     a1, a1, a2
40009670: 13 56 85 01   srli    a2, a0, 24
40009674: b3 e5 c5 00   or      a1, a1, a2
40009678: 13 16 85 00   slli    a2, a0, 8
4000967c: b7 06 ff 00   lui     a3, 4080
40009680: 33 76 d6 00   and     a2, a2, a3
40009684: 13 15 85 01   slli    a0, a0, 24
40009688: 33 65 c5 00   or      a0, a0, a2
4000968c: 33 65 b5 00   or      a0, a0, a1
40009690: 23 20 a1 00   sw      a0, 0(sp)
40009694: b7 0b 04 00   lui     s7, 64
40009698: 33 85 74 01   add     a0, s1, s7
4000969c: 93 05 01 00   mv      a1, sp
400096a0: 13 06 40 00   addi    a2, zero, 4
400096a4: 97 10 00 00   auipc   ra, 1
400096a8: e7 80 40 cb   jalr    -844(ra)
400096ac: 33 8c 64 01   add     s8, s1, s6
400096b0: 93 8c 4b 00   addi    s9, s7, 4
400096b4: 33 85 94 01   add     a0, s1, s9
400096b8: 93 05 0a 00   mv      a1, s4
400096bc: 13 06 0b 00   mv      a2, s6
400096c0: 97 10 00 00   auipc   ra, 1
400096c4: e7 80 80 c9   jalr    -872(ra)
400096c8: 33 05 9c 01   add     a0, s8, s9
400096cc: b7 65 01 40   lui     a1, 262166
400096d0: 93 85 45 97   addi    a1, a1, -1676
400096d4: 13 06 10 00   addi    a2, zero, 1
400096d8: 97 10 00 00   auipc   ra, 1
400096dc: e7 80 00 c8   jalr    -896(ra)
400096e0: 93 04 5c 00   addi    s1, s8, 5
400096e4: 13 85 5b 00   addi    a0, s7, 5
400096e8: 33 05 ac 00   add     a0, s8, a0
400096ec: 93 05 09 00   mv      a1, s2
400096f0: 13 86 09 00   mv      a2, s3
400096f4: 97 10 00 00   auipc   ra, 1
400096f8: e7 80 40 c6   jalr    -924(ra)
400096fc: 37 05 00 40   lui     a0, 262144
40009700: 83 25 05 00   lw      a1, 0(a0)
40009704: 13 04 f4 ff   addi    s0, s0, -1
40009708: 13 05 45 00   addi    a0, a0, 4
4000970c: e3 1a 04 fe   bnez    s0, -12 <_ZN11board_misoc6config3imp9append_at17h077770fffe9a2d09E+0x100>
40009710: 13 06 00 00   mv      a2, zero
40009714: b3 85 34 01   add     a1, s1, s3
40009718: 23 a2 ba 00   sw      a1, 4(s5)
4000971c: 23 a0 ca 00   sw      a2, 0(s5)
40009720: 83 2c 41 00   lw      s9, 4(sp)
40009724: 03 2c 81 00   lw      s8, 8(sp)
40009728: 83 2b c1 00   lw      s7, 12(sp)
4000972c: 03 2b 01 01   lw      s6, 16(sp)
40009730: 83 2a 41 01   lw      s5, 20(sp)
40009734: 03 2a 81 01   lw      s4, 24(sp)
40009738: 83 29 c1 01   lw      s3, 28(sp)
4000973c: 03 29 01 02   lw      s2, 32(sp)
40009740: 83 24 41 02   lw      s1, 36(sp)
40009744: 03 24 81 02   lw      s0, 40(sp)
40009748: 83 20 c1 02   lw      ra, 44(sp)
4000974c: 13 01 01 03   addi    sp, sp, 48
40009750: 67 80 00 00   ret
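
For readers following the listing, the faulting instruction can be located with a little arithmetic: PC 0x40009700 is offset 0x100 into append_at, which is the `lw a1, 0(a0)` at the top of a loop whose pointer starts from `lui a0, 262144` and whose counter comes from `lui s0, 16`. The sketch below redoes that arithmetic. The helper names are made up for illustration, and it assumes s0 still holds the value loaded by `lui s0, 16` when the loop is reached (s0 is callee-saved and nothing else in the listing writes to it).

```rust
/// `lui rd, imm20` loads the 20-bit immediate into bits 31..12 of rd.
fn lui(imm20: u32) -> u32 {
    imm20 << 12
}

/// Index of the loop iteration whose 4-byte load touches `addr`,
/// or None if the loop never reaches that address.
fn faulting_iteration(base: u32, words: u32, addr: u32) -> Option<u32> {
    let offset = addr.checked_sub(base)?;
    let index = offset / 4;
    if index < words {
        Some(index)
    } else {
        None
    }
}

fn main() {
    let func_start = 0x4000_9600u32; // append_at symbol address from the listing
    let fault_pc = 0x4000_9700u32; // PC from the panic message
    let trap_value = 0x4002_9000u32; // trap value from the panic message

    let base = lui(262144); // lui a0, 262144
    let words = lui(16); // lui s0, 16 (one 4-byte load per iteration)
    let end = base as u64 + 4 * words as u64;

    println!("fault at append_at+{:#x}", fault_pc - func_start);
    println!("loop reads {:#x}..{:#x}", base, end);
    println!(
        "iteration touching the trap value: {:?}",
        faulting_iteration(base, words, trap_value)
    );
}
```

The loop therefore walks a range that starts at 0x40000000, and the trap value lies inside it.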

HarryMakes • Oct 29 '21 09:10

Stack overflow? The new firmware should detect those instead of corrupting memory. This may just be one such detection. @occheung

sbourdeauducq • Oct 29 '21 09:10

> Stack overflow? The new firmware should detect those instead of corrupting memory. This may just be one such detection. @occheung

It would be useful to know the value of _sstack_guard in the ELF file. That should be the only protected region in satman.

occheung • Oct 29 '21 09:10

Here is a small incremental update I can add to this discussion for now:

  • @occheung _sstack_guard of satman is 0x4002a000.
  • The faulting instruction most likely belongs to flush_l2_cache(), which reads 0x40000000 - 0x40040000, a span of 0x40000 bytes (twice the L2 size of 128*1024 = 0x20000 bytes).
    • According to https://github.com/m-labs/artiq/pull/1764, the bootloader is exempt from stack guards. ~~Meanwhile, I can only find calls to flush_l2_cache() in the bootloader elsewhere, and it's not called anywhere in the firmware.~~ (Correction: it's also used in analyzer and session.)
    • I'm not familiar with the purpose of flushing L2 when writing/erasing the SPI flash, and I wonder whether the stack guard can be temporarily disabled ("uninit") in this situation.
    • I should spend some time reading about PMP/guard pages.
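
The overlap suspected in the list above can be checked with plain address arithmetic. This is a sketch, not the firmware's code: the RAM base, the L2 size and the guard address come from this thread, while the factor of two (inferred from the 0x40040000 end address) and the 0x1000 guard size (mentioned later in the thread) are assumptions. All names are hypothetical.

```rust
use std::ops::Range;

const RAM_BASE: u32 = 0x4000_0000; // start of the range quoted above
const L2_SIZE: u32 = 128 * 1024; // 0x20000 bytes
const GUARD_SIZE: u32 = 0x1000; // assumed size of the protected region

/// Address range walked by the flush, assumed to be twice the L2 size.
fn flush_range() -> Range<u32> {
    RAM_BASE..RAM_BASE + 2 * L2_SIZE
}

/// True if two half-open address ranges share at least one byte.
fn overlaps(a: &Range<u32>, b: &Range<u32>) -> bool {
    a.start < b.end && b.start < a.end
}

fn main() {
    let guard_base = 0x4002_a000u32; // _sstack_guard of satman, as reported above
    let guard = guard_base..guard_base + GUARD_SIZE;
    let flush = flush_range();

    println!("flush reads {:#x}..{:#x}", flush.start, flush.end);
    println!("touches guard region: {}", overlaps(&flush, &guard));
}
```

Under these assumptions the flush cannot complete without reading from the guard region, whichever of the two candidate guard addresses discussed in this thread is the right one.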

HarryMakes • Nov 02 '21 04:11

> @occheung _sstack_guard of satman is 0x4002a000.

Either the code should now panic at 0x4002a000, or _sstack_guard was 0x40029000. The PMP region should only be 0x1000 bytes large; otherwise we have an issue with the VexRiscv CPU / PMP config (again).
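
The reasoning is that a read loop walking upward from 0x40000000 faults on the first word of the protected region, so the trap value should equal the region base. A minimal sketch with the numbers from this thread (the function name is hypothetical, and word-aligned addresses are assumed):

```rust
/// First address at which an ascending read of `scan_start..scan_end`
/// enters the region `guard_base..guard_base + guard_size`, if any.
fn first_fault(scan_start: u32, scan_end: u32, guard_base: u32, guard_size: u32) -> Option<u32> {
    let guard_end = guard_base.checked_add(guard_size)?;
    let first = scan_start.max(guard_base);
    if first < scan_end && first < guard_end {
        Some(first)
    } else {
        None
    }
}

fn main() {
    let observed = 0x4002_9000u32; // trap value from the panic message
    for guard_base in [0x4002_a000u32, 0x4002_9000u32] {
        let predicted = first_fault(0x4000_0000, 0x4004_0000, guard_base, 0x1000);
        println!(
            "guard at {:#x}: predicted {:x?}, matches observed: {}",
            guard_base,
            predicted,
            predicted == Some(observed)
        );
    }
}
```

With a 0x1000-byte region, only a guard base of 0x40029000 predicts the observed trap value.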

> I'm not familiar with the purpose of flushing L2 upon writing/erasing the SPI flash, and I wonder if it is a thing to "uninit" the stack guard in this situation.

Ah, I see the issue. The firmware is small enough that the stack guard falls inside the range read when flushing the L2 cache, so the flush unintentionally triggers the stack guard.

Currently, the PMP in satman/kernel is enabled by locking the register in machine mode, so it is not possible to turn it off. Some of the PMP regions in runtime can be turned on/off because they are enabled by switching to a lower privilege level (e.g. from machine to user). That was implemented for the spawned threads (libfringe).
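
In RISC-V PMP terms, "locking" means setting the L bit of the region's pmpcfg byte: a locked entry is also enforced in machine mode and stays read-only until reset. The sketch below builds such a configuration byte and a NAPOT address for a 0x1000-byte region. It only illustrates the privileged-spec encoding; whether ARTIQ encodes its guard as NAPOT is not stated in this thread, and the function names are hypothetical.

```rust
// Bit layout of one pmpcfg byte (RISC-V privileged specification).
const PMP_R: u8 = 1 << 0; // read permission
const PMP_W: u8 = 1 << 1; // write permission
const PMP_X: u8 = 1 << 2; // execute permission
const PMP_A_NAPOT: u8 = 0b11 << 3; // address matching: naturally aligned power of two
const PMP_L: u8 = 1 << 7; // lock: enforced in M-mode, not writable until reset

/// pmpaddr value for a naturally aligned power-of-two region of at least 8 bytes.
fn napot_addr(base: u32, size: u32) -> Option<u32> {
    if size < 8 || !size.is_power_of_two() || base % size != 0 {
        return None;
    }
    Some((base >> 2) | ((size >> 3) - 1))
}

/// Config byte for a locked NAPOT region that grants no access at all.
fn locked_no_access_cfg() -> u8 {
    PMP_L | PMP_A_NAPOT
}

/// True if the entry is locked.
fn is_locked(cfg: u8) -> bool {
    cfg & PMP_L != 0
}

/// (read, write, execute) permissions encoded in a config byte.
fn permissions(cfg: u8) -> (bool, bool, bool) {
    (cfg & PMP_R != 0, cfg & PMP_W != 0, cfg & PMP_X != 0)
}

fn main() {
    let guard_base = 0x4002_a000u32; // _sstack_guard value reported in this thread
    let cfg = locked_no_access_cfg();
    if let Some(addr) = napot_addr(guard_base, 0x1000) {
        println!("pmpaddr = {:#x}", addr);
    }
    println!(
        "pmpcfg byte = {:#04x}, locked: {}, (r, w, x) = {:?}",
        cfg,
        is_locked(cfg),
        permissions(cfg)
    );
}
```

Because a locked entry cannot be cleared by software, a flush that has to read through the guard region would need either a different PMP setup or a read range that avoids the guard.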

occheung • Nov 02 '21 06:11