artiq
Sayma AMC: Exception(LoadFault) caught when writing JESD-related config
(Note: this is a preliminary bug report, information is incomplete)
Bug Report
One-Line Summary
With c940f104f16286ae643ef59f38a20d59bde9a239 and a WIP fix (will be disclosed later) for DRTIO, on Sayma AMC the CPU will always panic at (apparently) board_misoc::config::append_at() when data for the config key sysref_ddmtd_phase_fpga is absent.
Issue Details
Steps to Reproduce
(to be elaborated)
- Make sure to erase the entire flash on Sayma AMC.
- Reflash the ARTIQ-7 dev gateware and firmware to Sayma AMC, and reboot.
Expected Behavior
The calibration config values should be written properly to the SPI flash without panic.
Actual (undesired) Behavior
The firmware panics with the following serial log (NB the exception and the PC value where it is caught are always the same for the same piece of firmware):
Loading slave FPGA gateware...
magic: 0x5352544d, length: 0x0016e760
...done
Booting from flash...
Starting firmware.
[ 0.000008s] INFO(satman): ARTIQ satellite manager starting...
[ 0.005192s] INFO(satman): software ident 7.unknown.beta;satellite.pattern
[ 0.012148s] INFO(satman): gateware ident 7.unknown.beta;satellite.pattern
[ 0.271707s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 2.176172s] INFO(board_artiq::si5324): ...locked
[ 2.252095s] INFO(satman::repeater): [REP#0] link RX became up, pinging
[ 2.294086s] INFO(satman): uplink is up, switching to recovered clock
[ 2.327013s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 3.934913s] INFO(board_artiq::si5324): ...locked
[ 6.324969s] INFO(board_artiq::si5324::siphaser): calibration successful, lead: 29, width: 432 (347deg)
[ 8.158089s] INFO(satman::repeater): [REP#0] remote replied after 19 packets
[ 8.188237s] INFO(satman::jdcg::jdac): DAC-0 initializing...
[ 8.745984s] INFO(satman::jdcg::jdac): ...done initializing
[ 8.750420s] INFO(satman::jdcg::jdac): DAC-1 initializing...
[ 9.309450s] INFO(satman::jdcg::jdac): ...done initializing
[ 9.313887s] INFO(satman::jdcg::jesd204sync): testing DDMTD stability (raw=true, tolerance=4)...
[ 9.643520s] INFO(satman::jdcg::jesd204sync): ...passed, peak-peak jitter: 4
[ 9.649433s] INFO(satman::jdcg::jesd204sync): testing DDMTD stability (raw=false, tolerance=1)...
[ 10.138147s] INFO(satman::jdcg::jesd204sync): ...passed, peak-peak jitter: 1
[ 10.144063s] INFO(satman::jdcg::jesd204sync): testing HMC7043 SYSREF slip against DDMTD...
[ 11.280594s] INFO(satman::jdcg::jesd204sync): ...passed
[ 11.284685s] INFO(satman::jdcg::jesd204sync): determining SYSREF S/H limits...
[ 11.857149s] INFO(satman::jdcg::jesd204sync): SYSREF S/H average limits (DDMTD phases): 57 121
[ 11.864633s] INFO(satman::jdcg::jesd204sync): SYSREF S/H maximum limit deviation: 3 3
[ 11.872706s] INFO(satman::jdcg::jesd204sync): ...done
[ 11.878016s] INFO(satman::jdcg::jesd204sync): calibrating SYSREF DDMTD target phase...
[ 11.885992s] INFO(satman::jdcg::jesd204sync): SYSREF calibration coarse target: 89
[ 11.907735s] INFO(satman::jdcg::jesd204sync): ...done, target=97
@ 0x40009700
+0000: 00052583 fff40413 00450513 fe041ae3
+0010: 00000613 013485b3 00baa223 00caa023
+0020: 00412c83 00812c03 00c12b83 01012b03
+0030: 01412a83 01812a03 01c12983 02012903
panic at satman/main.rs:674:5: exception Exception(LoadFault) at PC 0x40009700, trap value 0x40029000
Output from llvm-objdump:
40009600 <_ZN11board_misoc6config3imp9append_at17h077770fffe9a2d09E>:
40009600: 13 01 01 fd addi sp, sp, -48
40009604: 23 26 11 02 sw ra, 44(sp)
40009608: 23 24 81 02 sw s0, 40(sp)
4000960c: 23 22 91 02 sw s1, 36(sp)
40009610: 23 20 21 03 sw s2, 32(sp)
40009614: 23 2e 31 01 sw s3, 28(sp)
40009618: 23 2c 41 01 sw s4, 24(sp)
4000961c: 23 2a 51 01 sw s5, 20(sp)
40009620: 23 28 61 01 sw s6, 16(sp)
40009624: 23 26 71 01 sw s7, 12(sp)
40009628: 23 24 81 01 sw s8, 8(sp)
4000962c: 23 22 91 01 sw s9, 4(sp)
40009630: 13 8b 06 00 mv s6, a3
40009634: 13 0a 06 00 mv s4, a2
40009638: 93 84 05 00 mv s1, a1
4000963c: 93 0a 05 00 mv s5, a0
40009640: 33 85 f6 00 add a0, a3, a5
40009644: 13 05 55 00 addi a0, a0, 5
40009648: b3 06 b5 00 add a3, a0, a1
4000964c: 93 05 10 00 addi a1, zero, 1
40009650: 37 04 01 00 lui s0, 16
40009654: 13 06 10 00 addi a2, zero, 1
40009658: 63 60 d4 0c bltu s0, a3, 192 <_ZN11board_misoc6config3imp9append_at17h077770fffe9a2d09E+0x118>
4000965c: 93 89 07 00 mv s3, a5
40009660: 13 09 07 00 mv s2, a4
40009664: 93 55 85 00 srli a1, a0, 8
40009668: 13 06 04 f0 addi a2, s0, -256
4000966c: b3 f5 c5 00 and a1, a1, a2
40009670: 13 56 85 01 srli a2, a0, 24
40009674: b3 e5 c5 00 or a1, a1, a2
40009678: 13 16 85 00 slli a2, a0, 8
4000967c: b7 06 ff 00 lui a3, 4080
40009680: 33 76 d6 00 and a2, a2, a3
40009684: 13 15 85 01 slli a0, a0, 24
40009688: 33 65 c5 00 or a0, a0, a2
4000968c: 33 65 b5 00 or a0, a0, a1
40009690: 23 20 a1 00 sw a0, 0(sp)
40009694: b7 0b 04 00 lui s7, 64
40009698: 33 85 74 01 add a0, s1, s7
4000969c: 93 05 01 00 mv a1, sp
400096a0: 13 06 40 00 addi a2, zero, 4
400096a4: 97 10 00 00 auipc ra, 1
400096a8: e7 80 40 cb jalr -844(ra)
400096ac: 33 8c 64 01 add s8, s1, s6
400096b0: 93 8c 4b 00 addi s9, s7, 4
400096b4: 33 85 94 01 add a0, s1, s9
400096b8: 93 05 0a 00 mv a1, s4
400096bc: 13 06 0b 00 mv a2, s6
400096c0: 97 10 00 00 auipc ra, 1
400096c4: e7 80 80 c9 jalr -872(ra)
400096c8: 33 05 9c 01 add a0, s8, s9
400096cc: b7 65 01 40 lui a1, 262166
400096d0: 93 85 45 97 addi a1, a1, -1676
400096d4: 13 06 10 00 addi a2, zero, 1
400096d8: 97 10 00 00 auipc ra, 1
400096dc: e7 80 00 c8 jalr -896(ra)
400096e0: 93 04 5c 00 addi s1, s8, 5
400096e4: 13 85 5b 00 addi a0, s7, 5
400096e8: 33 05 ac 00 add a0, s8, a0
400096ec: 93 05 09 00 mv a1, s2
400096f0: 13 86 09 00 mv a2, s3
400096f4: 97 10 00 00 auipc ra, 1
400096f8: e7 80 40 c6 jalr -924(ra)
400096fc: 37 05 00 40 lui a0, 262144
40009700: 83 25 05 00 lw a1, 0(a0)
40009704: 13 04 f4 ff addi s0, s0, -1
40009708: 13 05 45 00 addi a0, a0, 4
4000970c: e3 1a 04 fe bnez s0, -12 <_ZN11board_misoc6config3imp9append_at17h077770fffe9a2d09E+0x100>
40009710: 13 06 00 00 mv a2, zero
40009714: b3 85 34 01 add a1, s1, s3
40009718: 23 a2 ba 00 sw a1, 4(s5)
4000971c: 23 a0 ca 00 sw a2, 0(s5)
40009720: 83 2c 41 00 lw s9, 4(sp)
40009724: 03 2c 81 00 lw s8, 8(sp)
40009728: 83 2b c1 00 lw s7, 12(sp)
4000972c: 03 2b 01 01 lw s6, 16(sp)
40009730: 83 2a 41 01 lw s5, 20(sp)
40009734: 03 2a 81 01 lw s4, 24(sp)
40009738: 83 29 c1 01 lw s3, 28(sp)
4000973c: 03 29 01 02 lw s2, 32(sp)
40009740: 83 24 41 02 lw s1, 36(sp)
40009744: 03 24 81 02 lw s0, 40(sp)
40009748: 83 20 c1 02 lw ra, 44(sp)
4000974c: 13 01 01 03 addi sp, sp, 48
40009750: 67 80 00 00 ret
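Reading the disassembly above, append_at appears to compute a record size of key length + value length + 5, reject the write when offset plus that size exceeds 0x10000, and then write a 4-byte byte-swapped (big-endian) length, the key, one separator byte, and the value, before entering the load loop at 0x400096fc. Below is a minimal Rust sketch of that layout, writing into an in-memory buffer instead of SPI flash. The names `encode_record`, `RecordError` and `CONFIG_SIZE` are made up for illustration, and the separator being a zero byte is an assumption (the disassembly only shows a one-byte constant).

```rust
/// Size of the config region, from the `lui s0, 16` bound check (0x10000 bytes).
pub const CONFIG_SIZE: usize = 0x10000;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum RecordError {
    SpaceExhausted,
}

/// Appends one record at `offset` in `region` and returns the offset after it.
/// Layout: 4-byte big-endian record size, key, one separator byte, value.
pub fn encode_record(
    region: &mut [u8],
    offset: usize,
    key: &[u8],
    value: &[u8],
) -> Result<usize, RecordError> {
    // 4 length bytes + key + 1 separator + value, i.e. key + value + 5.
    let record_size = 4 + key.len() + 1 + value.len();
    if offset + record_size > region.len() {
        return Err(RecordError::SpaceExhausted);
    }
    let mut pos = offset;
    let mut write = |payload: &[u8]| {
        region[pos..pos + payload.len()].copy_from_slice(payload);
        pos += payload.len();
    };
    write(&(record_size as u32).to_be_bytes());
    write(key);
    write(&[0u8]); // separator, assumed to be zero
    write(value);
    Ok(offset + record_size)
}
```

In the firmware the writes go to flash and are followed by the L2 flush loop, which is where the fault is reported; this sketch deliberately leaves both out.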
Stack overflow? The new firmware should detect those instead of corrupting memory. This may just be one such detection. @occheung
It would be useful to know the value of _sstack_guard in the elf file. That should be the only protected region in satman.
Here's a small incremental input I can give this discussion at the moment:
- @occheung _sstack_guard of satman is 0x4002a000.
- It's likely the problematic instruction refers to flush_l2_cache(), which reads 0x40000000 - 0x40040000 (L2 size being 128*1024 = 0x20000 bytes).
- By reading https://github.com/m-labs/artiq/pull/1764, the bootloader is exempt from stack guards. ~~Meanwhile, I can only find calls to flush_l2_cache() in the bootloader elsewhere, and it's not called anywhere in the firmware.~~ (Correction: it's also used in analyzer and session.)
- I'm not familiar with the purpose of flushing L2 upon writing/erasing the SPI flash, and I wonder if it is a thing to "uninit" the stack guard in this situation.
- I should spend some time to read about PMP/guard pages.
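The address range in the second bullet can be checked with a little arithmetic. The sketch below assumes, based on the disassembly (`lui a0, 262144` sets the start to 0x40000000, and `lui s0, 16` gives 0x10000 iterations of a 4-byte load), that the flush loop reads twice the L2 size starting at the base of main RAM. The constant and function names are illustrative, not taken from the firmware.

```rust
/// Start of the sweep, from `lui a0, 262144`.
pub const MAIN_RAM_BASE: u32 = 0x4000_0000;
/// L2 size as quoted in the thread: 128 * 1024 = 0x20000 bytes.
pub const L2_SIZE: u32 = 128 * 1024;

/// Half-open address range [start, end) touched by the flush loop,
/// assuming it reads 2 * L2_SIZE bytes in 4-byte words.
pub fn flush_sweep() -> (u32, u32) {
    (MAIN_RAM_BASE, MAIN_RAM_BASE + 2 * L2_SIZE)
}

/// Number of word loads the loop performs under the same assumption.
pub fn flush_iterations() -> u32 {
    2 * L2_SIZE / 4
}

/// True if a guard region [guard_base, guard_base + guard_size) overlaps the sweep.
pub fn sweep_hits_guard(guard_base: u32, guard_size: u32) -> bool {
    let (start, end) = flush_sweep();
    guard_base < end && guard_base + guard_size > start
}
```

Under these assumptions the sweep covers 0x40000000 to 0x40040000, so a guard page at 0x4002a000 lies inside it and a plain load from the flush loop can trip the guard.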
@occheung _sstack_guard of satman is 0x4002a000.
Either the code should now panic at 0x4002a000, or _sstack_guard was 0x40029000. The PMP region should only be 0x1000 bytes large; otherwise we have an issue with the VexRiscv CPU / PMP config (again).
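To make the mismatch concrete: the flush loop walks addresses upward in 4-byte steps, so the first load that faults should be at the lowest word of the protected region. A small sketch, assuming a 0x1000-byte guard and the 0x40000000 to 0x40040000 sweep mentioned earlier in the thread (the helper name `first_fault` is made up for illustration):

```rust
/// First word address in an ascending sweep [start, end) that falls inside
/// the guard region [guard_base, guard_base + guard_size), if any.
pub fn first_fault(start: u32, end: u32, guard_base: u32, guard_size: u32) -> Option<u32> {
    let guard_end = guard_base + guard_size;
    let mut addr = start;
    while addr < end {
        if addr >= guard_base && addr < guard_end {
            return Some(addr);
        }
        addr += 4; // the loop in the disassembly advances one word at a time
    }
    None
}
```

With a guard base of 0x4002a000 this returns 0x4002a000. The reported trap value, 0x40029000, is what a guard base of 0x40029000 would produce.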
I'm not familiar with the purpose of flushing L2 upon writing/erasing the SPI flash, and I wonder if it is a thing to "uninit" the stack guard in this situation.
Ah, I see the issue. The firmware is small enough that the stack guard falls inside the address range read when flushing the L2 cache, so the flush unintentionally triggers the stack guard.
Currently, the PMP in satman/kernel is enabled by locking the register in machine mode. It is not possible to turn it off. Some of the PMP regions in runtime can be turned on/off because it is enabled by switching to a lower privilege level (e.g. from machine to user). It was implemented for the spawned threads (libfringe).
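For readers unfamiliar with PMP locking: the RISC-V privileged specification defines a per-entry configuration byte whose top bit (L) locks the entry until reset and makes it apply to machine mode as well, plus a NAPOT address encoding for power-of-two regions. The sketch below only computes those register values for a 0x1000-byte guard. It does not write any CSRs, and whether satman uses exactly this encoding is an assumption; the function names are illustrative.

```rust
// Bit layout of a pmpcfg entry, per the RISC-V privileged specification.
pub const PMP_R: u8 = 1 << 0; // read permission
pub const PMP_W: u8 = 1 << 1; // write permission
pub const PMP_X: u8 = 1 << 2; // execute permission
pub const PMP_NAPOT: u8 = 0b11 << 3; // A field: naturally aligned power-of-two region
pub const PMP_L: u8 = 1 << 7; // lock: enforced in M-mode too, cleared only by reset

/// pmpaddr value for a NAPOT region. `size` must be a power of two >= 8
/// and `base` must be aligned to `size`.
pub fn napot_pmpaddr(base: u32, size: u32) -> u32 {
    assert!(size.is_power_of_two() && size >= 8 && base % size == 0);
    (base >> 2) | ((size >> 3) - 1)
}

/// Config byte for a locked guard page with no R/W/X permissions.
pub fn locked_guard_cfg() -> u8 {
    PMP_L | PMP_NAPOT
}
```

Because the L bit can only be cleared by a reset, a guard configured this way cannot be lifted temporarily around the cache flush, which matches the limitation described above.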