[On-prem U250] Host reboots after `firesim runworkload`. 1.18.0 regression?
Background Work
- [X] Yes, I searched the mailing list
- [X] Yes, I searched prior issues
- [X] Yes, I searched the documentation
FireSim Version and Hash
70ac61491c4531b935cb1964d09b660798ffb4d5
OS Setup
Linux alveo 5.15.0-92-generic #102~20.04.1-Ubuntu SMP Mon Jan 15 13:09:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
Other Setup
I followed the XDMA-based U250 documentation. https://docs.fires.im/en/1.18.0/Getting-Started-Guides/On-Premises-FPGA-Getting-Started/Xilinx-Alveo-U250-FPGAs.html
Prior to following the steps in the documentation, I reverted the FPGA to golden. https://support.xilinx.com/s/article/71757?language=en_US
Current Behavior
Host reboots after the following output.
$ sudo ./FireSim-xilinx_alveo_u250 +permissive +macaddr0=00:12:6D:00:00:02 +blkdev0=linux-uniform0-br-base.img +niclog0=niclog0 +blkdev-log0=blkdev-log0 +trace-select=1 +trace-start=0 +trace-end=-1 +trace-output-format=0 +dwarf-file-name=linux-uniform0-br-base-bin-dwarf +autocounter-readrate=0 +autocounter-filename-base=AUTOCOUNTERFILE +print-start=0 +print-end=-1 +linklatency0=6405 +netbw0=200 +shmemportname0=default +domain=0x0000 +bus=0x01 +device=0x00 +function=0x0 +bar=0x0 +pci-vendor=0x10ee +pci-device=0x903f +permissive-off +prog0=linux-uniform0-br-base-bin
Using: 0000:01:00.0, BAR ID: 0, PCI Vendor ID: 0x10ee, PCI Device ID: 0x903f
Opening /sys/bus/pci/devices/0000:01:00.0/vendor
Opening /sys/bus/pci/devices/0000:01:00.0/device
examining xdma/.
examining xdma/..
examining xdma/xdma0_h2c_0
Using xdma write queue: /dev/xdma0_h2c_0
Using xdma read queue: /dev/xdma0_c2h_0
UART0 is here (stdin/stdout).
TraceRV 0: Tracing disabled, since +tracefile was not provided.
command line for program 0. argc=26:
+permissive +macaddr0=00:12:6D:00:00:02 +blkdev0=linux-uniform0-br-base.img +niclog0=niclog0 +blkdev-log0=blkdev-log0 +trace-select=1 +trace-start=0 +trace-end=-1 +trace-output-format=0 +dwarf-file-name=linux-uniform0-br-base-bin-dwarf +autocounter-readrate=0 +autocounter-filename-base=AUTOCOUNTERFILE +print-start=0 +print-end=-1 +linklatency0=6405 +netbw0=200 +shmemportname0=default +domain=0x0000 +bus=0x01 +device=0x00 +function=0x0 +bar=0x0 +pci-vendor=0x10ee +pci-device=0x903f +permissive-off linux-uniform0-br-base-bin
FireSim fingerprint: 0x46697265
TracerV: Trigger enabled from 0 to 18446744073709551615 cycles
Commencing simulation.
The reboot seems to be a hard reset, and there's no useful kernel log/syslog.
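Since the host resets before anything reaches the kernel log, one cheap check before launching the driver is to confirm that the card still enumerates with the IDs passed via `+pci-vendor` and `+pci-device`. The sketch below is only an illustration, not part of FireSim: the helper names `read_pci_id` and `matches_expected` are hypothetical, and it assumes the same sysfs files the driver opens in the log above (`/sys/bus/pci/devices/0000:01:00.0/vendor` and `.../device`).

```python
from pathlib import Path

# Hypothetical helpers (not FireSim APIs): read the PCI IDs that sysfs exposes.
def read_pci_id(bdf, field, sysfs_root="/sys/bus/pci/devices"):
    """Return the integer in <sysfs_root>/<bdf>/<field>, e.g. field='vendor'."""
    text = (Path(sysfs_root) / bdf / field).read_text().strip()
    return int(text, 16)  # sysfs stores these as hex strings such as "0x10ee"

def matches_expected(bdf, vendor, device, sysfs_root="/sys/bus/pci/devices"):
    """True if the function at `bdf` reports the expected vendor and device IDs."""
    return (read_pci_id(bdf, "vendor", sysfs_root) == vendor
            and read_pci_id(bdf, "device", sysfs_root) == device)

if __name__ == "__main__":
    # Values taken from the driver command line in this report.
    try:
        ok = matches_expected("0000:01:00.0", 0x10EE, 0x903F)
        print("IDs match" if ok else "IDs differ from the driver arguments")
    except OSError as err:
        print(f"could not read sysfs entry: {err}")
```

A match only shows that the card enumerates as expected; it says nothing about whether the flashed bitstream matches the driver build, which is what the later comments point to.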
Expected Behavior
Boots Linux
Other Information
No response
1.17.1 works fine.
Possibly related to #1692
Solved like #1695
@RealJustinNi thanks for the link.
In my case, I'm following the getting started guide and didn't elaborate any design myself. The flashed bitstream is the one FireSim downloads https://github.com/firesim/firesim/blob/535dcdc29a930525e771f083f2b1c688884c6871/deploy/sample-backup-configs/sample_config_hwdb.yaml#L66 (1.18.0). The memory configuration file is from the same tarball, so I didn't think reprogramming the memory device was necessary.
Regardless of the above, this still looks like a regression: the same steps work fine on a fresh 1.17.1 checkout.
@caizixian Hello, about a week ago we ran into a similar issue when running workloads with FireSim 1.18.0 and 1.17.1 on the U200 platform. Our bitstream is alveo_u200_firesim_rocket_singlecore_no_nic. Under 1.17.1, both `firesim infrasetup` and `firesim runworkload` run correctly and Linux boots without any problems. Under 1.18.0, however, `firesim infrasetup` succeeds but `firesim runworkload` freezes the host.
We ultimately resolved the issue by rebuilding the bitstream and then re-programming the FPGA. After looking at the other issue you mentioned, we found that it was indeed a problem with the version pointer. Thank you very much!