[On-prem U250] Host reboots after `firesim runworkload`. 1.18.0 regression?

Open caizixian opened this issue 1 year ago • 5 comments

Background Work

FireSim Version and Hash

70ac61491c4531b935cb1964d09b660798ffb4d5

OS Setup

Linux alveo 5.15.0-92-generic #102~20.04.1-Ubuntu SMP Mon Jan 15 13:09:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

Other Setup

I followed the XDMA-based U250 documentation. https://docs.fires.im/en/1.18.0/Getting-Started-Guides/On-Premises-FPGA-Getting-Started/Xilinx-Alveo-U250-FPGAs.html

Prior to following the steps in the documentation, I reverted the FPGA to golden. https://support.xilinx.com/s/article/71757?language=en_US

Current Behavior

Host reboots after the following output.

$ sudo ./FireSim-xilinx_alveo_u250 +permissive   +macaddr0=00:12:6D:00:00:02 +blkdev0=linux-uniform0-br-base.img +niclog0=niclog0 +blkdev-log0=blkdev-log0  +trace-select=1 +trace-start=0 +trace-end=-1 +trace-output-format=0 +dwarf-file-name=linux-uniform0-br-base-bin-dwarf +autocounter-readrate=0 +autocounter-filename-base=AUTOCOUNTERFILE  +print-start=0 +print-end=-1 +linklatency0=6405 +netbw0=200 +shmemportname0=default  +domain=0x0000 +bus=0x01 +device=0x00 +function=0x0 +bar=0x0 +pci-vendor=0x10ee +pci-device=0x903f +permissive-off +prog0=linux-uniform0-br-base-bin
Using: 0000:01:00.0, BAR ID: 0, PCI Vendor ID: 0x10ee, PCI Device ID: 0x903f
Opening /sys/bus/pci/devices/0000:01:00.0/vendor
Opening /sys/bus/pci/devices/0000:01:00.0/device
examining xdma/.
examining xdma/..
examining xdma/xdma0_h2c_0
Using xdma write queue: /dev/xdma0_h2c_0
Using xdma read queue: /dev/xdma0_c2h_0
UART0 is here (stdin/stdout).
TraceRV 0: Tracing disabled, since +tracefile was not provided.
command line for program 0. argc=26:
+permissive +macaddr0=00:12:6D:00:00:02 +blkdev0=linux-uniform0-br-base.img +niclog0=niclog0 +blkdev-log0=blkdev-log0 +trace-select=1 +trace-start=0 +trace-end=-1 +trace-output-format=0 +dwarf-file-name=linux-uniform0-br-base-bin-dwarf +autocounter-readrate=0 +autocounter-filename-base=AUTOCOUNTERFILE +print-start=0 +print-end=-1 +linklatency0=6405 +netbw0=200 +shmemportname0=default +domain=0x0000 +bus=0x01 +device=0x00 +function=0x0 +bar=0x0 +pci-vendor=0x10ee +pci-device=0x903f +permissive-off linux-uniform0-br-base-bin
FireSim fingerprint: 0x46697265
TracerV: Trigger enabled from 0 to 18446744073709551615 cycles
Commencing simulation.

The reboot seems to be a hard reset, and there's no useful kernel log/syslog.
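
Since the reboot leaves no logs, one cheap check before launching the driver is to confirm that the PCI plusargs point at the device that is actually present. The sketch below is not part of FireSim; the helper names (`parse_plusargs`, `bdf_from_plusargs`, `check_device`) are hypothetical. It rebuilds the `0000:01:00.0` address from `+domain`, `+bus`, `+device`, and `+function`, then compares `+pci-vendor` and `+pci-device` against the same sysfs `vendor` and `device` files that the driver output above shows being opened.

```python
import sys
from pathlib import Path


def parse_plusargs(argv):
    """Collect +key=value plusargs into a dict; bare flags map to None."""
    args = {}
    for tok in argv:
        if not tok.startswith("+"):
            continue  # skip non-plusarg tokens such as the program name
        key, sep, value = tok[1:].partition("=")
        args[key] = value if sep else None
    return args


def bdf_from_plusargs(args):
    """Format domain/bus/device/function as a sysfs address like 0000:01:00.0."""
    domain = int(args["domain"], 16)
    bus = int(args["bus"], 16)
    device = int(args["device"], 16)
    function = int(args["function"], 16)
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


def check_device(args, root="/sys/bus/pci/devices"):
    """Return a list of mismatches between the plusargs and sysfs (empty if none)."""
    bdf = bdf_from_plusargs(args)
    problems = []
    for key, attr in (("pci-vendor", "vendor"), ("pci-device", "device")):
        path = Path(root) / bdf / attr
        if not path.exists():
            problems.append(f"{path} does not exist")
            continue
        expected = int(args[key], 16)
        actual = int(path.read_text().strip(), 16)
        if expected != actual:
            problems.append(f"{attr}: expected {expected:#06x}, found {actual:#06x}")
    return problems


if __name__ == "__main__":
    # Pass the same plusargs that are given to the FireSim driver.
    issues = check_device(parse_plusargs(sys.argv[1:]))
    for line in issues:
        print(line)
    sys.exit(1 if issues else 0)
```

In this report the IDs already match (the driver prints `Using: 0000:01:00.0 ... PCI Vendor ID: 0x10ee, PCI Device ID: 0x903f`), so a check like this only rules out a mis-addressed device; it does not explain the reset.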

Expected Behavior

Boots Linux

Other Information

No response

caizixian avatar Mar 07 '24 03:03 caizixian

1.17.1 works fine.

caizixian avatar Mar 07 '24 06:03 caizixian

Possibly related to #1692

caizixian avatar Mar 07 '24 06:03 caizixian

Solved like #1695

RealJustinNi avatar Mar 07 '24 10:03 RealJustinNi

@RealJustinNi thanks for the link.

In my case, I'm following the getting started guide and didn't elaborate any design myself. The bitstream I flashed is the one FireSim downloads https://github.com/firesim/firesim/blob/535dcdc29a930525e771f083f2b1c688884c6871/deploy/sample-backup-configs/sample_config_hwdb.yaml#L66 (1.18.0). The memory configuration file is from the same tarball. So I didn't think that reprogramming the memory device was necessary.

Regardless of the above, this still seems to be a regression, since the same steps work fine on a fresh 1.17.1 checkout.

caizixian avatar Mar 08 '24 00:03 caizixian

@caizixian Hello, about a week ago we ran into a similar issue when running FireSim 1.18.0 and 1.17.1 workloads on the U200 platform. Our bitstream is alveo_u200_firesim_rocket_singlecore_no_nic. Under 1.17.1, both firesim infrasetup and firesim runworkload run correctly, and Linux boots without any problems. Under 1.18.0, however, firesim infrasetup succeeds but firesim runworkload then freezes the host.

We ultimately resolved the issue by rebuilding the bitstream and then reprogramming the FPGA. After looking at the other issue you mentioned, we found that it was indeed an issue with the version pointer. Thank you very much!

RealJustinNi avatar Mar 10 '24 12:03 RealJustinNi