
Kola ext.config.shared.kdump.crash and coreos.boot-mirror.luks tests are failing for ppc64le

Open ravanelli opened this issue 3 years ago • 3 comments

Jenkins log: https://jenkins-rhcos.cloud.p8.psi.redhat.com/job/rhcos/job/rhcos-rhcos-4.11/129/consoleFull

  • ext.config.shared.kdump.crash:

This test hits a kernel panic:

[    0.293216] Unable to handle kernel paging request for data at address 0xc00000002ffb0000
[    0.293294] Faulting instruction address: 0xc0000000080a94e0
[    0.293358] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.293407] LE SMP NR_CPUS=2048 NUMA pSeries
[    0.293460] Modules linked in:
[    0.293502] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-348.el8.ppc64le #1
[    0.293577] NIP:  c0000000080a94e0 LR: c000000008444330 CTR: 0000000000000800
[    0.293652] REGS: c0000000104931d0 TRAP: 0300   Not tainted  (4.18.0-348.el8.ppc64le)
[    0.293727] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24000220  XER: 00000000
[    0.293805] CFAR: 00003fff916c6de4 DAR: c00000002ffb0000 DSISR: 40000000 IRQMASK: 0 
[    0.293805] GPR00: 000000000e000000 c000000010493460 c000000009cd9100 c00000000e480000 
[    0.293805] GPR04: c00000002ffb0000 0000000000040000 0000000000000800 03ffffffff1b8000 
[    0.293805] GPR08: 0000000080000000 0000000000000010 0000000000000020 0000000000000030 
[    0.293805] GPR12: 0000000000000040 c00000000a950400 0000000000000050 0000000000000060 
[    0.293805] GPR16: 0000000000000070 0000000000000000 0000000000000000 0000000000000000 
[    0.293805] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000010eb9fe0 
[    0.293805] GPR24: c00000000a9176d0 c000000008fb1dd8 c000000008fb1c50 c00000000914c030 
[    0.293805] GPR28: c000000017ff470c c00000002ffb0000 c00000000e480000 0000000000040000 
[    0.294458] NIP [c0000000080a94e0] memcpy_power7+0x400/0x7e0
[    0.294523] LR [c000000008444330] kmemdup+0x50/0x80
[    0.294573] Call Trace:
[    0.294600] [c000000010493460] [c0000000080a9354] memcpy_power7+0x274/0x7e0 (unreliable)
[    0.294676] [c000000010493560] [c000000008444330] kmemdup+0x50/0x80
[    0.294741] [c0000000104935a0] [c00000000893f4f0] tpm_read_log_of+0x110/0x1f0
[    0.294816] [c000000010493630] [c00000000893e154] tpm_bios_log_setup+0x74/0x270
[    0.294892] [c0000000104936c0] [c0000000089360b8] tpm_chip_register+0xb8/0x3b0
[    0.294968] [c000000010493750] [c000000008946de0] tpm_ibmvtpm_probe+0x460/0x790
[    0.295045] [c000000010493830] [c000000008120394] vio_bus_probe+0xa4/0x544
[    0.295108] [c000000010493900] [c00000000896554c] driver_probe_device+0x18c/0x8d0
[    0.295184] [c0000000104939b0] [c000000008966168] __driver_attach+0x1a8/0x290
[    0.295260] [c000000010493a30] [c000000008961148] bus_for_each_dev+0xa8/0x130
[    0.295343] [c000000010493a90] [c000000008964384] driver_attach+0x34/0x50
[    0.295407] [c000000010493ab0] [c0000000089638e8] bus_add_driver+0x228/0x300
[    0.295482] [c000000010493b40] [c000000008967834] driver_register+0xb4/0x1c0
[    0.295559] [c000000010493bb0] [c00000000811c830] __vio_register_driver+0x80/0xe0
[    0.295636] [c000000010493c30] [c0000000096475fc] ibmvtpm_module_init+0x34/0x48
[    0.295712] [c000000010493c50] [c0000000080102d4] do_one_initcall+0x64/0x280
[    0.295789] [c000000010493d20] [c0000000095e486c] kernel_init_freeable+0x388/0x444
[    0.295864] [c000000010493db0] [c000000008010688] kernel_init+0x24/0x148
[    0.295929] [c000000010493e20] [c00000000800b7d8] ret_from_kernel_thread+0x5c/0x64
[    0.296003] Instruction dump:
[    0.296043] fa010080 39800040 39c00050 39e00060 3a000070 7cc903a6 48000018 60000000 
[    0.296120] 60000000 60000000 60000000 60000000 <7ce020ce> 7cc448ce 7ca450ce 7c8458ce 
[    0.296199] ---[ end trace d8dc57b279544e4f ]---
[    0.297963] 

[console.txt](https://github.com/coreos/coreos-assembler/files/8144854/console.txt)
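
For anyone triaging a trace like the one above, here is a minimal sketch that prints only the symbol names from the "Call Trace:" section of a saved console log. The function name `oops_symbols` is made up for illustration, and it assumes the log has the same layout as the oops shown here (a "Call Trace:" line followed later by an "Instruction dump:" line).

```shell
# Hypothetical helper: print the symbol names from the "Call Trace:" section
# of a kernel oops log. Assumes the log layout shown in this issue.
# Usage: oops_symbols console.txt
oops_symbols() {
    # Keep only the lines between "Call Trace:" and "Instruction dump:",
    # pick out each "symbol+0xOFFSET/0xSIZE" token, then drop the offsets.
    sed -n '/Call Trace:/,/Instruction dump:/p' "$1" |
        grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' |
        sed 's/+0x.*//'
}
```

Run against the attached console.txt, this should list the frames from `memcpy_power7` down to `ret_from_kernel_thread`, which makes the `tpm_ibmvtpm_probe` path easier to spot.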

It looks like it could be related to:

virtioblk_transfer failed! type=0, status = 1
virtioblk_transfer failed! type=0, status = 1
virtioblk_transfer: Access beyond end of device!
virtioblk_transfer: Access beyond end of device!
virtioblk_transfer: Access beyond end of device!
error: ../../grub-core/term/serial.c:217:serial port `com0' isn't found.
error: ../../grub-core/commands/terminal.c:138:terminal `serial' isn't found.
error: ../../grub-core/commands/terminal.c:138:terminal `serial' isn't found.

Kola logs: journal.txt console.txt journal.txt

  • coreos.boot-mirror.luks:

For this test the server never booted. It looks like both issues are related to virtioblk_transfer: Access beyond end of device!

Logs for coreos.boot-mirror.luks: console.txt journal.txt

ravanelli avatar Feb 25 '22 21:02 ravanelli

@bgilbert Do you know if it could be something disk-related?

Kernel command line: BOOT_IMAGE=(ieee1275/disk1,gpt3)/ostree/rhcos-7db1cacf17f2804e991baf5b3f033706ac3c1ce1f0de768270902db7be6076f9/vmlinuz-4.18.0-348.el8.ppc64le random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=qemu ostree=/ostree/boot.1/rhcos/7db1cacf17f2804e991baf5b3f033706ac3c1ce1f0de768270902db7be6076f9/0 rd.md.uuid=11dc3d66:d9c7ed2d:9e3b668e:7d0e9960 rd.luks.name=b76ad3be-62bf-444c-8c5f-8ad28c328a4a=root root=UUID=44b7cf85-25f3-44fb-a18c-5b788584bad8 rw rootflags=prjquota boot=UUID=5e8ce883-d900-4e63-b00b-4dd7d91336ee

ravanelli avatar Feb 25 '22 21:02 ravanelli

The ext.config.shared.kdump.crash one is odd. The regular kernel boots fine, and then the crashdump kernel crashes in a TPM driver before we ever reach userspace.

coreos.boot-mirror.luks removes a disk and then verifies that we can boot from the other one, but we apparently can't. I assume this test has previously worked on ppc64le? It could be a firmware issue, or it could be that we're no longer setting up the bootloader or mirroring correctly.

bgilbert avatar Feb 26 '22 01:02 bgilbert

@bgilbert Thanks for the comments. I tried some downgrades, but they don't seem to help. We will need to debug it more to see whether these tests should run on ppc64le. So far, it seems these tests never really worked.

ravanelli avatar Mar 03 '22 14:03 ravanelli

With upstream commit bd7dc90 the ext.config.kdump.crash test passes again.

This is tagged in v6.1-rc1+. I tested with the latest rawhide, which has kernel-6.1.0-0.rc2.21.fc38.ppc64le.
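
A rough sketch for checking whether a given kernel is at least 6.1, assuming GNU `sort` with `-V` is available. The helper name `version_ge` is made up for illustration. This is only a heuristic: a distribution kernel with an older version number (such as the 4.18 RHEL kernel in the original report) may or may not carry a backport, and the version string alone can't tell you.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Compare the running kernel against 6.1, the first release series
# mentioned above as containing the fix.
if version_ge "$(uname -r)" "6.1"; then
    echo "kernel is 6.1 or newer; the upstream fix should be present"
else
    echo "kernel predates 6.1; the fix is only present if it was backported"
fi
```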

dustymabe avatar Oct 26 '22 20:10 dustymabe

The test passed with kernel: 6.1.7-200.fc37.ppc64le

[coreos-assembler]$ cosa kola run  --build 37.20230122.20.0 ext.config.kdump.crash
kola -p qemu-unpriv --build 37.20230122.20.0 --output-dir tmp/kola run ext.config.kdump.crash
⚠️  Skipping kola test pattern "fcos.internet":
  👉 https://github.com/coreos/coreos-assembler/pull/1478
⚠️  Skipping kola test pattern "podman.workflow":
  👉 https://github.com/coreos/coreos-assembler/pull/1478
⚠️  Skipping kola test pattern "coreos.boot-mirror.luks":
  👉 https://github.com/coreos/coreos-assembler/issues/2725
⚠️  Skipping kola test pattern "coreos.boot-mirror":
  👉 https://github.com/coreos/coreos-assembler/issues/2725
🕒 Snoozing kola test pattern "ext.config.platforms.aws.nvme" until Feb 10 2023:
  👉 https://github.com/coreos/fedora-coreos-tracker/issues/1306#issuecomment-1378864963
=== RUN   ext.config.kdump.crash
--- PASS: ext.config.kdump.crash (128.58s)
PASS, output in tmp/kola

Shilpi-Das1 avatar Jan 30 '23 11:01 Shilpi-Das1

@Shilpi-Das1 Did you get to retest the failing tests due to this issue for RHCOS? We have multiple tests in our rhcos-denylist due to this issue. Can you run those tests and update the list accordingly?

gursewak1997 avatar Mar 20 '23 16:03 gursewak1997