Add test cases for panic/OPAL TI for TOD failover recovery failure.

Open maheshsal opened this issue 5 years ago • 2 comments

Two commits

Commit 1: hmi: Add test case to trigger TOD topology switch.

This test triggers the TOD topology failover on all the chips to exercise the
OPAL TI and panic paths and make sure the OS does not get stuck while going down.

This test needs the following skiboot and kernel commits to pass:

skiboot:
  497734984 opal/hmi: set a flag to inform OS that TOD/TB has failed.
  ca349b836 opal/hmi: Don't retry TOD recovery if it is already in failed state.
  017da88b2 opal/hmi: Fix double unlock of hmi lock in failure path.

kernel:
  http://patchwork.ozlabs.org/patch/1051379/

Commit 2: Opal TI: Add test for OPAL TI.

Trigger a manual OPAL TI by directly setting the SCOM address provided in
the device-tree node ibm,sw-xstop-fir. This tests the basic functionality of
OPAL TI under normal circumstances.
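
As a rough illustration of the first step (finding out what to write), the sketch below decodes a raw device-tree property into big-endian 32-bit cells. The two-cell layout for ibm,sw-xstop-fir (SCOM address followed by bit position), the helper names, and the sample values are assumptions for illustration only; they are not taken from the commit.

```python
import struct
from typing import List, Tuple


def parse_dt_cells(raw: bytes) -> List[int]:
    """Decode a device-tree property blob into big-endian 32-bit cells."""
    if len(raw) % 4 != 0:
        raise ValueError("property length is not a multiple of 4 bytes")
    return list(struct.unpack(">%dI" % (len(raw) // 4), raw))


def parse_sw_xstop_fir(raw: bytes) -> Tuple[int, int]:
    """Split the property into (scom_address, bit_position).

    ASSUMPTION: the property holds exactly two cells in this order.
    """
    cells = parse_dt_cells(raw)
    if len(cells) != 2:
        raise ValueError("expected 2 cells, got %d" % len(cells))
    return cells[0], cells[1]


if __name__ == "__main__":
    # Made-up sample bytes; on a real host the blob would be read from
    # the property file under /proc/device-tree.
    sample = struct.pack(">II", 0x12345678, 31)
    addr, bit = parse_sw_xstop_fir(sample)
    print("scom address: 0x%x, bit: %d" % (addr, bit))
```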

Observations:

  • On Zaius, the panic + reboot after HMI failure works fine. But on one of the Witherspoon systems I have seen hangs in ipmi_msg_sync while dumping the dmesg buffer to nvram (pnv_platform_error_reboot->panic_flush_kmsg_end->kmsg_dump->pstore_dump->OPAL..calls..->ipmi_queue_msg_sync). I am investigating why we don't get an IPMI timeout, which would get the system out of the hang.

  • On manual OPAL TI, I see the following messages:

      3.24326|secure|SecureROM valid - enabling functionality
      4.57365|IPMI: shutdown requested

    I need to try this on a few other systems with the latest PNOR.

NOTE: The above tests verify that the system reboots successfully after a panic or OPAL TI; otherwise the test fails with an appropriate error message.
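
The pass/fail rule in the note can be sketched roughly as below. This is not how the test cases are implemented; the function name, the idea of scanning a captured console log, and the sample patterns are all assumptions for illustration.

```python
import re
from typing import Optional


def check_reboot_outcome(console_log: str,
                         down_pattern: str,
                         up_pattern: str) -> Optional[str]:
    """Return None if the system went down and came back, else an error.

    down_pattern: regex expected when the system goes down (panic / OPAL TI).
    up_pattern:   regex expected once the system has rebooted.
    Both patterns are supplied by the caller (hypothetical values below).
    """
    down = re.search(down_pattern, console_log)
    if down is None:
        return "expected failure signature not seen: %r" % down_pattern
    # Only accept the "up" marker if it appears after the failure signature.
    if re.search(up_pattern, console_log[down.end():]) is None:
        return "system did not come back up after %r" % down_pattern
    return None


if __name__ == "__main__":
    # Made-up console excerpt, for illustration only.
    log = "Kernel panic - not syncing: example\n...booting...\nhost login: "
    error = check_reboot_outcome(log, r"Kernel panic", r"login:")
    print("PASS" if error is None else "FAIL: " + error)
```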

Tests can be run independently with the options below:

  --run testcases.OpTestHMIHandling.OpalTI
  --run testcases.OpTestHMIHandling.TodTopologyFailoverOpalTI
  --run testcases.OpTestHMIHandling.TodTopologyFailoverPanic
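
For running the three tests one after another, a small helper along these lines could build one command per test. The runner path ./op-test is an assumption, and any machine or config arguments a real run needs are left out.

```python
from typing import List

# Test names as given in this PR description.
TESTS = [
    "testcases.OpTestHMIHandling.OpalTI",
    "testcases.OpTestHMIHandling.TodTopologyFailoverOpalTI",
    "testcases.OpTestHMIHandling.TodTopologyFailoverPanic",
]


def build_commands(tests: List[str], runner: str = "./op-test") -> List[List[str]]:
    """Build one argument list per test so each runs independently.

    ASSUMPTION: `runner` is the path to the op-test launcher; adjust as needed.
    """
    return [[runner, "--run", test] for test in tests]


if __name__ == "__main__":
    for argv in build_commands(TESTS):
        # Print only; pass argv to subprocess.run() to actually execute.
        print(" ".join(argv))
```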

maheshsal, Mar 11 '19 06:03

Observations:

On Zaius, the panic + reboot after HMI failure works fine. But on one of the Witherspoon systems I have seen hangs in ipmi_msg_sync while dumping the dmesg buffer to nvram (pnv_platform_error_reboot->panic_flush_kmsg_end->kmsg_dump->pstore_dump->OPAL..calls..->ipmi_queue_msg_sync). I am investigating why we don't get an IPMI timeout, which would get the system out of the hang.

The Witherspoon hang mentioned above is now fixed by the skiboot patch at http://patchwork.ozlabs.org/patch/1061289/

maheshsal, Mar 22 '19 15:03

Can you please rebase this PR?

-Vasant

hegdevasant, Feb 20 '20 08:02