openvmm icon indicating copy to clipboard operation
openvmm copied to clipboard

Adding cancelcontext to closing vmbus channels

Open erfrimod opened this issue 11 months ago • 0 comments

shutdown_vf_devices(), the method where nic_shutdown() is called, doesn't have a CancelContext. The wait for vmbus operations during unload_for_servicing() is unbounded and we have seen shutdown_mana() wait indefinitely.

shutdown_pci_devices() returns a ShutdownError. Adding similar logic to shutdown_vf_devices() to return an error so we know where in shutdown_mana() the delay is coming from.

  • CancelContext added.
  • Timouts return a "deadline exceeded" error that gets traced. unload_for_servicing() proceeds.
  • Completing shutdown_vf_devices() without closing the vmbus channels leaves the VM is left in a bad state. There are a lot of vmbus disk errors. The VM is unable to shutdown when 'Stop-VM' is run.
    • But when shutdown_mana() would take too long, the operation was getting cancelled due to timeout, which also left the VM in a bad state.

Tested this privately on a lab machine, where I could tweak the Timeout duration to force it.

Testing with a 1 ms duration shows the failure path

The vm sees a lot of disk errors on the vmbus. And Stop-Vm gets stuck.

Output from ohcldiag:

Invoking VTL2 reload.
ERROR: ErrorCode: 32768
ERROR: Message: 'UhVm' failed to reload management VTL at stage: SaveManagementVtlState with error code 0xC0370802 (Management VTL failed to save itself.). (Virtual machine ID 97512202-0EC0-4DD7-8FDF-514461B84936). Management VTL image file: 'custom'. Old management VTL image version: '0.0.0.0'. New management VTL image version: '0.0.0.0'. Servicing type: Explicit. Correlation ID: '3F112AA2-9BAC-421D-BE80-676FC979BEE2'.
ERROR: Failed to reload VTL2. Error = 10
ERROR: Operation failed.

Output from kmsg log:

[170.577773] mana_driver::resources: ERROR  leaking resources
[170.583039] underhill_core::worker: ERROR servicing_save_vtl2{ correlation_id=3d597f39-748f-467b-b022-b37062ca2a20}:shutdown_mana:  Error closing vmbus channels: cancelled waiting for vmbus channel close<0a><0a>Caused by:<0a>    deadline exceeded<0a><0a>Stack backtrace:<0a>
[170.589203] underhill_core::emuplat::netvsp: ERROR  Failed to disconnect endpoint err=channel closed
[171.369326] underhill_core::dispatch: ERROR  error while handling servicing error=cancelled waiting for vmbus channel close: deadline exceeded

Output on the VM:

ubuntu2004-server login: [   65.607844] blk_update_request: I/O error, dev sda, sector 36771256 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
[   69.667717] blk_update_request: I/O error, dev sda, sector 3338456 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[   69.768075] Buffer I/O error on device dm-0, logical block 23579

Testing with a 1 second duration demonstrates the normal success

Output from ohcldiag:

Invoking VTL2 reload.
Reloading VTL2 completed without any errors.

Output from kmsg log:

[103.120834] state_unit: INFO servicing_save_vtl2{ correlation_id=fa2b372b-d10c-41fa-bdc2-ec74692653ac}:save_units:state_change{ operation="save"}:  state change complete duration=25.74455ms
[103.125589] vmbus_server: INFO  revoking channel offer_id=OfferId(0)
[103.125777] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xa
[103.126055] vmbus_server: INFO  revoking channel offer_id=OfferId(e)
[103.126211] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xf
[103.126445] vmbus_server: INFO  revoking channel offer_id=OfferId(f)
[103.126584] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x10
[103.126831] vmbus_server: INFO  revoking channel offer_id=OfferId(10)
[103.126980] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x11
[103.139154] underhill_core::emuplat::netvsp: INFO  Network endpoint disconnected

erfrimod avatar Jan 28 '25 19:01 erfrimod