openvmm
openvmm copied to clipboard
Adding cancelcontext to closing vmbus channels
shutdown_vf_devices(), the method where nic_shutdown() is called, doesn't have a CancelContext. The wait for vmbus operations during unload_for_servicing() is unbounded and we have seen shutdown_mana() wait indefinitely.
shutdown_pci_devices() returns a ShutdownError. Adding similar logic to shutdown_vf_devices() to return an error so we know where in shutdown_mana() the delay is coming from.
- CancelContext added.
- Timouts return a "deadline exceeded" error that gets traced. unload_for_servicing() proceeds.
- Completing shutdown_vf_devices() without closing the vmbus channels leaves the VM is left in a bad state. There are a lot of vmbus disk errors. The VM is unable to shutdown when 'Stop-VM' is run.
- But when shutdown_mana() would take too long, the operation was getting cancelled due to timeout, which also left the VM in a bad state.
Tested this privately on a lab machine, where I could tweak the Timeout duration to force it.
Testing with a 1 ms duration shows the failure path
The vm sees a lot of disk errors on the vmbus. And Stop-Vm gets stuck.
Output from ohcldiag:
Invoking VTL2 reload.
ERROR: ErrorCode: 32768
ERROR: Message: 'UhVm' failed to reload management VTL at stage: SaveManagementVtlState with error code 0xC0370802 (Management VTL failed to save itself.). (Virtual machine ID 97512202-0EC0-4DD7-8FDF-514461B84936). Management VTL image file: 'custom'. Old management VTL image version: '0.0.0.0'. New management VTL image version: '0.0.0.0'. Servicing type: Explicit. Correlation ID: '3F112AA2-9BAC-421D-BE80-676FC979BEE2'.
ERROR: Failed to reload VTL2. Error = 10
ERROR: Operation failed.
Output from kmsg log:
[170.577773] mana_driver::resources: ERROR leaking resources
[170.583039] underhill_core::worker: ERROR servicing_save_vtl2{ correlation_id=3d597f39-748f-467b-b022-b37062ca2a20}:shutdown_mana: Error closing vmbus channels: cancelled waiting for vmbus channel close<0a><0a>Caused by:<0a> deadline exceeded<0a><0a>Stack backtrace:<0a>
[170.589203] underhill_core::emuplat::netvsp: ERROR Failed to disconnect endpoint err=channel closed
[171.369326] underhill_core::dispatch: ERROR error while handling servicing error=cancelled waiting for vmbus channel close: deadline exceeded
Output on the VM:
ubuntu2004-server login: [ 65.607844] blk_update_request: I/O error, dev sda, sector 36771256 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
[ 69.667717] blk_update_request: I/O error, dev sda, sector 3338456 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 69.768075] Buffer I/O error on device dm-0, logical block 23579
Testing with a 1 second duration demonstrates the normal success
Output from ohcldiag:
Invoking VTL2 reload.
Reloading VTL2 completed without any errors.
Output from kmsg log:
[103.120834] state_unit: INFO servicing_save_vtl2{ correlation_id=fa2b372b-d10c-41fa-bdc2-ec74692653ac}:save_units:state_change{ operation="save"}: state change complete duration=25.74455ms
[103.125589] vmbus_server: INFO revoking channel offer_id=OfferId(0)
[103.125777] vmbus_server::channels: INFO rescinding channel from guest channel_id=0xa
[103.126055] vmbus_server: INFO revoking channel offer_id=OfferId(e)
[103.126211] vmbus_server::channels: INFO rescinding channel from guest channel_id=0xf
[103.126445] vmbus_server: INFO revoking channel offer_id=OfferId(f)
[103.126584] vmbus_server::channels: INFO rescinding channel from guest channel_id=0x10
[103.126831] vmbus_server: INFO revoking channel offer_id=OfferId(10)
[103.126980] vmbus_server::channels: INFO rescinding channel from guest channel_id=0x11
[103.139154] underhill_core::emuplat::netvsp: INFO Network endpoint disconnected