
[BUG] AKS Confidential Computing bugs when creating pod using my images

Open blossomin opened this issue 1 year ago • 1 comments

Describe the bug: I want to use AKS confidential computing for my tasks. When I create pods from my own images, pod creation fails; if I replace the image in the Kubernetes YAML file with a different one, the pod launches. I collected the kata and containerd debug logs here for debugging: https://github.com/blossomin/akslog

In these logs:

The log file name suffix identifies the image: myworker means my own worker image was used, and mcr-pytorch means "mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64" was used.

Comparing the kata logs, I found one statement that exists only in worker_error_kata_myworker.log: "cloud-hypervisor: 11.942990s: ERROR:virtio-devices/src/block.rs:814 -- failed to create new AsyncIo: Failed creating a new AsyncIo: Resource temporarily unavailable (os error 11)". It seems the kata-agent dies after this, since the statement is followed by many "KILL_EVENT received, stopping epoll loop" messages. I am not sure about this, though, because the kata log from mcr-pytorch also contains this KILL_EVENT message.
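The log comparison described above can be sketched in a few lines of Python. This is only an illustration of the approach, not the tooling actually used: the helper names `normalise` and `unique_messages` are hypothetical, and the rule for stripping volatile fields (timestamps and bare numbers) is an assumption about what differs between two otherwise identical runs.

```python
import re

# Assumption: timestamps such as "11.942990s" and standalone numbers are the
# only volatile parts of a log line, so they are replaced by a placeholder.
_VOLATILE = re.compile(r"\d+\.\d+s|\b\d+\b")


def normalise(line: str) -> str:
    """Replace volatile fields with '#' so equal messages compare equal."""
    return _VOLATILE.sub("#", line).strip()


def unique_messages(candidate_lines, baseline_lines):
    """Return candidate lines whose normalised form never occurs in baseline.

    Each distinct message is reported once, in order of first appearance.
    """
    seen_in_baseline = {normalise(line) for line in baseline_lines}
    already_reported = set()
    result = []
    for line in candidate_lines:
        key = normalise(line)
        if key and key not in seen_in_baseline and key not in already_reported:
            already_reported.add(key)
            result.append(line.rstrip("\n"))
    return result
```

To use it, read the failing kata log and the working one into lists of lines (for example with `open(path).readlines()`) and pass the failing log as `candidate_lines`. Messages that appear in both runs, such as the KILL_EVENT line, drop out, which leaves only the statements specific to the failing pod.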

To Reproduce: Steps to reproduce the behavior: my image is currently internal, so this is hard to reproduce.

Expected behavior: The pod can be launched without any problems using any image.

blossomin avatar Sep 23 '24 08:09 blossomin

@agowdamsft would you be able to assist?

One follow-up: my worker image has 40 layers, while "mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64" has about 13 layers.

My guess at the cause:

  1. each layer is mapped to one virtio-pci device,
  2. there are only 31 PCI slots per (confidential) VM.

Together, these cause resource contention/shortage.
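The arithmetic behind this guess can be written as a small check. The sketch below only encodes the two assumptions stated above (one virtio-pci device per layer, 31 PCI slots per VM); neither is confirmed behaviour, and the names `ASSUMED_PCI_SLOTS`, `count_layers`, and `layers_fit` are hypothetical. `count_layers` expects an image manifest already parsed into a dict, with the top-level "layers" list used by OCI and Docker v2 image manifests.

```python
# Assumption from the guess above, not a documented limit.
ASSUMED_PCI_SLOTS = 31


def count_layers(manifest: dict) -> int:
    """Count the entries in the manifest's top-level "layers" list."""
    return len(manifest.get("layers", []))


def layers_fit(layer_count: int, other_devices: int = 0,
               slots: int = ASSUMED_PCI_SLOTS) -> bool:
    """Return True if one device per layer, plus any other devices,
    fits within the assumed number of PCI slots."""
    return layer_count + other_devices <= slots


if __name__ == "__main__":
    # Layer counts reported in this issue: 40 (own image) and about 13 (MCR image).
    for name, layers in (("myworker", 40), ("mcr-pytorch", 13)):
        print(name, layers, "fits" if layers_fit(layers) else "exceeds assumed slots")
```

Under these assumptions the 40-layer image exceeds the slot budget while the 13-layer image does not, which matches the observed behaviour. If the guess holds, reducing the number of layers in the image would be one way to test it.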

blossomin avatar Sep 26 '24 14:09 blossomin

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @angarg05

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. blossomin, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.