
doc: [sgx] theoretically, the screencast of the SGX demo is not replicable

Open ttzeng opened this issue 3 years ago • 17 comments

Though I don't have the hardware platform to try this out, it looks like some files are missing, which prevents screencast-sgx.sh from running successfully. Specifically:

  • In screen 6, users can save the Docker images from the local registry to files, but this must be done in advance; otherwise there are no sgx-aesmd.tar and sgx-demo.tar files to load (a sketch of how this could be done is below).
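For reference, a hedged sketch of saving locally built demo images into those archives in advance; the image names and tags here are assumptions, not taken from the demo script:

docker save intel/sgx-aesmd-demo:devel -o sgx-aesmd.tar   # assumed image name/tag
docker save intel/sgx-sdk-demo:devel -o sgx-demo.tar      # assumed image name/tag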

ttzeng avatar Apr 16 '22 21:04 ttzeng

Hello @ttzeng. I think there are many more problems with keeping the screencast in sync with the current state of the project.

  • There is no "master" branch in the repo any more, but the screencast still refers to it.
  • I'm not sure the node selector in the samples is valid; at least I didn't see intel.feature.node.kubernetes.io/sgx on my nodes, only feature.node.kubernetes.io/intel.sgx. Maybe it was my fault (due to a version mismatch), but anyway, on to the next point.
  • All versions in the screencast are "latest"; for reproducibility over time they should be pinned to specific releases.
  • The deployment scenario for NFD has changed.
  • I could not figure out how the current NFD deployment instructions involve overlays/epc-nfd.
  • It would be nice if the screencast showed how the node description changes after NFD is deployed (or better, how every step affects the node/cluster configuration, rather than just showing that some pod is running).
  • I also think the first steps of pulling images locally should be skipped, since it is already possible to use the production versions.

dbolshak avatar May 05 '22 08:05 dbolshak

@dbolshak thanks for the detailed feedback. I haven't paid too much attention to the script(s), but it looks like they are useful to many. I'll get at least this issue fixed at some point.

Is there something blocking you from setting things up?

mythi avatar May 05 '22 11:05 mythi

@mythi Thanks for the feedback and for the screencast. Don't get me wrong, it's very helpful (at least for me), and many thanks for your effort.

So far I don't have any blockers with the instructions or screencast; it looks like I was able to work through every issue I faced. But I think for people without prior Kubernetes and SGX knowledge it's almost impossible.

Unfortunately, I didn't notice at the beginning that it requires a CPU with Flexible Launch Control support (or it's not mentioned in the docs), so now I am waiting for another server to repeat all the steps.

But I have a question which is not directly related to the issues in my first message. Could you please point me to documentation covering remote attestation of an application running inside a pod on an SGX-aware k8s cluster using sgx-caching-service, or is it not Kubernetes specific?

dbolshak avatar May 05 '22 12:05 dbolshak

Unfortunately, I didn't notice at the beginning that it requires a CPU with Flexible Launch Control support (or it's not mentioned in the docs), so now I am waiting for another server to repeat all the steps.

Yeah, this is the requirement in the upstream driver. Our plugin only recognizes the device nodes provided by that driver. FLC is mentioned in the very first sentence of our README.

But I have a question which is not directly related to the issues in my first message. Could you please point me to documentation covering remote attestation of an application running inside a pod on an SGX-aware k8s cluster using sgx-caching-service, or is it not Kubernetes specific?

It's not Kubernetes specific as such, but we have some helpers that could be used. "PCCS" normally runs somewhere in the datacenter and serves connections from the SGX quote provider library (the "default" one is provided by Intel; Azure has a "dcap client", for example). This library needs to be configured with the network address of that PCCS. The screencast and our sample deployments give examples of how to configure the provider library when it's used by "aesmd" (for "out-of-proc" quote generation) and by the app itself (for "in-proc" quote generation).
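As an illustration, a minimal sketch of pointing the default quote provider library at a PCCS instance; the file path and key names follow Intel's DCAP QCNL packaging, and the URL/port values are assumptions for a local setup:

# /etc/sgx_default_qcnl.conf (sketch; values are assumptions for a local PCCS)
PCCS_URL=https://localhost:8081/sgx/certification/v3/
# test-only setting to accept a self-signed PCCS certificate
USE_SECURE_CERT=FALSE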

In the screencast I have PCCS (sgx-caching-service) running in a single-node k8s cluster serving localhost connections, so the pods just use hostNetwork: true, but normally you'd have PCCS running somewhere else.
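As a sketch, that single-node shortcut is just the workload pod opting into the host network namespace so localhost reaches the PCCS; the name and image below are placeholders, not the demo manifests:

apiVersion: v1
kind: Pod
metadata:
  name: sgx-demo                 # placeholder name
spec:
  hostNetwork: true              # pod shares the node's network, so localhost is the node's PCCS
  containers:
  - name: demo
    image: example/sgx-demo      # placeholder image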

DCAP attestation overview doc is here.

mythi avatar May 06 '22 04:05 mythi

@mythi Hello,

I've repeated all my steps on hardware with SGX2 support, and I have similar issues as before.

I think that the problem is somewhere in NFD. I deployed NFD with the following two commands:

kubectl apply -k https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/sgx?ref=v0.24.0
kubectl apply -k https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=v0.24.0

I see the two expected pods, nfd-worker and nfd-master, and there is also a NodeFeatureRule (intel-dp-devices), but I cannot find the expected labels:

kubectl get no -o json | jq .items[].metadata.labels |grep intel.feature.node.kubernetes.io/dlb
  "intel.feature.node.kubernetes.io/dlb": "true",

What I have is

  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "k3s",
  "beta.kubernetes.io/os": "linux",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.GFNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTCMT": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTL3CA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBA": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMBM": "true",
  "feature.node.kubernetes.io/cpu-rdt.RDTMON": "true",
  "feature.node.kubernetes.io/cpu-sgx.enabled": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.4.0-109-generic",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "4",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/pci-0300_1a03.present": "true",
  "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "ubuntu",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "18.04",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "18",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "04",
  "feature.node.kubernetes.io/usb-ef_0b1f_03ee.present": "true",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "euclid-4",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "true",
  "node-role.kubernetes.io/master": "true",
  "node.kubernetes.io/instance-type": "k3s"

NFD worker's log:

I0511 15:15:16.850684       1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
I0511 15:15:16.850826       1 nfd-worker.go:156] NodeName: 'euclid-4'
I0511 15:15:16.851406       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0511 15:15:16.851541       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0511 15:15:16.851599       1 base.go:126] connecting to nfd-master at nfd-master:8080 ...
I0511 15:15:16.851656       1 component.go:36] [core]parsed scheme: ""
I0511 15:15:16.851717       1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I0511 15:15:16.851766       1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nfd-master:8080  <nil> 0 <nil>}] <nil> <nil>}
I0511 15:15:16.851793       1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I0511 15:15:16.851804       1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I0511 15:15:16.851852       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0511 15:15:16.851914       1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0511 15:15:16.852219       1 component.go:36] [core]Channel Connectivity change to CONNECTING
I0511 15:15:16.854360       1 component.go:36] [core]Subchannel Connectivity change to READY
I0511 15:15:16.854409       1 component.go:36] [core]Channel Connectivity change to READY
E0511 15:15:16.887454       1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/eth2/speed: invalid argument
I0511 15:15:16.923914       1 nfd-worker.go:472] starting feature discovery...
I0511 15:15:16.924606       1 nfd-worker.go:484] feature discovery completed
I0511 15:15:16.924629       1 nfd-worker.go:565] sending labeling request to nfd-master

I checked the source code of network.go, and the issue with the network interface should not be a real problem.

To check the capabilities of my platform I use https://github.com/ayeks/SGX-hardware.git, and its output is:

eax: 606a6 ebx: 25400800 ecx: 7ffefbff edx: bfebfbff
stepping 6
model 10
family 6
processor type 0
extended model 6
extended family 0
smx: 1

Extended feature bits (EAX=07H, ECX=0H)
eax: 0 ebx: f3bfb7ef ecx: 40417f5e edx: bc040412
sgx available: 1
sgx launch control: 1

CPUID Leaf 12H, Sub-Leaf 0 of Intel SGX Capabilities (EAX=12H,ECX=0)
eax: 403 ebx: 1 ecx: 0 edx: 381f
sgx 1 supported: 1
sgx 2 supported: 1
MaxEnclaveSize_Not64: 1f
MaxEnclaveSize_64: 38

CPUID Leaf 12H, Sub-Leaf 1 of Intel SGX Capabilities (EAX=12H,ECX=1)
eax: b6 ebx: 0 ecx: 2e7 edx: 0

CPUID Leaf 12H, Sub-Leaf 2 of Intel SGX Capabilities (EAX=12H,ECX=2)
eax: c00001 ebx: 20 ecx: 7ec00002 edx: 0
size of EPC section in Processor Reserved Memory, 2028 M

CPUID Leaf 12H, Sub-Leaf 3 of Intel SGX Capabilities (EAX=12H,ECX=3)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

CPUID Leaf 12H, Sub-Leaf 4 of Intel SGX Capabilities (EAX=12H,ECX=4)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

CPUID Leaf 12H, Sub-Leaf 5 of Intel SGX Capabilities (EAX=12H,ECX=5)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

CPUID Leaf 12H, Sub-Leaf 6 of Intel SGX Capabilities (EAX=12H,ECX=6)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

CPUID Leaf 12H, Sub-Leaf 7 of Intel SGX Capabilities (EAX=12H,ECX=7)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

CPUID Leaf 12H, Sub-Leaf 8 of Intel SGX Capabilities (EAX=12H,ECX=8)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

CPUID Leaf 12H, Sub-Leaf 9 of Intel SGX Capabilities (EAX=12H,ECX=9)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M

dbolshak avatar May 11 '22 15:05 dbolshak

I think that the problem is somewhere in NFD.

@dbolshak I can see you have:

"feature.node.kubernetes.io/cpu-sgx.enabled": "true", which suggests the cpuid bits are OK and the BIOS has enabled SGX.

The full set of rules is:

- feature: cpu.cpuid
  matchExpressions:
    SGX: {op: Exists}
    SGXLC: {op: Exists}
- feature: cpu.sgx
  matchExpressions:
    enabled: {op: IsTrue}
- feature: kernel.config
  matchExpressions:
    X86_SGX: {op: Exists}

Do you have the in-tree driver (i.e., CONFIG_X86_SGX=y) enabled?

mythi avatar May 11 '22 16:05 mythi

lsmod | grep sgx gives the following

intel_sgx              32768  0

dbolshak avatar May 11 '22 16:05 dbolshak

but cat /boot/config-5.4.0-109-generic | grep -i sgx gives empty output

dbolshak avatar May 11 '22 16:05 dbolshak

but cat /boot/config-5.4.0-109-generic | grep -i sgx gives empty output

That's the problem. A quick fix is to kubectl edit nodefeaturerule intel-dp-devices and drop that kernel.config match.
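For illustration, a sketch of what the rule's match list could look like after that edit, based on the rules quoted above (this is the workaround, not the upstream file):

- feature: cpu.cpuid
  matchExpressions:
    SGX: {op: Exists}
    SGXLC: {op: Exists}
- feature: cpu.sgx
  matchExpressions:
    enabled: {op: IsTrue}
# the kernel.config / X86_SGX match is dropped as a workaround for kernels
# whose config file does not expose CONFIG_X86_SGX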

mythi avatar May 11 '22 16:05 mythi

Thanks for the quick response and the provided workaround; I will test it.

But I see that I need kernel version 5.11 or above, so after testing the workaround I will check the full fix.

dbolshak avatar May 11 '22 16:05 dbolshak

@mythi

Hello,

I feel terrible for disturbing you again and again, but I am so close to success!

My current problem appears at the following step:

kubectl apply -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml -n sgx-ecdsa-quote

If it succeeds, I expect to see an sgxdeviceplugin-sample pod in the sgx-ecdsa-quote namespace, but I don't.

Of course I have the desired namespace and all the necessary labels on my node:

kubectl describe node | grep -i sgx
                    feature.node.kubernetes.io/cpu-sgx.enabled=true
                    intel.feature.node.kubernetes.io/sgx=true
                    nfd.node.kubernetes.io/extended-resources: sgx.intel.com/epc
  sgx.intel.com/enclave:    10
  sgx.intel.com/epc:        2126512128
  sgx.intel.com/provision:  10
  sgx.intel.com/enclave:    10
  sgx.intel.com/epc:        2126512128
  sgx.intel.com/provision:  10
  inteldeviceplugins-system   intel-sgx-plugin-4pdrg                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         58m
  sgx.intel.com/enclave    0           0
  sgx.intel.com/epc        0           0
  sgx.intel.com/provision  0           0

What I've done to debug so far: I checked the deployed sgxdeviceplugin, and it looks like this:

kubectl describe sgxdeviceplugin
Name:         sgxdeviceplugin-sample
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  deviceplugin.intel.com/v1
Kind:         SgxDevicePlugin
Metadata:
  Creation Timestamp:  2022-05-12T15:07:17Z
  Generation:          1
  Managed Fields:
    API Version:  deviceplugin.intel.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:enclaveLimit:
        f:image:
        f:initImage:
        f:logLevel:
        f:nodeSelector:
          .:
          f:intel.feature.node.kubernetes.io/sgx:
        f:provisionLimit:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-05-12T15:07:17Z
    API Version:  deviceplugin.intel.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:controlledDaemonSet:
          .:
          f:apiVersion:
          f:kind:
          f:name:
          f:namespace:
          f:resourceVersion:
          f:uid:
        f:desiredNumberScheduled:
        f:nodeNames:
        f:numberReady:
    Manager:         intel_deviceplugin_operator
    Operation:       Update
    Subresource:     status
    Time:            2022-05-12T15:07:18Z
  Resource Version:  5710
  UID:               ceda107e-4f6b-44a0-8c38-13bc05e85567
Spec:
  Enclave Limit:  10
  Image:          intel/intel-sgx-plugin:0.24.0
  Init Image:     intel/intel-sgx-initcontainer:0.24.0
  Log Level:      4
  Node Selector:
    intel.feature.node.kubernetes.io/sgx:  true
  Provision Limit:                         10
Status:
  Controlled Daemon Set:
    API Version:             apps/v1
    Kind:                    DaemonSet
    Name:                    intel-sgx-plugin
    Namespace:               inteldeviceplugins-system
    Resource Version:        5688
    UID:                     98078e21-abf7-4059-b761-d91fb51412f6
  Desired Number Scheduled:  1
  Node Names:
    euclid-4
  Number Ready:  1
Events:          <none>

I see no events here, and there are no issues with the controlled daemon set. The pod intel-sgx-plugin-4pdrg in the inteldeviceplugins-system namespace works fine.

But the logs of the operator pod (inteldeviceplugins-controller-manager in the inteldeviceplugins-system namespace) have the following lines:

I0512 15:01:28.422447       1 reconciler.go:231] "intel-device-plugins-manager/controller/sgxdeviceplugin: " reconciler group="deviceplugin.intel.com" reconciler kind="SgxDevicePlugin" name="sgxdeviceplugin-sample" namespace="" ="(MISSING)"
I0512 15:01:28.503111       1 reconciler.go:231] "intel-device-plugins-manager/controller/sgxdeviceplugin: " reconciler group="deviceplugin.intel.com" reconciler kind="SgxDevicePlugin" name="sgxdeviceplugin-sample" namespace="" ="(MISSING)"

It's not clear why the namespace is empty. I also see another issue with the namespace (they are probably related to each other): if I remove the device plugin with kubectl delete -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml -n sgx-ecdsa-quote, I see the warning warning: deleting cluster-scoped resources, not scoped to the provided namespace, but it does not affect the result of the operation and the plugin is deleted without any issues.

My only hope is your help.

I use v0.24.0 and my Kubernetes cluster is v1.24.0.

dbolshak avatar May 12 '22 16:05 dbolshak

It's not clear why the namespace is empty

@dbolshak it looks like you have everything correct. SgxDevicePlugin is a cluster-scoped resource, so just deploy it without the namespace (-n sgx-ecdsa-quote), for example:
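The same manifest as above, applied without the -n flag:

kubectl apply -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml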

mythi avatar May 12 '22 17:05 mythi

OK, I've deployed it without specifying the namespace, but it changes nothing. I still don't see sgxdeviceplugin-sample, and the events section in the output of the following command is empty: kubectl describe sgxdeviceplugin sgxdeviceplugin-sample

...
....
Spec:
  Enclave Limit:  110
  Image:          intel/intel-sgx-plugin:0.24.0
  Init Image:     intel/intel-sgx-initcontainer:0.24.0
  Log Level:      4
  Node Selector:
    intel.feature.node.kubernetes.io/sgx:  true
  Provision Limit:                         110
Status:
  Controlled Daemon Set:
    API Version:             apps/v1
    Kind:                    DaemonSet
    Name:                    intel-sgx-plugin
    Namespace:               inteldeviceplugins-system
    Resource Version:        80318
    UID:                     e80dc977-8956-424f-9f27-151cd987a144
  Desired Number Scheduled:  1
  Node Names:
    euclid-4
  Number Ready:  1
Events:          <none>

dbolshak avatar May 13 '22 06:05 dbolshak

I still don't see sgxdeviceplugin-sample, and the events section in the output of the following command is empty

I don't think we report events for xDevicePlugins objects today so empty Events is expected AFAICS. Other than that, is there something else missing?

mythi avatar May 13 '22 08:05 mythi

The screencast shows that a pod named sgxdeviceplugin-sample should appear; I don't have it.

dbolshak avatar May 13 '22 08:05 dbolshak

The screencast shows that a pod named sgxdeviceplugin-sample should appear; I don't have it.

I see. That posted video seems to be outdated, but the latest "script" with up-to-date information is in our demo folder. With ./screencast-play.sh play it runs you through the steps (beware of some of the problems reported in this issue, though).

The video shows the plugin running in that namespace from when SgxDevicePlugin was still namespace scoped, but we have since moved it to cluster scoped.
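One way to confirm the scope is a generic kubectl check (not specific to this project's docs):

# the NAMESPACED column should show "false" for a cluster-scoped resource
kubectl api-resources | grep -i sgxdeviceplugin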

Apologies for some of the outdated information.

mythi avatar May 13 '22 10:05 mythi

Thanks for the clarification!

dbolshak avatar May 16 '22 07:05 dbolshak