intel-device-plugins-for-kubernetes
doc: [sgx] theoretically, the screencast of the SGX demo is not replicable
Though I don't have the hardware platform to try this out, it looks like some files are missing, which prevents screencast-sgx.sh from running successfully. Specifically:
- In screen6, I know users can save the docker images in the local registry to files, but this must be done in advance; otherwise there will be no sgx-aesmd.tar and sgx-demo.tar to be loaded (see the sketch below).
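For reference, a minimal sketch of producing those archives in advance. The image names and tags here are placeholders; the real ones come from the demo scripts:

```bash
# Hypothetical image names/tags; check screencast-sgx.sh for the actual ones.
docker save -o sgx-aesmd.tar intel/sgx-aesmd-demo:latest
docker save -o sgx-demo.tar intel/sgx-sdk-demo:latest
```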
Hello @ttzeng. I think there are more problems with keeping the screencast in sync with the current state of the project:
- There is no "master" branch in the repo any more, but the screencast refers to it.
- I'm not sure the node selector in the samples is valid; at least I didn't see intel.feature.node.kubernetes.io/sgx on my nodes, but there was feature.node.kubernetes.io/intel.sgx. Maybe it was my fault (due to a versions mismatch), but either way, on to the next point.
- All versions in the screencast are the latest; for reproducibility over time they should be pinned to specific ones.
- The deployment scenario for NFD was changed.
- I could not figure out how the current NFD deployment instructions involve overlays/epc-nfd.
- It would be nice to see in the screencast how the node description changes after NFD is deployed (better still, to show how every step affects the node/cluster configuration, not just that some pod is running); see the sketch after this list.
- And I think the first steps of pulling images locally should be skipped, since it's already possible to use the production versions.
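For illustration, a sketch of inspecting the node description before and after deploying NFD, using the same standard kubectl/jq commands that appear later in this thread:

```bash
# Dump all node labels; after NFD is deployed, SGX-capable nodes should
# gain labels such as feature.node.kubernetes.io/cpu-sgx.enabled=true.
kubectl get nodes -o json | jq '.items[].metadata.labels'

# Or narrow the view to SGX-related entries only.
kubectl describe node | grep -i sgx
```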
@dbolshak thanks for the detailed feedback. I haven't paid too much attention to the script(s), but it looks like they are useful to many. I'll get at least this issue fixed at some point.
Is there something blocking you from setting things up?
@mythi Thanks for the feedback and for the screencast. Don't get me wrong, it's very helpful (at least for me), and many thanks for your effort.
So far I don't have any blockers with the instructions or the screencast; at least it looks like I was able to manage all the issues I faced. But I think that for people without prior Kubernetes and SGX knowledge it's almost impossible.
Unfortunately, I didn't notice at the beginning that it requires a CPU with Flexible Launch Control support (or it's not mentioned in the docs), so now I am awaiting another server to repeat all the steps.
But I have a question which is not directly related to my issues in the first message. Could you please point me to documentation which covers remote attestation of an application running inside a pod on an SGX-aware k8s cluster using sgx-caching-service, or is it not Kubernetes specific?
> Unfortunately, I didn't notice at the beginning that it requires a CPU with Flexible Launch Control support (or it's not mentioned in the docs), so now I am awaiting another server to repeat all the steps.
Yeah, this is the requirement in the upstream driver. Our plugin only recognizes the device nodes provided by that driver. FLC is mentioned in the very first sentence of our README.
> But I have a question which is not directly related to my issues in the first message. Could you please point me to documentation which covers remote attestation of an application running inside a pod on an SGX-aware k8s cluster using sgx-caching-service, or is it not Kubernetes specific?
It's not Kubernetes specific as such, but we have some helpers that could be used. "PCCS" normally runs somewhere in the datacenter and serves connections from the SGX quote provider library (the "default" one is provided by Intel; Azure has a "dcap client", for example). This library needs to be configured with the network address of that PCCS. The screencast and our sample deployments give examples of how to configure the provider library when it's used by "aesmd" (for "out-of-proc" quote generation) and by the app itself (for "in-proc" quote generation).
In the screencast I have PCCS (sgx-caching-service) in a single-node k8s cluster serving localhost connections, so the pods just use hostNetwork: true, but normally you'd have PCCS running somewhere else.
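To make that concrete, a minimal sketch of pointing the quote provider library at a PCCS. This assumes the plain key=value format of Intel's DCAP QCNL configuration file; the PCCS address and the API version in the URL are placeholders for your deployment:

```bash
# Tell the quote provider lib (used by aesmd or by the app itself) where PCCS is.
# pccs.example.com is a placeholder; USE_SECURE_CERT=FALSE is common in demos
# where PCCS uses a self-signed certificate.
cat <<'EOF' | sudo tee /etc/sgx_default_qcnl.conf
PCCS_URL=https://pccs.example.com:8081/sgx/certification/v3/
USE_SECURE_CERT=FALSE
EOF
```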
DCAP attestation overview doc is here.
@mythi Hello,
I've repeated all my steps on hardware with SGX2 support, and I have similar issues as before.
I think that the problem is somewhere in NFD. I deployed NFD with the following two commands:
kubectl apply -k https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/sgx?ref=v0.24.0
kubectl apply -k https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=v0.24.0
I see the two expected pods, nfd-worker and nfd-master, and there is a NodeFeatureRule (intel-dp-devices), but I cannot find the expected labels:
kubectl get no -o json | jq .items[].metadata.labels | grep intel.feature.node.kubernetes.io/dlb
"intel.feature.node.kubernetes.io/dlb": "true",
What I have is
"beta.kubernetes.io/arch": "amd64",
"beta.kubernetes.io/instance-type": "k3s",
"beta.kubernetes.io/os": "linux",
"feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
"feature.node.kubernetes.io/cpu-cpuid.GFNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
"feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
"feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
"feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
"feature.node.kubernetes.io/cpu-hardware_multithreading": "true",
"feature.node.kubernetes.io/cpu-rdt.RDTCMT": "true",
"feature.node.kubernetes.io/cpu-rdt.RDTL3CA": "true",
"feature.node.kubernetes.io/cpu-rdt.RDTMBA": "true",
"feature.node.kubernetes.io/cpu-rdt.RDTMBM": "true",
"feature.node.kubernetes.io/cpu-rdt.RDTMON": "true",
"feature.node.kubernetes.io/cpu-sgx.enabled": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE": "true",
"feature.node.kubernetes.io/kernel-version.full": "5.4.0-109-generic",
"feature.node.kubernetes.io/kernel-version.major": "5",
"feature.node.kubernetes.io/kernel-version.minor": "4",
"feature.node.kubernetes.io/kernel-version.revision": "0",
"feature.node.kubernetes.io/pci-0300_1a03.present": "true",
"feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
"feature.node.kubernetes.io/system-os_release.ID": "ubuntu",
"feature.node.kubernetes.io/system-os_release.VERSION_ID": "18.04",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "18",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "04",
"feature.node.kubernetes.io/usb-ef_0b1f_03ee.present": "true",
"kubernetes.io/arch": "amd64",
"kubernetes.io/hostname": "euclid-4",
"kubernetes.io/os": "linux",
"node-role.kubernetes.io/control-plane": "true",
"node-role.kubernetes.io/master": "true",
"node.kubernetes.io/instance-type": "k3s"
NFD worker's log:
I0511 15:15:16.850684 1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
I0511 15:15:16.850826 1 nfd-worker.go:156] NodeName: 'euclid-4'
I0511 15:15:16.851406 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0511 15:15:16.851541 1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0511 15:15:16.851599 1 base.go:126] connecting to nfd-master at nfd-master:8080 ...
I0511 15:15:16.851656 1 component.go:36] [core]parsed scheme: ""
I0511 15:15:16.851717 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I0511 15:15:16.851766 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nfd-master:8080 <nil> 0 <nil>}] <nil> <nil>}
I0511 15:15:16.851793 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I0511 15:15:16.851804 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I0511 15:15:16.851852 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0511 15:15:16.851914 1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0511 15:15:16.852219 1 component.go:36] [core]Channel Connectivity change to CONNECTING
I0511 15:15:16.854360 1 component.go:36] [core]Subchannel Connectivity change to READY
I0511 15:15:16.854409 1 component.go:36] [core]Channel Connectivity change to READY
E0511 15:15:16.887454 1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/eth2/speed: invalid argument
I0511 15:15:16.923914 1 nfd-worker.go:472] starting feature discovery...
I0511 15:15:16.924606 1 nfd-worker.go:484] feature discovery completed
I0511 15:15:16.924629 1 nfd-worker.go:565] sending labeling request to nfd-master
I checked the source code of network.go, and the issue with the network interface speed should not be a real problem.
To check the capabilities of my platform I use
https://github.com/ayeks/SGX-hardware.git
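A sketch of building and running it, assuming the test-sgx.c source file that the repository provides:

```bash
# Build the CPUID-based SGX capability checker and run it on the host.
git clone https://github.com/ayeks/SGX-hardware.git
cd SGX-hardware
gcc test-sgx.c -o test-sgx
./test-sgx
```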
And its output is
eax: 606a6 ebx: 25400800 ecx: 7ffefbff edx: bfebfbff
stepping 6
model 10
family 6
processor type 0
extended model 6
extended family 0
smx: 1
Extended feature bits (EAX=07H, ECX=0H)
eax: 0 ebx: f3bfb7ef ecx: 40417f5e edx: bc040412
sgx available: 1
sgx launch control: 1
CPUID Leaf 12H, Sub-Leaf 0 of Intel SGX Capabilities (EAX=12H,ECX=0)
eax: 403 ebx: 1 ecx: 0 edx: 381f
sgx 1 supported: 1
sgx 2 supported: 1
MaxEnclaveSize_Not64: 1f
MaxEnclaveSize_64: 38
CPUID Leaf 12H, Sub-Leaf 1 of Intel SGX Capabilities (EAX=12H,ECX=1)
eax: b6 ebx: 0 ecx: 2e7 edx: 0
CPUID Leaf 12H, Sub-Leaf 2 of Intel SGX Capabilities (EAX=12H,ECX=2)
eax: c00001 ebx: 20 ecx: 7ec00002 edx: 0
size of EPC section in Processor Reserved Memory, 2028 M
CPUID Leaf 12H, Sub-Leaf 3 of Intel SGX Capabilities (EAX=12H,ECX=3)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
CPUID Leaf 12H, Sub-Leaf 4 of Intel SGX Capabilities (EAX=12H,ECX=4)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
CPUID Leaf 12H, Sub-Leaf 5 of Intel SGX Capabilities (EAX=12H,ECX=5)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
CPUID Leaf 12H, Sub-Leaf 6 of Intel SGX Capabilities (EAX=12H,ECX=6)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
CPUID Leaf 12H, Sub-Leaf 7 of Intel SGX Capabilities (EAX=12H,ECX=7)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
CPUID Leaf 12H, Sub-Leaf 8 of Intel SGX Capabilities (EAX=12H,ECX=8)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
CPUID Leaf 12H, Sub-Leaf 9 of Intel SGX Capabilities (EAX=12H,ECX=9)
eax: 0 ebx: 0 ecx: 0 edx: 0
size of EPC section in Processor Reserved Memory, 0 M
I think that the problem is somewhere in NFD.
@dbolshak I can see you have:
"feature.node.kubernetes.io/cpu-sgx.enabled": "true", which suggests the cpuid bits are OK and the BIOS has enabled SGX.
The full set of rules is:
- feature: cpu.cpuid
  matchExpressions:
    SGX: {op: Exists}
    SGXLC: {op: Exists}
- feature: cpu.sgx
  matchExpressions:
    enabled: {op: IsTrue}
- feature: kernel.config
  matchExpressions:
    X86_SGX: {op: Exists}
Do you have the in-tree driver (i.e., CONFIG_X86_SGX=y) enabled?
lsmod | grep sgx gives the following
intel_sgx 32768 0
but empty output in
cat /boot/config-5.4.0-109-generic | grep -i sgx
> but empty output in
> cat /boot/config-5.4.0-109-generic | grep -i sgx
That's the problem. A quick fix is to kubectl edit nodefeaturerule intel-dp-devices and drop that kernel.config match.
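A sketch of the workaround, with the part of the rule to delete (taken from the full rule set quoted above) shown as comments:

```bash
# Open the NodeFeatureRule for editing...
kubectl edit nodefeaturerule intel-dp-devices
# ...and remove this match from the rule's matchFeatures list, so that the
# SGX label no longer requires CONFIG_X86_SGX to appear in the kernel config:
#   - feature: kernel.config
#     matchExpressions:
#       X86_SGX: {op: Exists}
```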
Thanks for the quick response and the workaround, I will test it.
But I see that I need kernel version 5.11 or above, so after testing the workaround I will check the full fix.
@mythi
Hello,
I feel terrible for doubting you again and again. But I am so close to success!
My current problem appears at the following step:
kubectl apply -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml -n sgx-ecdsa-quote
So, if it succeeds, I expect to see an sgxdeviceplugin-sample pod in the sgx-ecdsa-quote namespace. But I don't.
Of course I have the desired namespace and all the necessary labels on my node:
kubectl describe node | grep -i sgx
feature.node.kubernetes.io/cpu-sgx.enabled=true
intel.feature.node.kubernetes.io/sgx=true
nfd.node.kubernetes.io/extended-resources: sgx.intel.com/epc
sgx.intel.com/enclave: 10
sgx.intel.com/epc: 2126512128
sgx.intel.com/provision: 10
sgx.intel.com/enclave: 10
sgx.intel.com/epc: 2126512128
sgx.intel.com/provision: 10
inteldeviceplugins-system intel-sgx-plugin-4pdrg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 58m
sgx.intel.com/enclave 0 0
sgx.intel.com/epc 0 0
sgx.intel.com/provision 0 0
What I've done to debug so far: I checked the deployed sgxdeviceplugin, and it looks like this
kubectl describe sgxdeviceplugin
Name: sgxdeviceplugin-sample
Namespace:
Labels: <none>
Annotations: <none>
API Version: deviceplugin.intel.com/v1
Kind: SgxDevicePlugin
Metadata:
Creation Timestamp: 2022-05-12T15:07:17Z
Generation: 1
Managed Fields:
API Version: deviceplugin.intel.com/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:enclaveLimit:
f:image:
f:initImage:
f:logLevel:
f:nodeSelector:
.:
f:intel.feature.node.kubernetes.io/sgx:
f:provisionLimit:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-05-12T15:07:17Z
API Version: deviceplugin.intel.com/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:controlledDaemonSet:
.:
f:apiVersion:
f:kind:
f:name:
f:namespace:
f:resourceVersion:
f:uid:
f:desiredNumberScheduled:
f:nodeNames:
f:numberReady:
Manager: intel_deviceplugin_operator
Operation: Update
Subresource: status
Time: 2022-05-12T15:07:18Z
Resource Version: 5710
UID: ceda107e-4f6b-44a0-8c38-13bc05e85567
Spec:
Enclave Limit: 10
Image: intel/intel-sgx-plugin:0.24.0
Init Image: intel/intel-sgx-initcontainer:0.24.0
Log Level: 4
Node Selector:
intel.feature.node.kubernetes.io/sgx: true
Provision Limit: 10
Status:
Controlled Daemon Set:
API Version: apps/v1
Kind: DaemonSet
Name: intel-sgx-plugin
Namespace: inteldeviceplugins-system
Resource Version: 5688
UID: 98078e21-abf7-4059-b761-d91fb51412f6
Desired Number Scheduled: 1
Node Names:
euclid-4
Number Ready: 1
Events: <none>
I see no events here, and there are no issues with the controlled DaemonSet. The pod intel-sgx-plugin-4pdrg in the inteldeviceplugins-system namespace works fine.
But the logs of the operator pod (inteldeviceplugins-controller-manager in the inteldeviceplugins-system namespace) have the following lines:
I0512 15:01:28.422447 1 reconciler.go:231] "intel-device-plugins-manager/controller/sgxdeviceplugin: " reconciler group="deviceplugin.intel.com" reconciler kind="SgxDevicePlugin" name="sgxdeviceplugin-sample" namespace="" ="(MISSING)"
I0512 15:01:28.503111 1 reconciler.go:231] "intel-device-plugins-manager/controller/sgxdeviceplugin: " reconciler group="deviceplugin.intel.com" reconciler kind="SgxDevicePlugin" name="sgxdeviceplugin-sample" namespace="" ="(MISSING)"
It's not clear why the namespace is empty. And I see another issue with namespaces (probably they are related to each other): if I remove the device plugin with kubectl delete -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml -n sgx-ecdsa-quote, I see the warning "warning: deleting cluster-scoped resources, not scoped to the provided namespace", but it does not affect the result of the operation and the plugin is deleted without any issues.
My only hope is your help.
I use v0.24.0 and my Kubernetes cluster is v1.24.0.
> It's not clear why the namespace is empty
@dbolshak it looks like you have everything correct. SgxDevicePlugin is a cluster-scoped resource, so just deploy it without the namespace (-n sgx-ecdsa-quote).
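That is, using the same sample manifest as before but without the namespace flag:

```bash
# SgxDevicePlugin is cluster scoped, so no -n is needed.
kubectl apply -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml
```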
OK, I've deployed it without specifying a namespace, but it changes nothing. I still don't see sgxdeviceplugin-sample, and the Events section in the output of the following command is empty:
kubectl describe sgxdeviceplugin sgxdeviceplugin-sample
...
....
Spec:
Enclave Limit: 110
Image: intel/intel-sgx-plugin:0.24.0
Init Image: intel/intel-sgx-initcontainer:0.24.0
Log Level: 4
Node Selector:
intel.feature.node.kubernetes.io/sgx: true
Provision Limit: 110
Status:
Controlled Daemon Set:
API Version: apps/v1
Kind: DaemonSet
Name: intel-sgx-plugin
Namespace: inteldeviceplugins-system
Resource Version: 80318
UID: e80dc977-8956-424f-9f27-151cd987a144
Desired Number Scheduled: 1
Node Names:
euclid-4
Number Ready: 1
Events: <none>
> I still don't see sgxdeviceplugin-sample, and the Events section in the output of the following command is empty
I don't think we report events for xDevicePlugins objects today, so an empty Events section is expected AFAICS. Other than that, is there something else missing?
The screencast shows that a pod named sgxdeviceplugin-sample should appear; I don't have it.
> The screencast shows that a pod named sgxdeviceplugin-sample should appear; I don't have it.
I see. That posted video seems to be outdated, but the latest "script" with up-to-date information is in our demo folder. With ./screencast-play.sh play it runs you through the steps (beware of some of the problems reported in this issue, though).
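For instance, a sketch of running it from a fresh checkout (the demo directory path is assumed):

```bash
# Walk through the up-to-date SGX demo steps interactively.
git clone https://github.com/intel/intel-device-plugins-for-kubernetes.git
cd intel-device-plugins-for-kubernetes/demo
./screencast-play.sh play
```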
The video shows the plugin running in that namespace from the time when SgxDevicePlugin was still namespace scoped, but we have since moved it to cluster scoped.
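One way to confirm the current scope from your cluster (a sketch using standard kubectl):

```bash
# The NAMESPACED column should read "false" for a cluster-scoped resource.
kubectl api-resources | grep -i sgxdeviceplugin
```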
Apologies for some of the outdated information.
Thanks for clarification!