Add privileged mode for SST
The NRI pod needs privileged mode to access the /dev/isst_interface device.
This PR addresses https://github.com/containers/nri-plugins/issues/101
@marquiz Do you have any comments on this PR?
Hi @marquiz, I tried to use capabilities to grant access to the /dev/isst_interface device, but I ran into a problem when using capabilities like:
securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
      - AUDIT_CONTROL
      - AUDIT_READ
      - AUDIT_WRITE
      - BLOCK_SUSPEND
      - CHOWN
      - DAC_OVERRIDE
      - DAC_READ_SEARCH
      - FOWNER
      - FSETID
      - IPC_LOCK
      - IPC_OWNER
      - KILL
      - LEASE
      - LINUX_IMMUTABLE
      - MAC_ADMIN
      - MAC_OVERRIDE
      - MKNOD
      - NET_ADMIN
      - NET_BIND_SERVICE
      - NET_BROADCAST
      - NET_RAW
      - SETGID
      - SETFCAP
      - SETPCAP
      - SETUID
      - SYS_ADMIN
      - SYS_BOOT
      - SYS_CHROOT
      - SYS_MODULE
      - SYS_NICE
      - SYS_PACCT
      - SYS_PTRACE
      - SYS_RAWIO
      - SYS_RESOURCE
      - SYS_TIME
      - SYS_TTY_CONFIG
      - SYSLOG
In this securityContext, I didn't know which capabilities I should add, so I added all of them to the NRI pod. But the pod still reported an error:
W0912 08:17:25.346297 1 system.go:297] failed to get SST info for package 1: failed to read SST PP info: Mbox command failed with failed to open isst device "/host/dev/isst_interface": open /host/dev/isst_interface: operation not permitted
Do you have any ideas on that?
If I don't add extra capabilities to the NRI pod, the grep command output is:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/1340183/status
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=00000000a80425fb
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
And if I add the extra capabilities to the NRI pod, the grep command output is:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/1334365/status
CapInh: 0000000000000000
CapPrm: 00000037ffffffff
CapEff: 00000037ffffffff
CapBnd: 00000037ffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=00000037ffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x00000037ffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_block_suspend,cap_audit_read
So, it seems that all the capabilities have already been added to the NRI pod, but the pod still can't access the /dev/isst_interface device.
Thanks @changzhi1990 for debugging this. Fair enough, seems like we need the privileged mode, then.
I'd still like to understand why that fails. Maybe it's the ambient capabilities somehow being required (although I don't quite understand why)...
Responding to myself: I think it's because runc uses cgroups to control access to device nodes. To make this work without privileged mode, /dev/isst_interface would need to be added as a device to the container, not just a mount :/
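To illustrate the point about device access: in the OCI runtime config the runtime needs both a device node entry and a matching device-cgroup allow rule; a bind mount alone leaves the cgroup rule missing, so open(2) fails with EPERM regardless of capabilities. The sketch below uses hand-rolled stand-ins for the runtime-spec types (not the real opencontainers/runtime-spec package), with the major/minor numbers 10/123 taken from the pod spec later in this thread:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the OCI runtime-spec fields involved in device
// injection (config-linux: linux.devices and linux.resources.devices).
type Device struct {
	Path     string `json:"path"`
	Type     string `json:"type"`
	Major    int64  `json:"major"`
	Minor    int64  `json:"minor"`
	FileMode uint32 `json:"fileMode"`
}

type DeviceCgroupRule struct {
	Allow  bool   `json:"allow"`
	Type   string `json:"type"`
	Major  int64  `json:"major"`
	Minor  int64  `json:"minor"`
	Access string `json:"access"`
}

// isstDeviceSpec returns the two pieces the runtime needs: the device node
// to create inside the container, and the device-cgroup rule that permits
// opening it. Without the cgroup rule, open fails with EPERM no matter
// which capabilities the process holds -- matching the error seen above.
func isstDeviceSpec() (Device, DeviceCgroupRule) {
	dev := Device{Path: "/dev/isst_interface", Type: "c", Major: 10, Minor: 123, FileMode: 0o600}
	rule := DeviceCgroupRule{Allow: true, Type: "c", Major: 10, Minor: 123, Access: "rw"}
	return dev, rule
}

func main() {
	dev, rule := isstDeviceSpec()
	out, _ := json.MarshalIndent(struct {
		Device Device           `json:"device"`
		Rule   DeviceCgroupRule `json:"deviceCgroupRule"`
	}{dev, rule}, "", "  ")
	fmt.Println(string(out))
}
```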
Thanks for your response. I will try this.
I have compared the capabilities in two scenarios:
- Use "privileged: true"
- Add all capabilities
In scenario 1:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ ps -ef|grep nri
root 29920 29869 15 18:24 ? 00:00:02 /bin/nri-resource-policy-topology-aware --host-root /host --fallback-config /etc/nri-resource-policy/nri-resource-policy.cfg --pid-file /tmp/nri-resource-policy.pid -metrics-interval 5s
sdp 30432 3890 0 18:24 pts/0 00:00:00 grep --color=auto nri
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/29920/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/29869/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000001ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40
In scenario 2:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ ps -ef|grep nri
root 25523 25395 11 18:22 ? 00:00:03 /bin/nri-resource-policy-topology-aware --host-root /host --fallback-config /etc/nri-resource-policy/nri-resource-policy.cfg --pid-file /tmp/nri-resource-policy.pid -metrics-interval 5s
sdp 27017 3890 0 18:22 pts/0 00:00:00 grep --color=auto nri
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/25523/status
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/25395/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000003ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000003ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40,41
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000001ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40
The only difference between these two scenarios is an extra capability bit, "41", in scenario 2.
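The bit comparison above can be checked by XOR-ing the two masks; only bit 41 differs, and since CAP_LAST_CAP is 40 (cap_checkpoint_restore) on recent kernels, bit 41 does not correspond to any named capability (hence the capsh warning). A small sketch of the arithmetic:

```go
package main

import "fmt"

// capDiff returns the bit positions that differ between two capability
// masks taken from /proc/<pid>/status.
func capDiff(a, b uint64) []int {
	var bits []int
	for bit := 0; bit < 64; bit++ {
		if (a^b)&(1<<uint(bit)) != 0 {
			bits = append(bits, bit)
		}
	}
	return bits
}

func main() {
	privileged := uint64(0x000001ffffffffff) // scenario 1: privileged: true
	allCaps := uint64(0x000003ffffffffff)    // scenario 2: all capabilities added
	// Only bit 41 differs, which is beyond CAP_LAST_CAP (40).
	fmt.Println(capDiff(privileged, allCaps))
}
```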
Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?
Thanks for your response. I will try this.
@changzhi1990, ah sorry, I was a bit "equivocal" on this. I was merely speaking to myself, because there is nothing other than privileged mode in the current setup that can make it work.
Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?
Yes, this would be a way to do it. Another possible solution would be running another NRI plugin before this one that would insert the isst_interface device in the resource-policy container but let's not go there yet...
Alright, is it worth creating an SST plugin? I mean, maybe there is a lot of work to do.
Talking about another NRI plugin, @klihub had an idea about this (it seems there already exists a sample plugin for this purpose, i.e. injecting a device), so that could indeed be the solution. Let @klihub provide the details here.
@changzhi1990 You should be able to get read-only access to /dev/isst_interface using NRI itself and an(other) NRI plugin, without requiring any capabilities or using a privileged container. We have a sample NRI plugin which can inject devices, and the necessary cgroup access rules for them, based on annotations.
For instance, I used that plugin and the following pod spec to test whether I can do 'SST discovery':
apiVersion: v1
kind: Pod
metadata:
  name: sst-test
  annotations:
    devices.nri.io/container.c0: |+
      - path: /dev/isst_interface
        type: c
        major: 10
        minor: 123
        file_mode: 0o600
spec:
  containers:
    - name: c0
      image: quay.io/marquiz/goresctrl:test
      command:
        - sh
        - -c
        - sleep 3600
      resources:
        requests:
          cpu: 250m
          memory: 200M
        limits:
          cpu: 500m
          memory: 200M
      securityContext:
        privileged: false
        capabilities:
          drop:
            - all
      imagePullPolicy: IfNotPresent
Just run the device-injector plugin (for instance ./device-injector --idx 10 -name device-injector manually for testing) before starting the pod. Then you should be able to do this:
klitkey1@emr-1:~/xfer$ kubectl apply -f sst-test.yaml
klitkey1@emr-1:~/xfer$ kubectl exec -ti sst-test -c c0 -- /bin/bash --login
root@sst-test:/go/builder# /go/bin/sst-ctl info
...
PPCurrentLevel: 0
PPLocked: true
PPMaxLevel: 4
PPSupported: true
PPVersion: 3
TFEnabled: false
TFSupported: true
...
This sample plugin was not really meant for production. It's merely a sample plugin which demonstrates some of NRI's capabilities. Anyway, if you'd like to use this plugin, or roll your own, you can enable it permanently on your cluster nodes by symlinking it into /opt/nri/plugins:
klitkey1@emr-1:~/xfer$ sudo mkdir -p /opt/nri/plugins
sudo ln -s $(pwd)/device-injector /opt/nri/plugins/10-device-injector
klitkey1@emr-1:~/xfer$ sudo systemctl restart containerd # or crio
After this you should be able to annotate your pods for device injection for testing...
In a production environment you might want to restrict somehow (for instance by namespaces) which pods can be annotated with injected devices. Also, you might want to deploy that plugin itself as a DaemonSet instead of having to install it separately on each of your worker nodes. If there is enough interest, we can consider polishing that plugin, adding any necessary mechanisms for restricting access to annotation-based device injection, etc. and creating images and other deployment artifacts for it within the nri-plugins repo.
Hi, thanks for your detailed reply. Based on your message, it seems we have two options for this issue:
- Start a device-injector plugin before the NRI pod starts.
- Polish the plugin and add the necessary mechanisms.
Closing this as ancient stale.