Add privileged mode for SST
The NRI pod needs privileged mode to access the /dev/isst_interface device.
This PR addresses https://github.com/containers/nri-plugins/issues/101
@marquiz Do you have any comments on this PR?
Hi @marquiz, I tried to use capabilities to grant access to the /dev/isst_interface device, but I ran into a problem when using capabilities like:
securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
      - AUDIT_CONTROL
      - AUDIT_READ
      - AUDIT_WRITE
      - BLOCK_SUSPEND
      - CHOWN
      - DAC_OVERRIDE
      - DAC_READ_SEARCH
      - FOWNER
      - FSETID
      - IPC_LOCK
      - IPC_OWNER
      - KILL
      - LEASE
      - LINUX_IMMUTABLE
      - MAC_ADMIN
      - MAC_OVERRIDE
      - MKNOD
      - NET_ADMIN
      - NET_BIND_SERVICE
      - NET_BROADCAST
      - NET_RAW
      - SETGID
      - SETFCAP
      - SETPCAP
      - SETUID
      - SYS_ADMIN
      - SYS_BOOT
      - SYS_CHROOT
      - SYS_MODULE
      - SYS_NICE
      - SYS_PACCT
      - SYS_PTRACE
      - SYS_RAWIO
      - SYS_RESOURCE
      - SYS_TIME
      - SYS_TTY_CONFIG
      - SYSLOG
In this securityContext, I didn't know which capabilities I should add, so I added all of them to the NRI pod. But the pod still reported an error:
W0912 08:17:25.346297 1 system.go:297] failed to get SST info for package 1: failed to read SST PP info: Mbox command failed with failed to open isst device "/host/dev/isst_interface": open /host/dev/isst_interface: operation not permitted
Do you have any ideas on that?
If I don't add extra capabilities to the NRI pod, the grep command output is:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/1340183/status
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=00000000a80425fb
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
And if I add the extra capabilities to the NRI pod, the grep command output is:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/1334365/status
CapInh: 0000000000000000
CapPrm: 00000037ffffffff
CapEff: 00000037ffffffff
CapBnd: 00000037ffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=00000037ffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x00000037ffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_block_suspend,cap_audit_read
So, it seems that all the capabilities have already been added to the NRI pod, but the pod still can't access the /dev/isst_interface device.
Thanks @changzhi1990 for debugging this. Fair enough, seems like we need the privileged mode, then.
I'd still like to understand why that fails. Maybe it's the ambient capabilities somehow being required (although I don't quite understand why)...
Responding to myself: I think it's because runc uses cgroups to control access to device nodes. To make this work without privileged mode, /dev/isst_interface would need to be added as a device to the container, not just a mount :/
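To illustrate the point about device access: in the OCI runtime config the runtime needs both a device node entry and a matching device-cgroup allow rule; a bind mount alone leaves the cgroup rule missing, so open(2) fails with EPERM regardless of capabilities. The sketch below uses hand-rolled stand-ins for the runtime-spec types (not the real opencontainers/runtime-spec package), with the major/minor numbers 10/123 taken from the pod spec later in this thread:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the OCI runtime-spec fields involved in device
// injection (config-linux: linux.devices and linux.resources.devices).
type Device struct {
	Path     string `json:"path"`
	Type     string `json:"type"`
	Major    int64  `json:"major"`
	Minor    int64  `json:"minor"`
	FileMode uint32 `json:"fileMode"`
}

type DeviceCgroupRule struct {
	Allow  bool   `json:"allow"`
	Type   string `json:"type"`
	Major  int64  `json:"major"`
	Minor  int64  `json:"minor"`
	Access string `json:"access"`
}

// isstDeviceSpec returns the two pieces the runtime needs: the device node
// to create inside the container, and the device-cgroup rule that permits
// opening it. Without the cgroup rule, open fails with EPERM no matter
// which capabilities the process holds -- matching the error seen above.
func isstDeviceSpec() (Device, DeviceCgroupRule) {
	dev := Device{Path: "/dev/isst_interface", Type: "c", Major: 10, Minor: 123, FileMode: 0o600}
	rule := DeviceCgroupRule{Allow: true, Type: "c", Major: 10, Minor: 123, Access: "rw"}
	return dev, rule
}

func main() {
	dev, rule := isstDeviceSpec()
	out, _ := json.MarshalIndent(struct {
		Device Device           `json:"device"`
		Rule   DeviceCgroupRule `json:"deviceCgroupRule"`
	}{dev, rule}, "", "  ")
	fmt.Println(string(out))
}
```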
Thanks for your response. I will try this.
I have compared the capabilities in two scenarios:
- Use "privileged: true"
- Add all capabilities
In scenario 1:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ ps -ef|grep nri
root 29920 29869 15 18:24 ? 00:00:02 /bin/nri-resource-policy-topology-aware --host-root /host --fallback-config /etc/nri-resource-policy/nri-resource-policy.cfg --pid-file /tmp/nri-resource-policy.pid -metrics-interval 5s
sdp 30432 3890 0 18:24 pts/0 00:00:00 grep --color=auto nri
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/29920/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/29869/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000001ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40
In scenario 2:
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ ps -ef|grep nri
root 25523 25395 11 18:22 ? 00:00:03 /bin/nri-resource-policy-topology-aware --host-root /host --fallback-config /etc/nri-resource-policy/nri-resource-policy.cfg --pid-file /tmp/nri-resource-policy.pid -metrics-interval 5s
sdp 27017 3890 0 18:22 pts/0 00:00:00 grep --color=auto nri
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/25523/status
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/25395/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000003ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000003ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40,41
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000001ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40
The only difference between these two scenarios is an extra capability bit, "41", in scenario 2.
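The bit comparison above can be checked by XOR-ing the two masks; only bit 41 differs, and since CAP_LAST_CAP is 40 (cap_checkpoint_restore) on recent kernels, bit 41 does not correspond to any named capability (hence the capsh warning). A small sketch of the arithmetic:

```go
package main

import "fmt"

// capDiff returns the bit positions that differ between two capability
// masks taken from /proc/<pid>/status.
func capDiff(a, b uint64) []int {
	var bits []int
	for bit := 0; bit < 64; bit++ {
		if (a^b)&(1<<uint(bit)) != 0 {
			bits = append(bits, bit)
		}
	}
	return bits
}

func main() {
	privileged := uint64(0x000001ffffffffff) // scenario 1: privileged: true
	allCaps := uint64(0x000003ffffffffff)    // scenario 2: all capabilities added
	// Only bit 41 differs, which is beyond CAP_LAST_CAP (40).
	fmt.Println(capDiff(privileged, allCaps))
}
```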
Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?
Thanks for your response. I will try this.
@changzhi1990, ah sorry, I was a bit "equivocal" on this. I was merely speaking to myself, because there is nothing other than privileged mode in the current setup that can make it work.
Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?
Yes, this would be a way to do it. Another possible solution would be running another NRI plugin before this one that would insert the isst_interface device in the resource-policy container but let's not go there yet...
Alright, is it worth creating an SST plugin? I mean, maybe there is a lot of work to do.
Talking about another NRI plugin, @klihub had an idea about this (it seems there already exists a sample plugin for this purpose, i.e. injecting a device), so that could indeed be the solution. Let @klihub provide the details here.
@changzhi1990 You should be able to get read-only access to /dev/isst_interface using NRI itself and an(other) NRI plugin, without requiring any capabilities or using a privileged container. We have a sample NRI plugin which can inject devices, and the necessary cgroup access rules for them, based on annotations.
For instance, I used that plugin and the following pod spec to test whether I can do 'SST discovery':
apiVersion: v1
kind: Pod
metadata:
  name: sst-test
  annotations:
    devices.nri.io/container.c0: |+
      - path: /dev/isst_interface
        type: c
        major: 10
        minor: 123
        file_mode: 0o600
spec:
  containers:
    - name: c0
      image: quay.io/marquiz/goresctrl:test
      command:
        - sh
        - -c
        - sleep 3600
      resources:
        requests:
          cpu: 250m
          memory: 200M
        limits:
          cpu: 500m
          memory: 200M
      securityContext:
        privileged: false
        capabilities:
          drop:
            - all
      imagePullPolicy: IfNotPresent
Just run the device-injector plugin (for instance ./device-injector --idx 10 -name device-injector manually for testing) before starting the pod. Then you should be able to do this:
klitkey1@emr-1:~/xfer$ kubectl apply -f sst-test.yaml
klitkey1@emr-1:~/xfer$ kubectl exec -ti sst-test -c c0 -- /bin/bash --login
root@sst-test:/go/builder# /go/bin/sst-ctl info
...
PPCurrentLevel: 0
PPLocked: true
PPMaxLevel: 4
PPSupported: true
PPVersion: 3
TFEnabled: false
TFSupported: true
...
This sample plugin was not really meant for production. It's merely a sample plugin which demonstrates some of NRI's capabilities. Anyway, if you'd like to use this plugin, or roll your own, you can enable it permanently on your cluster nodes by symlinking it into /opt/nri/plugins:
klitkey1@emr-1:~/xfer$ sudo mkdir -p /opt/nri/plugins
sudo ln -s $(pwd)/device-injector /opt/nri/plugins/10-device-injector
klitkey1@emr-1:~/xfer$ sudo systemctl restart containerd # or crio
After this you should be able to annotate your pods for device injection for testing...
In a production environment you might want to restrict somehow (for instance by namespaces) which pods can be annotated with injected devices. Also, you might want to deploy that plugin itself as a DaemonSet instead of having to install it separately on each of your worker nodes. If there is enough interest, we can consider polishing that plugin, adding any necessary mechanisms for restricting access to annotation-based device injection, etc. and creating images and other deployment artifacts for it within the nri-plugins repo.
Hi, thanks for your detailed reply. Based on your message, it seems we have two options for this issue:
- Start a device-injector plugin before the NRI pod starts.
- Polish the plugin and add the necessary mechanisms.
Closing this as ancient stale.