
security.privileged + ubuntu-daily:noble doesn't work - systemd services fail to start

Open peat-psuwit opened this issue 1 year ago • 6 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 22.04
  • The output of "snap list --all lxd core20 core22 core24 snapd":
core20  20231123      2105   latest/stable  canonical✓  base,disabled
core20  20240111      2182   latest/stable  canonical✓  base
core22  20231123      1033   latest/stable  canonical✓  base,disabled
core22  20240111      1122   latest/stable  canonical✓  base
lxd     5.20-a8d6c52  26955  latest/stable  canonical✓  disabled
lxd     5.20-f3dd836  27049  latest/stable  canonical✓  -
snapd   2.60.4        20290  latest/stable  canonical✓  snapd,disabled
snapd   2.61.1        20671  latest/stable  canonical✓  snapd
  • The output of "lxc info":
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- storage_api_remote_volume_snapshot_copy
- zfs_delegate
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: peat
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICIzCCAaqgAwIBAgIQR4idz3JyFUf1B/1wk081ajAKBggqhkjOPQQDAzA/MRww
    GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMR8wHQYDVQQDDBZyb290QHBlYXQt
    bG52bGUtdWJ1bnR1MB4XDTIxMDMxNTIwMDkyNloXDTMxMDMxMzIwMDkyNlowPzEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEfMB0GA1UEAwwWcm9vdEBwZWF0
    LWxudmxlLXVidW50dTB2MBAGByqGSM49AgEGBSuBBAAiA2IABErSygFREtO0MuBn
    BrHECWEZUwrJdQYqzokwNorYN5OJ74/lz8DhKmTqymNUZS4NPCRFYDFjYwUwacGF
    h81kwyZcR8jM0Eqsi5J27vUjgVm718BEVyV//yoxh+ydVRAALaNrMGkwDgYDVR0P
    AQH/BAQDAgWgMBMGA1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwNAYD
    VR0RBC0wK4IRcGVhdC1sbnZsZS11YnVudHWHBH8AAAGHEAAAAAAAAAAAAAAAAAAA
    AAEwCgYIKoZIzj0EAwMDZwAwZAIwTzkGM6pvfovxrsRYN5D8GfFXtbF+mQgV0kc4
    Mlr4WlGMtbmFSimiUYQCzjrY2kC+AjAIxjJWEELFqk8jHleQPZLwXrNGJRRLzt3r
    gETbNFBfy+2Q1b/Mtpl1FNaXWuzZJNo=
    -----END CERTIFICATE-----
  certificate_fingerprint: b336373d275723a4f912e737e3630caeb4b3e54030a5c98840dc090663ca4c25
  driver: lxc | qemu
  driver_version: 5.0.3 | 8.1.3
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.5.0-21-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: peat-lnvle-ubuntu
  server_pid: 49189
  server_version: "5.20"
  storage: btrfs
  storage_version: 5.16.2
  storage_supported_drivers:
  - name: cephfs
    version: 17.2.6
    remote: true
  - name: cephobject
    version: 17.2.6
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0
    remote: false
  - name: zfs
    version: 2.2.0-0ubuntu1~23.10
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false
  - name: ceph
    version: 17.2.6
    remote: true

Issue description

Setting security.privileged = true on a container created from ubuntu-daily:noble makes it fail to start properly. By "fail to start", I mean that a lot of services do not start, including systemd-tmpfiles-setup-dev.service, systemd-resolved.service and systemd-networkd.service. The errors include "systemd-resolved.service: Failed to set up credentials: Protocol error" and "systemd-networkd.service: Failed to set up mount namespacing: Permission denied".

This may be related to https://github.com/lxc/lxc/issues/4402 and seems to be related to AppArmor.

Steps to reproduce

  1. lxc init ubuntu-daily:noble noble-test
  2. lxc config set noble-test security.privileged true
  3. lxc start noble-test
  4. lxc exec noble-test -- journalctl --boot (the journal shows a lot of failures)
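
The steps above can be wrapped in a small script. This is only a sketch: the function names and the `LXC` override variable are hypothetical conveniences that are not part of the report, and the `systemctl --failed` call is an extra check that the steps above do not include.

```shell
#!/bin/sh
# LXC is a hypothetical override hook (useful for dry runs); it defaults to the real client.
LXC="${LXC:-lxc}"

# Create a privileged container from the daily Noble image and start it.
reproduce() {
    name="${1:-noble-test}"
    "$LXC" init ubuntu-daily:noble "$name" &&
    "$LXC" config set "$name" security.privileged true &&
    "$LXC" start "$name"
}

# Print the journal of the current boot, as in step 4.
show_journal() {
    name="${1:-noble-test}"
    "$LXC" exec "$name" -- journalctl --boot --no-pager
}

# List units that systemd marked as failed (an addition, not in the original steps).
list_failed_units() {
    name="${1:-noble-test}"
    "$LXC" exec "$name" -- systemctl --failed --no-pager
}
```

After sourcing the file, `reproduce noble-test && list_failed_units noble-test` would run the whole sequence. Setting `LXC=echo` first prints the commands instead of running them.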

Information to attach

  • [x] Any relevant kernel output (dmesg): dmesg.log
  • [x] Container log (lxc info NAME --show-log)
Log:

lxc noble-test 20240226191344.130 ERROR    conf - ../src/src/lxc/conf.c:turn_into_dependent_mounts:3948 - No such file or directory - Failed to recursively turn old root mount tree into dependent mount. Continuing...
  • [x] Container configuration (lxc config show NAME --expanded): container-config.yml.txt

  • [x] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)

time="2024-02-27T01:39:13+07:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
  • [x] Output of the client with --debug: lxd-lxc-start.log
  • [x] Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue): lxd-lxc-monitor.log

peat-psuwit avatar Feb 26 '24 19:02 peat-psuwit

@mihalicyn is this related to the known issue with apparmor parser bug + LXD's workaround apparmor profile and recent versions of systemd?

tomponline avatar Feb 27 '24 08:02 tomponline

Hi @peat-psuwit!

Thanks a lot for your report.

Yes, we are aware of some issues with AppArmor when a privileged container is used. We strongly recommend always using unprivileged containers, as they are safer and also far more stable.

As a workaround, I suggest running: lxc config set noble-test security.nesting=true

This should allow Ubuntu Noble to start. Be aware, though, that this is not fully safe: in theory, a user can mount any file system inside the container when security.nesting is enabled together with security.privileged (for unprivileged containers, nesting is safe!). I'm not aware of any practical exploit for it, but I want to warn you. We plan to address these AppArmor issues, but unfortunately not everything depends on us, because AppArmor is an external tool.

Let's stay in touch on this. And thanks again for reporting!

Kind regards, Alex

mihalicyn avatar Feb 28 '24 11:02 mihalicyn

@mihalicyn do you have a link to the reported apparmor issue from 2016?

tomponline avatar Feb 28 '24 11:02 tomponline

@mihalicyn my understanding is that we can address this issue once LXD's snap has a newer apparmor rule parser, which will occur when we switch the base snap to core24. I'll mark this as "later" for now.

tomponline avatar Feb 28 '24 11:02 tomponline

Hi @mihalicyn

The security.nesting workaround works. I don't worry about security too much, as our use case is for development and requires mounting the user's directory into the container anyway.

Our script [1], which was written some time ago, specifically asks for a privileged container if LXD is detected to be a snap, presumably because it won't read the host's /etc/sub{uid,gid}. But a bit of research shows that one can set a custom idmap for the container [2], so I might go that route instead.

[1] If it rings anyone's bell, this is Crossbuilder.
[2] https://documentation.ubuntu.com/lxd/en/latest/userns-idmap/#custom-idmaps
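
For reference, the custom idmap route could look roughly like the sketch below. It relies on the `raw.idmap` key described in the linked documentation; the function name, the `LXC` override variable and the default IDs are hypothetical. Whether the host needs extra /etc/subuid and /etc/subgid entries depends on how LXD is installed, so treat that as something to verify rather than a given.

```shell
#!/bin/sh
# LXC is a hypothetical override hook (useful for dry runs); it defaults to the real client.
LXC="${LXC:-lxc}"

# Map one host UID/GID pair onto a UID/GID inside an unprivileged container.
# Arguments: container name, host ID (default: current user), container ID (default: 1000).
map_user_into() {
    name="$1"
    host_id="${2:-$(id -u)}"
    ct_id="${3:-1000}"
    # "both" applies the mapping to the UID and the GID.
    "$LXC" config set "$name" raw.idmap "both $host_id $ct_id"
}
```

For example, `map_user_into noble-test 1000 1000` followed by `lxc restart noble-test`. The map is applied when the container starts, so a restart is needed.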

peat-psuwit avatar Feb 28 '24 12:02 peat-psuwit

I did some additional investigation and found that systemd these days wants even more than just changing mount propagation flags. It also wants to rbind /, do pivot_root and so on. So, effectively, nesting must be enabled to make systemd happy anyway, which defeats the security model of privileged containers completely. :-(

Patch example (only for experimental use!):

diff --git a/lxd/apparmor/instance_lxc.go b/lxd/apparmor/instance_lxc.go
index d5c9470ad..2eefa63ac 100644
--- a/lxd/apparmor/instance_lxc.go
+++ b/lxd/apparmor/instance_lxc.go
@@ -85,14 +85,14 @@ profile "{{ .name }}" flags=(attach_disconnected,mediate_deleted) {
   mount fstype=tmpfs,
 
   # Allow limited modification of mount propagation
-  mount options=(rw,slave) -> /,
-  mount options=(rw,rslave) -> /,
-  mount options=(rw,shared) -> /,
-  mount options=(rw,rshared) -> /,
-  mount options=(rw,private) -> /,
-  mount options=(rw,rprivate) -> /,
-  mount options=(rw,unbindable) -> /,
-  mount options=(rw,runbindable) -> /,
+  mount options=(rw,slave) -> **,
+  mount options=(rw,rslave) -> **,
+  mount options=(rw,shared) -> **,
+  mount options=(rw,rshared) -> **,
+  mount options=(rw,private) -> **,
+  mount options=(rw,rprivate) -> **,
+  mount options=(rw,unbindable) -> **,
+  mount options=(rw,runbindable) -> **,
 
   # Allow various ro-bind-*re*-mounts of anything except /proc, /sys and /dev/.lxc
   mount options=(ro,remount,bind) /[^spd]*{,/**},
@@ -296,6 +296,33 @@ profile "{{ .name }}" flags=(attach_disconnected,mediate_deleted) {
   mount options=(rw,rbind) /sy[^s]*{,/**},
   mount options=(rw,rbind) /sys?*{,/**},
 
+  # workaround modern systemd (unsafe!)
+  mount options=(rw,bind) /,
+  mount options=(rw,bind) /**,
+  mount options=(rw,rbind) /,
+  mount options=(rw,rbind) /**,
+  # Allow common combinations of bind/remount
+  # NOTE: AppArmor bug effectively turns those into wildcards mount allow
+  mount options=(ro,remount,bind),
+  mount options=(ro,remount,bind,nodev),
+  mount options=(ro,remount,bind,nodev,nosuid),
+  mount options=(ro,remount,bind,noexec),
+  mount options=(ro,remount,bind,noexec,nodev),
+  mount options=(ro,remount,bind,nosuid),
+  mount options=(ro,remount,bind,nosuid,nodev),
+  mount options=(ro,remount,bind,nosuid,noexec),
+  mount options=(ro,remount,bind,nosuid,noexec,nodev),
+  mount options=(ro,remount,bind,noatime),
+  mount options=(ro,remount,bind,noatime,nodev),
+  mount options=(ro,remount,bind,noatime,noexec),
+  mount options=(ro,remount,bind,noatime,nosuid),
+  mount options=(ro,remount,bind,noatime,noexec,nodev),
+  mount options=(ro,remount,bind,noatime,nosuid,nodev),
+  mount options=(ro,remount,bind,noatime,nosuid,noexec),
+  mount options=(ro,remount,bind,noatime,nosuid,noexec,nodev),
+  mount options=(ro,remount,bind,nosuid,noexec,strictatime),
+  mount options=(ro,remount,nosuid,noexec,strictatime),
+
   # Allow moving mounts except for /proc, /sys and /dev/.lxc
   mount options=(rw,move) /[^spd]*{,/**},
   mount options=(rw,move) /d[^e]*{,/**},

> The security.nesting workaround works. I don't worry about security too much, as our use case is for development and requires mounting the user's directory into the container anyway.

That's a perfectly valid use case for privileged containers. In any other case, using unprivileged ones is strongly recommended.

Just for future reference, when debugging cases like this it usually makes sense to check:

  1. dmesg | grep DENIED
  2. whether lxc config set <CT NAME> security.nesting=true helps. If not, go to step 3.
  3. whether lxc config set <CT NAME> raw.lxc="lxc.apparmor.profile = unconfined" helps.
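
Step 1 of the checklist above can be narrowed to a single container with a small filter. This is a sketch with hypothetical function names. It assumes the kernel logs denials with the `apparmor="DENIED"` marker and that the container's profile name starts with `lxd-<CT NAME>_`; that naming may differ, for instance with non-default projects.

```shell
#!/bin/sh
# Keep only AppArmor denial records from kernel log text read on stdin.
filter_denials() {
    grep 'apparmor="DENIED"'
}

# Narrow the denials to one container, assuming its profile is named "lxd-<name>_...".
profile_denials() {
    filter_denials | grep -F "profile=\"lxd-${1}_"
}
```

Typical use would be `dmesg | profile_denials noble-test`. The `operation=` and `name=` fields of each matching line show which action was blocked and on which path.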

@tomponline bug link is https://bugs.launchpad.net/apparmor/+bug/1597017

mihalicyn avatar Feb 28 '24 12:02 mihalicyn

This issue is a real pain. As was said above, systemd now creates mount namespaces by default and performs a recursive bind mount of / (inside the container) to some other path. Our AppArmor profile (without nesting enabled) blocks this, because allowing it defeats the path-based AppArmor restrictions completely.

See, for example:

  # Block dangerous paths under /proc/sys
  deny /proc/sys/[^kn]*{,/**} wklx,
  deny /proc/sys/k[^e]*{,/**} wklx,
  deny /proc/sys/ke[^r]*{,/**} wklx,
  deny /proc/sys/ker[^n]*{,/**} wklx,
  deny /proc/sys/kern[^e]*{,/**} wklx,
  deny /proc/sys/kerne[^l]*{,/**} wklx,

If an attacker wants to bypass this and can do a recursive bind mount, they can:

mkdir /mnt/attacker_playground
mount --rbind / /mnt/attacker_playground
# cool! Now we can access /proc/sys/kernel through /mnt/attacker_playground/proc/sys/kernel/...

That's the reason why security.nesting is not safe when used with privileged containers, while being absolutely safe with unprivileged ones.
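
The rule above can be turned into a quick check of an existing container. This is a sketch with hypothetical function names and labels. It only recognises the literal value `true` and treats an unset key as false, so other truthy spellings would be misclassified.

```shell
#!/bin/sh
# LXC is a hypothetical override hook (useful for dry runs); it defaults to the real client.
LXC="${LXC:-lxc}"

# Classify a (security.privileged, security.nesting) pair according to the rule above.
nesting_risk() {
    privileged="$1"
    nesting="$2"
    if [ "$nesting" != "true" ]; then
        echo "nesting-off"
    elif [ "$privileged" = "true" ]; then
        echo "unsafe"   # privileged + nesting: path-based AppArmor rules can be bypassed
    else
        echo "safe"     # unprivileged + nesting
    fi
}

# Read both keys from a container and classify it.
audit_instance() {
    name="$1"
    nesting_risk "$("$LXC" config get "$name" security.privileged)" \
                 "$("$LXC" config get "$name" security.nesting)"
}
```

Running `audit_instance noble-test` after applying the workaround from this thread would print `unsafe`, which matches the warning given earlier.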

mihalicyn avatar Mar 19 '24 14:03 mihalicyn