security.privileged + ubuntu-daily:noble doesn't work - systemd services fail to start
Required information
- Distribution: Ubuntu
- Distribution version: 22.04
- The output of "snap list --all lxd core20 core22 core24 snapd":
core20 20231123 2105 latest/stable canonical✓ base,disabled
core20 20240111 2182 latest/stable canonical✓ base
core22 20231123 1033 latest/stable canonical✓ base,disabled
core22 20240111 1122 latest/stable canonical✓ base
lxd 5.20-a8d6c52 26955 latest/stable canonical✓ disabled
lxd 5.20-f3dd836 27049 latest/stable canonical✓ -
snapd 2.60.4 20290 latest/stable canonical✓ snapd,disabled
snapd 2.61.1 20671 latest/stable canonical✓ snapd
- The output of "lxc info":
config: {} api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - image_profiles - 
clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - 
storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total - auth_user - security_csm - instances_rebuild - numa_cpu_placement - custom_volume_iso - network_allocations - storage_api_remote_volume_snapshot_copy - zfs_delegate - operations_get_query_all_projects - metadata_configuration - syslog_socket - event_lifecycle_name_and_project - instances_nic_limits_priority - disk_initial_volume_configuration - operation_wait - cluster_internal_custom_volume_copy - disk_io_bus - storage_cephfs_create_missing - instance_move_config api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls auth_user_name: peat auth_user_method: unix environment: addresses: [] architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- MIICIzCCAaqgAwIBAgIQR4idz3JyFUf1B/1wk081ajAKBggqhkjOPQQDAzA/MRww GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMR8wHQYDVQQDDBZyb290QHBlYXQt bG52bGUtdWJ1bnR1MB4XDTIxMDMxNTIwMDkyNloXDTMxMDMxMzIwMDkyNlowPzEc MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEfMB0GA1UEAwwWcm9vdEBwZWF0 LWxudmxlLXVidW50dTB2MBAGByqGSM49AgEGBSuBBAAiA2IABErSygFREtO0MuBn BrHECWEZUwrJdQYqzokwNorYN5OJ74/lz8DhKmTqymNUZS4NPCRFYDFjYwUwacGF h81kwyZcR8jM0Eqsi5J27vUjgVm718BEVyV//yoxh+ydVRAALaNrMGkwDgYDVR0P AQH/BAQDAgWgMBMGA1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwNAYD VR0RBC0wK4IRcGVhdC1sbnZsZS11YnVudHWHBH8AAAGHEAAAAAAAAAAAAAAAAAAA AAEwCgYIKoZIzj0EAwMDZwAwZAIwTzkGM6pvfovxrsRYN5D8GfFXtbF+mQgV0kc4 Mlr4WlGMtbmFSimiUYQCzjrY2kC+AjAIxjJWEELFqk8jHleQPZLwXrNGJRRLzt3r gETbNFBfy+2Q1b/Mtpl1FNaXWuzZJNo= -----END CERTIFICATE----- certificate_fingerprint: b336373d275723a4f912e737e3630caeb4b3e54030a5c98840dc090663ca4c25 driver: lxc | qemu driver_version: 5.0.3 | 8.1.3 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 6.5.0-21-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: peat-lnvle-ubuntu server_pid: 49189 server_version: "5.20" storage: btrfs storage_version: 5.16.2 storage_supported_drivers: - name: cephfs version: 
17.2.6 remote: true - name: cephobject version: 17.2.6 remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0 remote: false - name: zfs version: 2.2.0-0ubuntu1~23.10 remote: false - name: btrfs version: 5.16.2 remote: false - name: ceph version: 17.2.6 remote: true
Issue description
Setting security.privileged=true on a container of ubuntu-daily:noble makes it "fail to start". By "fail to start", I mean that a lot of services will not start, including systemd-tmpfiles-setup-dev.service, systemd-resolved.service and systemd-networkd.service. The errors include "systemd-resolved.service: Failed to set up credentials: Protocol error" and "systemd-networkd.service: Failed to set up mount namespacing: Permission denied".
This may be related to https://github.com/lxc/lxc/issues/4402 and appears to involve AppArmor.
Steps to reproduce
- lxc init ubuntu-daily:noble noble-test
- lxc config set noble-test security.privileged true
- lxc start noble-test
- lxc exec noble-test -- journalctl --boot and observe a lot of failures (see the one-liner below).
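A quick way to surface the specific failures mentioned above (the grep pattern is just a convenience matching the two error messages quoted in the description):
lxc exec noble-test -- journalctl --boot | grep -E 'Failed to set up (credentials|mount namespacing)'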
Information to attach
- [x] Any relevant kernel output (dmesg): dmesg.log
- [x] Container log (lxc info NAME --show-log). Log:
  lxc noble-test 20240226191344.130 ERROR conf - ../src/src/lxc/conf.c:turn_into_dependent_mounts:3948 - No such file or directory - Failed to recursively turn old root mount tree into dependent mount. Continuing...
- [x] Container configuration (lxc config show NAME --expanded): container-config.yml.txt
- [x] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log):
  time="2024-02-27T01:39:13+07:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
- [x] Output of the client with --debug: lxd-lxc-start.log
- [x] Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue): lxd-lxc-monitor.log
@mihalicyn is this related to the known issue with the apparmor parser bug + LXD's workaround apparmor profile and recent versions of systemd?
Hi @peat-psuwit!
Thanks a lot for your report.
Yes, we are aware of some issues with AppArmor when privileged containers are used. We strongly recommend always using unprivileged containers, as they are safer and also far more stable.
As a workaround, I can suggest:
lxc config set noble-test security.nesting=true
This should allow Ubuntu Noble to start.
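For completeness, a minimal sketch of applying the workaround end to end (the restart and the verification step are assumptions, not part of the original suggestion):
lxc config set noble-test security.nesting=true
lxc restart noble-test
lxc exec noble-test -- systemctl --failed   # expect no failed units once nesting is enabled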
But be aware that this is not fully safe: in theory, a user can mount any file system inside the container when security.nesting is enabled together with security.privileged (for the unprivileged case, nesting is safe!). I'm not aware of any practical exploit for it, but I just want to warn you. It's in our plans to do something about these AppArmor issues, but unfortunately not everything depends on us, because AppArmor is an external tool.
Let's stay in touch on this. And thanks again for reporting!
Kind regards, Alex
@mihalicyn do you have a link to the reported apparmor issue from 2016?
@mihalicyn my understanding is that we can address this issue once LXD's snap has a newer apparmor rule parser, which will occur when we switch the base snap to core24. I'll mark this as "later" for now.
Hi @mihalicyn
The security.nesting workaround works. I don't worry about security too much, as our use case is for development and requires mounting the user's directory into the container anyway.
Our script [1], which was written some time ago, specifically asks for a privileged container if LXD is detected to be a snap, presumably because it won't read the host's /etc/sub{uid,gid}. But a bit of research shows that one can set a custom idmap for the container [2], so I might go that route instead.
[1] If it rings anyone's bell, this is Crossbuilder. [2] https://documentation.ubuntu.com/lxd/en/latest/userns-idmap/#custom-idmaps
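For reference, a minimal sketch of the custom-idmap route (mapping UID/GID 1000 is an assumption for a typical single-user host; see [2] for the exact syntax and prerequisites):
lxc config set noble-test raw.idmap "both 1000 1000"
lxc restart noble-test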
I did some additional investigation and found that systemd these days wants even more than just changing mount propagation flags. It also wants to rbind /, do a pivot_root, and so on. So, effectively, nesting must be enabled to make systemd happy anyway, which defeats the security model of privileged containers completely. :-(
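To illustrate, a rough sketch of the kind of operation systemd performs inside the container (the target path and exact sequence are assumptions, not the actual systemd code path):
mkdir -p /run/systemd/unit-root
unshare --mount sh -c '
  mount --make-rslave /                     # propagation change: allowed by the current profile
  mount --rbind / /run/systemd/unit-root    # recursive bind of /: blocked without nesting
'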
Patch example (only for experimental use!):
diff --git a/lxd/apparmor/instance_lxc.go b/lxd/apparmor/instance_lxc.go
index d5c9470ad..2eefa63ac 100644
--- a/lxd/apparmor/instance_lxc.go
+++ b/lxd/apparmor/instance_lxc.go
@@ -85,14 +85,14 @@ profile "{{ .name }}" flags=(attach_disconnected,mediate_deleted) {
mount fstype=tmpfs,
# Allow limited modification of mount propagation
- mount options=(rw,slave) -> /,
- mount options=(rw,rslave) -> /,
- mount options=(rw,shared) -> /,
- mount options=(rw,rshared) -> /,
- mount options=(rw,private) -> /,
- mount options=(rw,rprivate) -> /,
- mount options=(rw,unbindable) -> /,
- mount options=(rw,runbindable) -> /,
+ mount options=(rw,slave) -> **,
+ mount options=(rw,rslave) -> **,
+ mount options=(rw,shared) -> **,
+ mount options=(rw,rshared) -> **,
+ mount options=(rw,private) -> **,
+ mount options=(rw,rprivate) -> **,
+ mount options=(rw,unbindable) -> **,
+ mount options=(rw,runbindable) -> **,
# Allow various ro-bind-*re*-mounts of anything except /proc, /sys and /dev/.lxc
mount options=(ro,remount,bind) /[^spd]*{,/**},
@@ -296,6 +296,33 @@ profile "{{ .name }}" flags=(attach_disconnected,mediate_deleted) {
mount options=(rw,rbind) /sy[^s]*{,/**},
mount options=(rw,rbind) /sys?*{,/**},
+ # workaround modern systemd (unsafe!)
+ mount options=(rw,bind) /,
+ mount options=(rw,bind) /**,
+ mount options=(rw,rbind) /,
+ mount options=(rw,rbind) /**,
+ # Allow common combinations of bind/remount
+ # NOTE: AppArmor bug effectively turns those into wildcards mount allow
+ mount options=(ro,remount,bind),
+ mount options=(ro,remount,bind,nodev),
+ mount options=(ro,remount,bind,nodev,nosuid),
+ mount options=(ro,remount,bind,noexec),
+ mount options=(ro,remount,bind,noexec,nodev),
+ mount options=(ro,remount,bind,nosuid),
+ mount options=(ro,remount,bind,nosuid,nodev),
+ mount options=(ro,remount,bind,nosuid,noexec),
+ mount options=(ro,remount,bind,nosuid,noexec,nodev),
+ mount options=(ro,remount,bind,noatime),
+ mount options=(ro,remount,bind,noatime,nodev),
+ mount options=(ro,remount,bind,noatime,noexec),
+ mount options=(ro,remount,bind,noatime,nosuid),
+ mount options=(ro,remount,bind,noatime,noexec,nodev),
+ mount options=(ro,remount,bind,noatime,nosuid,nodev),
+ mount options=(ro,remount,bind,noatime,nosuid,noexec),
+ mount options=(ro,remount,bind,noatime,nosuid,noexec,nodev),
+ mount options=(ro,remount,bind,nosuid,noexec,strictatime),
+ mount options=(ro,remount,nosuid,noexec,strictatime),
+
# Allow moving mounts except for /proc, /sys and /dev/.lxc
mount options=(rw,move) /[^spd]*{,/**},
mount options=(rw,move) /d[^e]*{,/**},
The security.nesting workaround works. I don't worry about security too much, as our use case is for development and requires mounting the user's directory into the container anyway.
That's a perfectly valid use case for privileged containers. In any other case, it's strongly recommended to use unprivileged ones.
Just for future reference, when debugging cases like this it usually makes sense to check the following (combined into a single sketch after the list):
- dmesg | grep DENIED
- whether lxc config set <CT NAME> security.nesting=true helps
- if not, whether lxc config set <CT NAME> raw.lxc="lxc.apparmor.profile = unconfined" helps
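The same checklist as a compact shell sketch (the container name and the restarts after each config change are assumptions):
CT=noble-test
dmesg | grep DENIED
lxc config set "$CT" security.nesting=true && lxc restart "$CT"
# if that doesn't help:
lxc config set "$CT" raw.lxc="lxc.apparmor.profile = unconfined" && lxc restart "$CT"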
@tomponline bug link is https://bugs.launchpad.net/apparmor/+bug/1597017
This issue is a real pain. As was said above, systemd now creates mount namespaces by default and performs a recursive bind mount of / (inside the container) to some other path. Our AppArmor profile (without nesting enabled) blocks this, because allowing it would defeat the path-based AppArmor restrictions completely.
See, for example:
# Block dangerous paths under /proc/sys
deny /proc/sys/[^kn]*{,/**} wklx,
deny /proc/sys/k[^e]*{,/**} wklx,
deny /proc/sys/ke[^r]*{,/**} wklx,
deny /proc/sys/ker[^n]*{,/**} wklx,
deny /proc/sys/kern[^e]*{,/**} wklx,
deny /proc/sys/kerne[^l]*{,/**} wklx,
If an attacker wants to bypass this and can do a recursive bind mount, they can:
mkdir /mnt/attacker_playground
mount --rbind / /mnt/attacker_playground
# cool! Now we can access /proc/sys/kernel through /mnt/attacker_playground/proc/sys/kernel/...
That's the reason why security.nesting is not safe when used with privileged containers, while at the same time it is perfectly safe with unprivileged ones.