
Support setting quota on overlay snapshots

yylt opened this issue 3 years ago • 19 comments

Support setting quota on overlay snapshots

As discussed in issue #3329

Fixes https://github.com/containerd/containerd/issues/3329

Enable quota setting as follows:

  1. Make sure the containerd root directory is mounted with the 'pquota' option
  2. Set 'overlay' as the default snapshotter and enable quota in the 'overlay' config
  3. Set the default quota size in CRI

The config.toml looks like this:

  [plugins.'io.containerd.cri.v1.runtime']
    enable_selinux = false
    default_snapshot_quota_size = '2M'   #+
...
  [plugins.'io.containerd.snapshotter.v1.overlayfs']
    root_path = ''
    upperdir_label = false
    sync_remove = false
    slow_chown = false
    mount_options = []
    enable_quota = true  #+

Check the containerd root mount info:

# /etc/fstab
...
/dev/mapper/containerd_root  /var/lib/containerd  xfs     defaults,pquota        0 0

# mount ...
/dev/mapper/containerd_root on /var/lib/containerd type xfs (rw,relatime,prjquota)

Check that the quota has been set for the container:

xfs_quota -x -c "report -h" /var/lib/containerd/

Signed-off-by: Yang Yang [email protected]

yylt avatar Aug 11 '21 03:08 yylt

Hi @yylt. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 11 '21 03:08 k8s-ci-robot

Build succeeded.

theopenlab-ci[bot] avatar Aug 11 '21 04:08 theopenlab-ci[bot]

Quota handling is a topic we should discuss at a high level first and possibly a good topic for a future community meeting. However, getting library support for setting project quotas across different backing FS makes sense to do now.

As an end feature, we need to discuss how it will be integrated with Kubernetes and how we can use it from our own tooling and go libraries. It seems backwards to have the cri layer here enable the quota then the snapshot set a static quota. I would think the quota would be set by the client adding a quota and the snapshotter enabling it via configuration and erroring out if the backing FS couldn't support it when enabled. The difficult part is figuring out from the client perspective whether quota is enabled so it knows whether it can avoid expensive ephemeral storage accounting (such as in the Kubelet).

dmcgowan avatar Aug 11 '21 18:08 dmcgowan

The difficult part is figuring out from the client perspective whether quota is enabled so it knows whether it can avoid expensive ephemeral storage accounting (such as in the Kubelet).

In Kubernetes, there is a feature gate and a check method, SupportsQuotas; see the KEP "Quotas for Ephemeral Storage" (https://github.com/kubernetes/enhancements/issues/1029) and the implementation at https://github.com/kubernetes/kubernetes/blob/226323178e23b4c476001266beab1e2f116b3879/pkg/volume/util/fsquota/common/quota_linux_common_impl.go#L180-L190:

// SupportsQuotas determines whether the filesystem supports quotas.
func SupportsQuotas(mountpoint string, qType QuotaType) (bool, error) {
	data, err := runXFSQuotaCommand(mountpoint, "state -p")
	if err != nil {
		return false, err
	}
	if qType == FSQuotaEnforcing {
		return strings.Contains(data, "Enforcement: ON"), nil
	}
	return strings.Contains(data, "Accounting: ON"), nil
}
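
For illustration, a client such as the kubelet could gate its accounting strategy on this check. A minimal usage sketch; note that k8s.io/kubernetes internal packages are not meant to be imported directly, and the mount point here is only an example:

package main

import (
	"fmt"

	common "k8s.io/kubernetes/pkg/volume/util/fsquota/common"
)

func main() {
	// Probe whether project quotas are enforced on the containerd root;
	// if not, the client falls back to expensive directory walking.
	enforcing, err := common.SupportsQuotas("/var/lib/containerd", common.FSQuotaEnforcing)
	if err != nil {
		fmt.Println("quota probe failed:", err)
		return
	}
	fmt.Println("enforcing project quota:", enforcing)
}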

pacoxu avatar Jun 09 '22 05:06 pacoxu

Sorry for replying so much later.

As an end feature, we need to discuss how it will be integrated with Kubernetes and how we can use it from our own tooling and go libraries. It seems backwards to have the cri layer here enable the quota then the snapshot set a static quota. I would think the quota would be set by the client adding a quota and the snapshotter enabling it via configuration and erroring out if the backing FS couldn't support it when enabled. The difficult part is figuring out from the client perspective whether quota is enabled so it knows whether it can avoid expensive ephemeral storage accounting (such as in the Kubelet).

That is a good suggestion. The quota setter now focuses on overlay, implemented in pure Go.

I also updated the snapshots library to add a new option, WithQuotaSize. If the option is not defined, the default size from the overlayfs configuration is used.

such as:

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = ""
    quota_size = "128MB"
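
For illustration, such an option could follow the existing snapshots.Opt pattern and record the requested size as a snapshot label for the snapshotter to act on. A minimal sketch; the label key and helper body are assumptions, not the PR's final API:

package quota

import "github.com/containerd/containerd/snapshots"

// quotaSizeLabel is an assumed label key under which the requested
// quota size would be stored on the snapshot.
const quotaSizeLabel = "containerd.io/snapshot/quota-size"

// WithQuotaSize records the requested quota size (e.g. "128MB") as a
// snapshot label so the snapshotter can apply it when preparing the
// snapshot directory.
func WithQuotaSize(size string) snapshots.Opt {
	return func(info *snapshots.Info) error {
		if info.Labels == nil {
			info.Labels = map[string]string{}
		}
		info.Labels[quotaSizeLabel] = size
		return nil
	}
}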

yylt avatar Jun 16 '22 07:06 yylt

I built containerd to do some testing. Too many quotas are created in my environment.

[root@daocloud ~]# xfs_quota -xc "report -a -p -h"  | head -n 100
Project quota on /var/lib/kubelet (/dev/sdc)
                        Blocks
Project ID   Used   Soft   Hard Warn/Grace
---------- ---------------------------------
#0           820K      0      0  00 [------]
volume1048590      0     8E     8E  00 [------]
volume1048605      0     8E     8E  00 [------]
volume1048607      0     8E     8E  00 [------]
volume1048610      0     8E     8E  00 [------]

Project quota on /var/lib/containerd (/dev/sdb)
                        Blocks
Project ID   Used   Soft   Hard Warn/Grace
---------- ---------------------------------
#0          10.2G      0      0  00 [------]
#2              0      0      0  00 [------]
#3              0    15G    15G  00 [------]
#4            12K    15G    15G  00 [------]
#5              0    15G    15G  00 [------]
#6              0    15G    15G  00 [------]
#7              0    15G    15G  00 [------]
#8              0    15G    15G  00 [------]
#9              0    15G    15G  00 [------]
#10             0    15G    15G  00 [------]
#11             0    15G    15G  00 [------]
#12             0    15G    15G  00 [------]
#13             0    15G    15G  00 [------]
#14             0    15G    15G  00 [------]
#15             0    15G    15G  00 [------]
#16             0    15G    15G  00 [------]
#17             0    15G    15G  00 [------]
#18             0    15G    15G  00 [------]
#19             0    15G    15G  00 [------]
#20             0    15G    15G  00 [------]
#21             0    15G    15G  00 [------]
#22             0    15G    15G  00 [------]
#23             0    15G    15G  00 [------]
#24            4K    15G    15G  00 [------]
#25             0    15G    15G  00 [------]
#26             0    15G    15G  00 [------]
#27             0    15G    15G  00 [------]
#28             0    15G    15G  00 [------]
#29            4K    15G    15G  00 [------]
#30             0    15G    15G  00 [------]
#31             0    15G    15G  00 [------]
#32             0    15G    15G  00 [------]
#33             0    15G    15G  00 [------]
#34             0    15G    15G  00 [------]
#35            4K    15G    15G  00 [------]
#36            4K    15G    15G  00 [------]
#37             0    15G    15G  00 [------]
#38          540K    15G    15G  00 [------]
#39             0    15G    15G  00 [------]
#40             0    15G    15G  00 [------]
#41             0    15G    15G  00 [------]
#42             0    15G    15G  00 [------]
#43             0    10G    10G  00 [------]
#44             0    10G    10G  00 [------]
#45             0    10G    10G  00 [------]
#46             0    10G    10G  00 [------]
#47             0    10G    10G  00 [------]
#48             0    10G    10G  00 [------]
#49             0    10G    10G  00 [------]
#50             0    10G    10G  00 [------]
#51             0    10G    10G  00 [------]
#52             0    10G    10G  00 [------]
#53             0    10G    10G  00 [------]
#54             0    10G    10G  00 [------]
#55             0    10G    10G  00 [------]
#56             0    10G    10G  00 [------]
#57             0    10G    10G  00 [------]
#58             0    10G    10G  00 [------]
#59             0    10G    10G  00 [------]
#60             0    10G    10G  00 [------]
#61             0    10G    10G  00 [------]
#62             0    10G    10G  00 [------]
#63             0    10G    10G  00 [------]

I set the size to 10G at first and 15G later.

[root@daocloud ~]# xfs_quota -xc "report -a -p -h"  | wc -l
1956507

Yes, it is because the quota is set on committed layers too, but that is not expected.
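
A guard along these lines could avoid creating quota entries for committed layers. A hypothetical sketch; maybeSetQuota and setProjectQuota are illustrative names, not code from this PR:

package overlay

import "github.com/containerd/containerd/snapshots"

// maybeSetQuota applies a project quota only to active (writable)
// snapshots; committed layers are read-only and, as the report above
// shows, should not each receive their own quota entry.
func maybeSetQuota(kind snapshots.Kind, dir, size string) error {
	if kind != snapshots.KindActive {
		return nil // skip committed and view snapshots
	}
	return setProjectQuota(dir, size)
}

// setProjectQuota stands in for the actual helper wrapping quotactl(2).
func setProjectQuota(dir, size string) error { return nil }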

yylt avatar Jun 27 '22 09:06 yylt

I got some new errors with the latest code.

Jul 06 14:57:03 paco containerd[10220]: time="2022-07-06T14:57:03.299383030+08:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
Jul 06 14:57:03 paco containerd[10220]: time="2022-07-06T14:57:03.299968488+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.overlayfs" error="Failed to set quota limit for projid 1 on /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/backingFsBlockDev: function not implemented"
Jul 06 14:57:03 paco containerd[10220]: time="2022-07-06T14:57:03.300169800+08:00" level=warning msg="could not use snapshotter overlayfs in metadata plugin" error="Failed to set quota limit for projid 1 on /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/backingFsBlockDev: function not implemented"

And the quota is not set as expected.

pacoxu avatar Jul 06 '22 06:07 pacoxu

I got some new errors with the latest code.

Jul 06 14:57:03 paco containerd[10220]: time="2022-07-06T14:57:03.299383030+08:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
Jul 06 14:57:03 paco containerd[10220]: time="2022-07-06T14:57:03.299968488+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.overlayfs" error="Failed to set quota limit for projid 1 on /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/backingFsBlockDev: function not implemented"
Jul 06 14:57:03 paco containerd[10220]: time="2022-07-06T14:57:03.300169800+08:00" level=warning msg="could not use snapshotter overlayfs in metadata plugin" error="Failed to set quota limit for projid 1 on /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/backingFsBlockDev: function not implemented"

And the quota is not set as expected.

Hi, could you give some suggestions?

I first checked whether the block device had disappeared, but it had not. Then I guessed that GC of the uintptr variable caused the failure, but that is not the truth either.

yylt avatar Jul 27 '22 09:07 yylt

I tried the command GOGC=off containerd -c myconfig.toml and it always succeeded. The code below is what was used, but normally it does not work:

	// Broken: this passes the Go string's data pointer directly; the string
	// is not NUL-terminated, so the kernel may read trailing garbage.
	_, _, errno := syscall.Syscall6(syscall.SYS_QUOTACTL, uintptr(qcmd(Q_XSETQLIM, XFS_PROJ_QUOTA)),
		(*[2]uintptr)(unsafe.Pointer(&backingFsBlockDev))[0], uintptr(projectID),
		uintptr(unsafe.Pointer(&fd)), 0, 0)

I also used bpftrace to debug; the __x64_sys_quotactl retval could be 0, -2 (ENOENT), or -20 (ENOTDIR):

bpftrace -e 'kretprobe:__x64_sys_quotactl { printf("%s, retval %d\n", kstack, retval); }'

Finally, converting the string to a uintptr as below works fine:

	// Append a NUL terminator so the kernel sees a valid C string; taking
	// &devbyte[0] also keeps the backing array live across the syscall.
	devbyte := append([]byte(backingFsBlockDev), 0)

	_, _, errno := syscall.Syscall6(syscall.SYS_QUOTACTL, uintptr(qcmd(Q_XSETQLIM, XFS_PROJ_QUOTA)),
		uintptr(unsafe.Pointer(&devbyte[0])), uintptr(fd.id),
		uintptr(unsafe.Pointer(&fd)), 0, 0)
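
An alternative to hand-rolling the conversion is golang.org/x/sys/unix.Quotactl, which NUL-terminates the device path internally. A sketch under that assumption, with the XFS command constants defined locally and the fs_disk_quota structure elided:

package quota

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

const (
	qXSetQLim    = 0x5804 // Q_XSETQLIM = XQM_CMD(4): set XFS quota limits
	xfsProjQuota = 2      // PRJQUOTA: project quota type
)

// qcmd mirrors the kernel's QCMD macro: (cmd << SUBCMDSHIFT) | type.
func qcmd(cmd, qtype int) int { return (cmd << 8) | (qtype & 0x00ff) }

// setProjectLimit applies XFS project-quota limits; d must point at a
// populated fs_disk_quota structure (definition omitted for brevity).
func setProjectLimit(backingFsBlockDev string, projectID uint32, d unsafe.Pointer) error {
	return unix.Quotactl(qcmd(qXSetQLim, xfsProjQuota), backingFsBlockDev,
		int(projectID), (*byte)(d))
}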

yylt avatar Jul 29 '22 08:07 yylt

@yylt Do you have time to follow up on this? If not, I will work based on this PR later.

Feel free to close mine if you have time again.

pacoxu avatar Mar 06 '24 03:03 pacoxu

@yylt Do you have time to follow up on this? If not, I will work based on this PR later.

Feel free to close mine if you have time again.

Yes, I have time to complete the feature.

But can you provide more guidance on the next steps? Aside from rebasing the branch and submitting the code, I'm not sure whether more discussion is needed. @pacoxu

yylt avatar Mar 06 '24 04:03 yylt

Yes, I have time to complete the feature.

Great.

But can you provide more guidance on the next steps? Aside from rebasing the branch and submitting the code, I'm not sure whether more discussion is needed.

A rebase is needed first.

I am not sure about the future of this feature, but we have similar requirements as well; users want behavior similar to Docker in this case.

  • From the Kubernetes perspective, a pod will be evicted if its disk usage reaches the limit in an emptyDir.

Questions we need to discuss about this feature:

  1. Collect clearer user stories: as far as I know, some users want a limited pod disk where the pod keeps working even after it reaches the limit (something like a pod running in maintenance mode, which also avoids node disk pressure). Actions like backup/archive/save-checkpoint are needed before removing a pod with a full disk.
  2. Should we support it in CRI? Then the kubelet could use this feature to define the quota later. It could be a configuration per node, per pod, or even per container.
    • Is this a containerd configuration? Will CRI overwrite the configuration?
  3. I am not sure if this was discussed in containerd meeting before. @dmcgowan @AkihiroSuda
  4. Is this the only solution? I heard there is another proposal based on the blockfile snapshotter to implement similar limits for disk and IO.

Discussions should be in the issue IMO. So I commented in https://github.com/containerd/containerd/issues/759#issuecomment-1982610845.

pacoxu avatar Mar 06 '24 08:03 pacoxu

The main implementation steps for now are as follows:

  1. The snapshotter is used to enable quota. If quota cannot be enabled, an error is reported and the process exits.
  2. CRI is used to set the default quota size. When the snapshotter supports quota configuration, the quota is set; when it does not, nothing is done (see the sketch below).
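
For illustration, the CRI side of step 2 might look like the following sketch; criConfig, snapshotterSupportsQuota, and the WithQuotaSize stub are illustrative names based on the description above, not the PR's final API:

package cri

import "github.com/containerd/containerd/snapshots"

// criConfig carries the default quota size from the CRI config
// (default_snapshot_quota_size in the example config above).
type criConfig struct {
	DefaultSnapshotQuotaSize string // e.g. "2M"
}

// WithQuotaSize is the snapshot option sketched earlier in the thread.
func WithQuotaSize(size string) snapshots.Opt {
	return func(info *snapshots.Info) error {
		if info.Labels == nil {
			info.Labels = map[string]string{}
		}
		info.Labels["containerd.io/snapshot/quota-size"] = size
		return nil
	}
}

// rootfsSnapshotOpts attaches the default quota only when the
// snapshotter can enforce it; otherwise it returns no options,
// matching step 2 above.
func rootfsSnapshotOpts(cfg criConfig, snapshotterSupportsQuota bool) []snapshots.Opt {
	var opts []snapshots.Opt
	if snapshotterSupportsQuota && cfg.DefaultSnapshotQuotaSize != "" {
		opts = append(opts, WithQuotaSize(cfg.DefaultSnapshotQuotaSize))
	}
	return opts
}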

@dmcgowan @pacoxu If there are more discussions or suggestions, I am ready to make modifications :)

yylt avatar Mar 07 '24 08:03 yylt

@AkihiroSuda Hi, all the code is pure Go. Do you have any comments yet?

yylt avatar Apr 02 '24 09:04 yylt

/retest

yylt avatar Apr 03 '24 12:04 yylt

@yylt: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 03 '24 12:04 k8s-ci-robot

Is it necessary to use the new version of the package in this way to use fs magic? This looks quite ugly.

require github.com/containerd/continuity v0.4.4-0.20240406122502-56a67e3ba427

yylt avatar Apr 08 '24 02:04 yylt

Is it necessary to use the new version of the package in this way to use fs magic? This looks quite ugly.

require github.com/containerd/continuity v0.4.4-0.20240406122502-56a67e3ba427

Can this be done in a separate PR?

pacoxu avatar Apr 11 '24 03:04 pacoxu

Can this be done in a separate PR?

That doesn't seem appropriate, because overlay's judgement of whether the filesystem supports setting quotas requires the fs magic values.
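
For context, the fs-magic check in question amounts to a statfs(2) probe on the snapshotter root. A minimal sketch, with the magic constants taken from the kernel headers; the exact set of supported filesystems here is an assumption:

package quota

import "golang.org/x/sys/unix"

const (
	xfsSuperMagic  = 0x58465342 // XFS_SUPER_MAGIC ("XFSB")
	ext4SuperMagic = 0xef53     // EXT4_SUPER_MAGIC
)

// backingFsSupportsQuota reports whether the filesystem backing path is
// one we know how to set project quotas on.
func backingFsSupportsQuota(path string) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	switch st.Type {
	case xfsSuperMagic, ext4SuperMagic:
		return true, nil
	}
	return false, nil
}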

yylt avatar Apr 11 '24 03:04 yylt

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 07 '24 04:05 k8s-ci-robot