
feat: Introduce minimum PVC allocation sizing by creating an allocation configuration + ConfigMap for topolvm-controller

jakobmoellerdev opened this issue 11 months ago · 21 comments

This PR introduces a minimum size of 300 MiB for any PVC that is requested with filesystem mounting and the XFS driver. It also refactors the test cases to be table-driven so that it's easier to modify them and see what's wrong.

Why? Since we removed the 1Gi allocation, we should set sensible defaults that still let users benefit from our rounding but that do not go so low as to break the mounting procedure.

What was changed?

I introduced a viper-based configuration file that is deployed by default with a ConfigMap named {{ template "topolvm.fullname" . }}-controller and that contains a file named topolvm-controller-config.yaml. This file can contain all flags from the command line plus additional fields (such as the allocation configuration) that would be hard or impossible to set via the command line because they are too nested. Command-line flags always overwrite the values in the config file if conflicts arise. The file can be located in the working directory of the binary or in /etc/topolvm. (I originally tried to express all of this in command-line flags only, but realised that the fields would be far too convoluted.)

The default allocation configuration portion for minimum sizes in the file looks like this:

controller-server-settings:
    allocation:
      minimum:
        # default minimum allocation sizes are used if not specified per device classes.
        default:
          # PVCs with VolumeMode Block will be allocated with this size at minimum.
          block: 100Mi
          # PVCs with VolumeMode Filesystem will be allocated with these sizes at minimum.
          filesystem:
            xfs: 300Mi
            ext4: 100Mi
#      deviceClasses:
#        ssd:
#          block: 1Gi
#          filesystem:
#            xfs: 20Mi

Defaults for ext4 and xfs are provided in the Helm Chart, while btrfs or other filesystems can be added later on. Device Class specific configuration allows specifying different defaults for different storage configurations.

Note that setting 0 or negative values in these configurations disables the allocation sizing for that configuration.
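Combining the defaults with the commented-out deviceClasses section above, disabling the minimum for one device class could look like this (a sketch against the schema shown above; the `ssd` class name is only illustrative):

```yaml
controller-server-settings:
  allocation:
    minimum:
      default:
        filesystem:
          xfs: 300Mi
      deviceClasses:
        ssd:
          filesystem:
            xfs: 0  # 0 or a negative value disables minimum sizing for xfs on this class
```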

How to reproduce? Create a PVC with something like

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc
spec:
  storageClassName: sg1 # should be with default or explicit xfs setting
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi

One will notice the following error during mounting

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    2m4s                default-scheduler  Successfully assigned default/test-pod to cnv-qe-infra-23.cnvqe2.lab.eng.rdu2.redhat.com
  Warning  FailedMount  59s (x8 over 2m4s)  kubelet            MountVolume.SetUp failed for volume "pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a" : rpc error: code = Internal desc = mount failed: volume=17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5, error=format of disk "/dev/topolvm/17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5" failed: type:("xfs") target:("/var/lib/kubelet/pods/8b4ae8e7-55c0-45d5-805b-02496d66b782/volumes/kubernetes.io~csi/pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a/mount") options:("nouuid,defaults") errcode:(exit status 1) output:(size 3072 of data subvolume is too small, minimum 4096 blocks
....  (cut for brevity)

Additionally, if one then increases that block amount to, let's say, 16 MiB, one will eventually face (with a recent enough version of mkfs):

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    2m4s                default-scheduler  Successfully assigned default/test-pod to cnv-qe-infra-23.cnvqe2.lab.eng.rdu2.redhat.com
  Warning  FailedMount  59s (x8 over 2m4s)  kubelet            MountVolume.SetUp failed for volume "pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a" : rpc error: code = Internal desc = mount failed: volume=17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5, error=format of disk "/dev/topolvm/17c48e86-3a9b-4ef0-83c1-698e0d6fd7b5" failed: type:("xfs") target:("/var/lib/kubelet/pods/8b4ae8e7-55c0-45d5-805b-02496d66b782/volumes/kubernetes.io~csi/pvc-1bba5d94-aa42-4f8e-a75e-fe16debdbd9a/mount") options:("nouuid,defaults") errcode:(exit status 1) output:(Filesystem must be larger than 300MB.)
....  (cut for brevity)

due to the change here: https://www.spinics.net/lists/linux-xfs/msg63099.html & https://www.spinics.net/lists/linux-xfs/msg63831.html

jakobmoellerdev avatar Mar 06 '24 12:03 jakobmoellerdev

IMO, this kind of problem should be handled in the filesystem layer. Since this problem does not seem to happen with the latest mkfs.xfs, I'd like to keep the current code as simple as possible rather than applying this PR. @llamerada-jp What do you think?

satoru-takeuchi avatar Mar 07 '24 05:03 satoru-takeuchi

Just to make clear: this will happen in the latest versions of mkfs with 300 MiB. Before, it would happen with 16 MiB, and the filesystem would be almost unusable. I can also introduce an E2E case to confirm this.

jakobmoellerdev avatar Mar 07 '24 08:03 jakobmoellerdev

Also, could you kindly tell me what you mean by the "filesystem layer"? AFAIK the only alternative to the rounding would be to resize the volume during the mount procedure.

A Pod using a PVC with the StorageClass is completely unaware of this limitation.

jakobmoellerdev avatar Mar 07 '24 08:03 jakobmoellerdev

Originally, the minimum allocation size was 1 GiB. Therefore, I do not object to setting a new minimum allocation. However, we should avoid hard-coding these numbers. Some sites may not want extra space to be allocated without their permission, and other filesystems may have similar limitations. I do not want to make similar changes every time that happens. Therefore, how about specifying this number as an argument of the program or in the storage class? Is this an excessive feature?

llamerada-jp avatar Mar 07 '24 09:03 llamerada-jp

@llamerada-jp I am not opposed to that idea; I think this is a good point. My only question is how we should set this up. From what I understand, the only way to provide dynamic sizing configuration would be either through a new set of flags/env variables, or through a new custom configuration in lvmd (in which case, however, we wouldn't work through the CSI spec). We are currently missing detailed ways to configure sizing limitations in the gRPC controller logic.

I am open to ideas here. I would like to keep the ability to configure not only flat limits, but also per-VolumeMode/type limits for specific filesystems.

I am also open to discuss a Design Doc and am willing to take the implementation, just need to know what preferences we have here.

EDIT: We could start by introducing a set of fields here: https://kubernetes-csi.github.io/docs/external-provisioner.html?highlight=parameters#storageclass-parameters

with a pattern like size.minimum.topolvm.io/xfs=300Mi and then use this as dynamic logic.

We could set:

  • size.minimum.topolvm.io/xfs
  • size.minimum.topolvm.io/ext4
  • size.minimum.topolvm.io/block
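Applied to a StorageClass, that pattern might look as follows (a sketch only; the parameter keys follow the proposal above and were never finalized, and the class name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sg1
provisioner: topolvm.io
parameters:
  csi.storage.k8s.io/fstype: xfs
  size.minimum.topolvm.io/xfs: 300Mi    # minimum for VolumeMode Filesystem with xfs
  size.minimum.topolvm.io/block: 100Mi  # minimum for VolumeMode Block
volumeBindingMode: WaitForFirstConsumer
```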

jakobmoellerdev avatar Mar 07 '24 09:03 jakobmoellerdev

I am open to ideas here. I would like to keep the configuration ability to configure not only flat limits, but also per VolumeMode/Type for specific Filesystems.

It seems better to set it in the storage class. Since fstype has already been set in the storage class, I think it is enough to add a storage class parameter as follows.

parameters:
  csi.storage.k8s.io/fstype: xfs
  topolvm.io/minimum-allocate(or minimum-size?): 300Mi

llamerada-jp avatar Mar 07 '24 09:03 llamerada-jp

Good point. Let me accommodate this change so that we can use it dynamically.

jakobmoellerdev avatar Mar 07 '24 11:03 jakobmoellerdev

I found a small problem when using the StorageClass to set new configurations: we cannot edit an existing SC. We have to recreate the SC with the same name to add the configuration. I believe it is only a small limitation since we can continue to use existing PVs/PVCs.

llamerada-jp avatar Mar 11 '24 00:03 llamerada-jp

I think this tradeoff needs to be made clear in the docs for downstreams like OpenShift, but for TopoLVM simply documenting this should be fine.

jakobmoellerdev avatar Mar 11 '24 06:03 jakobmoellerdev

I have a patch incoming for this PR, but I don't have the E2E test yet; it will take a bit more time.

jakobmoellerdev avatar Mar 11 '24 06:03 jakobmoellerdev

@jakobmoellerdev

Also could you kindly tell what you mean with the "filesystem layer"?

Oops, sorry. I meant the filesystem "tool" layer, in this case, mkfs.xfs. In addition, I misunderstood the purpose of this PR. I thought it was to stop the creation of too-small volumes.

I agree with adding a new parameter to SC as Yuji said.

I have a patch incoming for this PR but I don't have the E2E test yet, will take a bit more time

I'll review the updated version, thanks.

satoru-takeuchi avatar Mar 11 '24 07:03 satoru-takeuchi

Note to reviewers: Added the limitation generically since users might also want minimum sizes for block volumes

jakobmoellerdev avatar Mar 11 '24 17:03 jakobmoellerdev

@llamerada-jp @satoru-takeuchi I have switched to using minimum-allocated-size-filesystem and minimum-allocated-size-block because the same StorageClass might be used for both VolumeMode Filesystem and Block, but the user might only want to limit filesystem volumes, not block storage. Thus we need these two flags. I also added some unit tests and made it so that negative capacities result in a minimum of 0. Please have another look.
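Following the earlier parameters example, the two keys would be used along these lines (a sketch; the topolvm.io/ prefix is assumed from the earlier suggestion and may not match the final implementation):

```yaml
parameters:
  csi.storage.k8s.io/fstype: xfs
  topolvm.io/minimum-allocated-size-filesystem: 300Mi  # applies to VolumeMode Filesystem PVCs only
  topolvm.io/minimum-allocated-size-block: 100Mi       # applies to VolumeMode Block PVCs only
```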

jakobmoellerdev avatar Mar 12 '24 16:03 jakobmoellerdev

using minimum-allocated-size-filesystem and minimum-allocated-size-block because the same storageclass might be used for both VolumeMode Filesystem and Block but the user might only want to limit Filesystem Volumes, not block storage

I disagree with this specification. In this case, administrators should set up two storage classes because these are different specifications.

llamerada-jp avatar Mar 13 '24 00:03 llamerada-jp

@llamerada-jp I'm not sure why a separate StorageClass is necessary for this. StorageClasses are used as an abstraction over volume provisioning. If you force downstreams to create two storage classes, all users on the cluster will have to know which StorageClass to use in which scenario; e.g., when they specify the PVC, they will have to know which StorageClass is "safe" for their use case, or otherwise provisioning can fail.

How would a cluster user (who is not maintaining the deployment of the CSIDriver or StorageClass) know that a StorageClass ssd-with-filesystem-limit should be used over ssd-with-block-limit?

It's in the Kubernetes StorageClass specification that a StorageClass can and should be able to support both modes (and we do this right now). Why would a minimum allocation size suddenly change that?

Just because XFS PVCs need to be limited to 300 MiB doesn't mean that the administrator should be forced to create two StorageClasses if they want to allow block storage under this amount, IMO.

This will ultimately be a real consumption problem in LVMS as well, because we currently create exactly one StorageClass per VolumeGroup, which works out well for users because they know the mapping is 1:1. If we rely on two StorageClasses, consumption gets confusing, and I would like to avoid issues with that.

I originally agreed to your suggestion to accommodate dynamic limits, but now already-existing users cannot migrate their StorageClass safely because its parameters are immutable, which will force us to do an API change in LVMS. I don't want to make consumption of this protection from broken mounts even harder, to be honest.

jakobmoellerdev avatar Mar 13 '24 07:03 jakobmoellerdev

@jakobmoellerdev The use of the storage class may be excessive for this issue. I suppose using the controller's arguments might be better than using the SC to resolve the original issue. I apologize, as you have already done the coding. What do you think?

but now already existing users cannot migrate their storageClass safely because its parameters are immutable, which will force us to do an API change in LVMS. I don't want to make consumption of this protection from broken mounts even harder to be honest.

I took this lightly, but there are probably many administrators and sites that cannot accept this risk.

llamerada-jp avatar Mar 13 '24 10:03 llamerada-jp

No worries, I agree after that discussion that a controller arg makes sense, and I appreciate you walking through this with me! I think I am facing two issues here again, though:

  1. Would it be okay with you to create two flags for the controller configuration? Or should we even consider an environment variable or something similar so it can be mounted from a ConfigMap?
  2. How would we integrate this into the E2E tests? We would need to restart the controller for that, which I don't like.

jakobmoellerdev avatar Mar 13 '24 13:03 jakobmoellerdev

Thank you for your reply.

Would it be okay for you to create 2 flags for the controller configuration?

How about specifying the minimum allocation size per xfs/ext4/btrfs/block using the controller's arguments, since most of TopoLVM's existing parameters are passed as arguments? ConfigMaps and environment variables can be used to supply arguments.
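For instance, such arguments could be wired into the controller Deployment as follows (the flag names here are hypothetical; nothing was fixed at this point in the discussion):

```yaml
containers:
  - name: topolvm-controller
    args:
      - --minimum-allocation-xfs=300Mi
      - --minimum-allocation-ext4=100Mi
      - --minimum-allocation-block=100Mi
```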

How would we integrate this into the E2E tests, we would need to restart the controller for that which I don't like

My idea is to prepare an environment with a minimum allocation size of 300 MiB for xfs, 100 MiB for block, and none for ext4, using arguments. How about you create PVCs in this environment and check the actual size allocated?

llamerada-jp avatar Mar 14 '24 02:03 llamerada-jp

How about specifying the minimum allocation size per xfs/ext4/btrfs/block using the controller's argument? Since most of TopoLVM's existing parameters utilize arguments. ConfigMap and environment values can be used as arguments.

That works for me, I will adjust

My idea is to prepare an environment with a minimum allocation size for xfs of 300 MiB, for a block size of 100 MiB, and ext4 none using arguments. How about you create PVCs in this environment and check the actual size allocated?

What do you mean by environment? I would like to have an E2E test running for this, and the only way I know how to do this right now is by introducing a new set of Helm values in the E2E tests and then creating a new lane in GitHub Actions. I think this is a bit much for this test, though. Ideally, I would like to be able to test multiple settings of the Helm chart in different E2E tests.

jakobmoellerdev avatar Mar 14 '24 07:03 jakobmoellerdev

I had trouble specifying a configuration that was flexible enough (per device class, with sensible defaults) in just command-line flags, so I decided to incorporate a ConfigMap for the controller that can be used to specify more complex settings as well as flags. Command-line flags always take precedence. Please see the details in the PR description.

Let me know if this works out for you.

jakobmoellerdev avatar Mar 14 '24 17:03 jakobmoellerdev

I think using a ConfigMap and DeviceClass is overdoing it for this issue. Wouldn't it be better to specify the minimum allocated size via the controller's arguments this time? When there are different OS versions for nodes in a cluster, administrators may think about changing the settings for each device class. However, in that case, I want to carefully consider the configuration method based on more specific use cases and requirements when you need it. I believe it is good to discuss configuring this in detail again when actual requests arise.

llamerada-jp avatar Mar 15 '24 04:03 llamerada-jp

Thanks for taking a look. There is a reason I am already providing deviceClass-specific overrides. LVMS is primarily used on the device edge (hence the original request to go lower than 1Gi), and users running LVMS on the edge have custom-tuned filesystem and block device settings on the OS. In these scenarios they will want to override the defaults we provide (e.g., the 100 MiB default for block storage is a good candidate; the same goes for 100 MiB ext4). The reason I'm opening this PR is that for most people NOT using TopoLVM on the edge (e.g., datacenter or virtualization providers), these minimal sizes do not make sense and don't provide good defaults.

I would like to keep these overrides for this reason. If one feels the deployment is too complex this way, or wants the legacy behavior, one can disable the ConfigMap completely or not set any overrides, which keeps the configuration easy.

If we want to provide users with a way to configure defaults without using the ConfigMap, and set the ConfigMap default to false, that is also fine with me. I could, for example, offer four flags for default configuration (block, ext4, xfs, btrfs) in the CLI and use these by default instead of the ConfigMap in the Helm chart.
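That fallback might be surfaced in the Helm chart's values like this (a sketch; the value names are hypothetical and would map to the four CLI flags mentioned above):

```yaml
controller:
  minimumAllocationSizes:  # hypothetical values, one per CLI flag
    block: 100Mi
    ext4: 100Mi
    xfs: 300Mi
    btrfs: 100Mi
```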

However, I want to make it clear that this ConfigMap also solves another issue that the TopoLVM binaries have: currently, all TopoLVM binaries except lvmd are only configurable via flags, which goes against the 12-Factor App rules. We want to be able to configure them transparently. I think this will also make it much easier to introduce other configuration flags for volume creation/resizing behavior in the future.

Additionally, all of the PR behavior is tested in unit and E2E tests, so I do not believe there is a big chance of regression.

Please reconsider your decision after this as otherwise I will have no way in downstream to allow overrides for specific StorageClasses / deviceClasses.

jakobmoellerdev avatar Mar 15 '24 07:03 jakobmoellerdev

Could you tell me what custom-tuned filesystem and block device settings are? Does this mean small-sized fses/bdevs, like < 1 GiB?

IMO, the minimum allocation size does not need to be set for each node and device class, but is fine as a global value. So I still consider implementing it as a controller parameter to be enough.

12-Factor App Rules

Do you mean the following? https://12factor.net/config

I think the implication of this is to avoid hard-coding constants and to make them externally injectable. I think arguments fit this rule. I consider the operation cost small in both (a) changing command-line parameters (modify the Deployment and restart the corresponding pods) and (b) creating/changing a ConfigMap (environment variables) and restarting the Deployment.

llamerada-jp avatar Mar 18 '24 03:03 llamerada-jp

Could you tell me what custom tuned filesystem and block device settings are? Does this mean the small sized fses/bdevs like < 1GiB?

Sure: We have 2 different use cases for the differentiation by device Class:

  1. Customers want to use LVM Block Partitions for VM bootstrapping and want to use dedicated partitions for BIOS/UEFI which are significantly smaller or larger based on OS so they want a separate set of minimums. At the same time they do not want to change the default cluster configuration.
  2. Customers want a way to have a "way out" / Exception of the filesystem limitation if they format a partition with custom inodes and block counts (for xfs, they differ from the default block size and data sections, as well as agsize with older kernels and maxpct for the inode assignment; other factors include size of the log section). This is to support old legacy workloads that now get shifted into the cloud, as well as to hyperoptimize frequently used partitions that have to stay isolated from other partitions due to compliance and device concerns.

I understand these are edge cases, and I also understand that this requires good testing coverage to be accepted, which I have provided.

IMO, the minimum allocation size is not necessary to set for each node and device class, but is OK to be a global value. So I still consider implementing as controller's parameter is enough.

I agree that for most users a controller parameter is enough. How about I introduce controller parameters for the global settings so people don't need to use the ConfigMap?

12-Factor App Rules

Do you mean the following? https://12factor.net/config

I think the implication of this is to avoid hard-coding constants and make them externally injectable. I think the argument fits this rule. I consider the operation cost is small in both (a) changing command line parameters (modify Deployment and restart the corresponding pods) and (b) creating/changing ConfigMap(environment variables) and restarting Deployment.

To be honest, I agree there; no objection, this shouldn't affect the decision. However, config files allow nested fields, while that is hard to implement with pure flags.

EDIT: Even though I would like to see it implemented that way, if there is no chance I can convince you, I would like to propose changing the implementation to set the ConfigMap default to false in the Helm chart, to include four flags (default minimum for xfs, btrfs, ext4, block), and then to use these for the tests.

jakobmoellerdev avatar Mar 18 '24 08:03 jakobmoellerdev

@llamerada-jp could you give me a quick heads up on your opinion on what I said above? Just want to make sure this doesn't get lost.

jakobmoellerdev avatar Mar 21 '24 08:03 jakobmoellerdev

@jakobmoellerdev We are discussing this internally and going to get back to you tomorrow.

llamerada-jp avatar Mar 21 '24 09:03 llamerada-jp

Sure: We have 2 different use cases for the differentiation by device Class:

  1. Customers want to use LVM Block Partitions for VM bootstrapping and want to use dedicated partitions for BIOS/UEFI which are significantly smaller or larger based on OS so they want a separate set of minimums. At the same time they do not want to change the default cluster configuration.

I'd like to know about "a separate set of minimums" in detail. I wonder why the minimum allocation size might be changed for each deviceClass. In my understanding, the minimum allocation sizes are only for avoiding fs mount failures. So I still consider it a global configuration that exists for each fs. Could you tell me a specific case where the minimum size differs from the global value (e.g., 300 MiB for XFS)? To be honest, I can't imagine such cases, probably because I am not familiar with edge computing.

  2. Customers want a way to have a "way out" / Exception of the filesystem limitation if they format a partition with custom inodes and block counts (for xfs, they differ from the default block size and data sections, as well as agsize with older kernels and maxpct for the inode assignment; other factors include size of the log section). This is to support old legacy workloads that now get shifted into the cloud, as well as to hyperoptimize frequently used partitions that have to stay isolated from other partitions due to compliance and device concerns.

I guess what you really want to accomplish in the end is setting not only the minimum size but also many fs parameters, like agsize and maxpct, for each device class. Is that correct?

I would like to see it implemented that way, if there is no chance I can convince you I would like to propose changing the implementation to put the config map default to false in the helm chart and to include 4 flags (default minimum for xfs,btrfs,ext4,block) and then use these for the tests.

Introducing a new ConfigMap only for testing is too much.

llamerada-jp avatar Mar 22 '24 05:03 llamerada-jp

I'd like to know about "a separate set of minimums" in detail. I wonder why minimum allocation size might be changed for each deviceClass. In my understanding, the minimum allocation sizes are only for avoiding fs mount failure. So I still consider that it's a global configuration exists for each fs. Could you tell me a specific case where the minimum size differs from the global value (e.g. 300 MiB for XFS)? To be honest, I can't imagine such cases probably because I am not familiar with edge computing.

I think you are referring to the fact that the xfs configuration would be global. And that is true for one node. However, if we have two deviceClasses for two different nodes or node sets, then this changes. If node A has a different fs default than node B, then I would have to use two different controller configurations.

However, I did notice that I had not considered that most users who want custom formatting could just use block devices and format the devices themselves. A common way to optimize formatting of ext4, for example, is by disabling certain options that are not necessary on multi-gigabyte devices: ^has_journal,^uninit_bg,^ext_attr,^huge_file,^64bit. This means that technically only deviceClass specifics for block devices would be necessary. And even I consider that a big edge case.

I guess what you really want to accomplish at long last is setting not only minimum size but also many fs parameters like agsize and maxpct for each device class. Does it correct?

Yes, that is correct. But since TopoLVM cannot do this right now, the workaround for users is to create block devices and do fs mount/unmount themselves.
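The workaround mentioned here is to request a raw block volume and let the workload run mkfs itself, for example:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  storageClassName: sg1  # illustrative, as in the reproduction example above
  volumeMode: Block      # bypasses TopoLVM's filesystem formatting
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```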

Introducing a new ConfigMap only for testing is too much.

The ConfigMap (at least in my opinion) is a much better way to configure an application than using CLI flags for everything. This is why almost every major solution at some point starts offering file-based configuration instead of args: args are hard to document and control/audit. But I see that we have trouble agreeing on the value-add here as well, so let me push that to a separate PR so as not to block this effort further.

With the above said:

  1. Let's just add global defaults for now, without the config file, as you suggested, to address your request to scope down the PR
  2. Let's talk about custom fs options + minimum sizes in a separate issue?

jakobmoellerdev avatar Mar 22 '24 06:03 jakobmoellerdev

I was focusing only on avoiding errors in mkfs by adding a minimum allocation size in this PR. Therefore, I did not consider extensions for other mkfs options. There may have been a misunderstanding there.

  1. Lets just add global defaults for now without config file like you suggested to appeal to your request to scope down the PR
  2. Lets talk about custom fs options+minimum sizes in a separate issue?

I agree with this suggestion. Judging from your examples, the design of mkfs options other than the minimum allocation size will be complicated, so it would be better to make a proposal in a separate PR and discuss it there. I believe the default values we are adding in this PR and the new options we are going to consider in the proposal are not in conflict. @satoru-takeuchi What do you think?

llamerada-jp avatar Mar 25 '24 08:03 llamerada-jp

Sorry for the late reply. I was on paid leave yesterday and the day before yesterday.

@llamerada-jp I agree with you.

@jakobmoellerdev Please update this PR in accordance with the following description.

With the above said:

  1. Lets just add global defaults for now without config file like you suggested to appeal to your request to scope down the PR
  2. Lets talk about custom fs options+minimum sizes in a separate issue?

FYI, in general, I agree with the following your opinion.

The ConfigMap (at least in my opinion) is a much better way to configure an application than using CLI flags for everything. This is why almost every major solution at some point starts offering filebased configuration instead of args, because args are hard to document and control/audit. But I see that we have trouble arguing for the value add here as well so let me push that to a separate PR so as not to block this effort here further.

We are afraid to introduce a ConfigMap just for edge cases. I'd be glad if the design proposal of "ConfigMap-based custom fs options + minimum sizes" were also beneficial for non-edge users.

satoru-takeuchi avatar Mar 27 '24 07:03 satoru-takeuchi