zos bcachefs assessment
Assess how we can use bcachefs on zos
related issues:
- #2074
- #2229
- #2374
Is your feature request related to a problem? Please describe
Why we need to move away from btrfs:
- ...
Why bcachefs:
- improve HDD performance by employing an SSD as the internal cache
- with improved HDD performance -> more HDD usage -> more economical
- ....
scope:
- `bcachefs` will only be used for the workloads, because .....
- not using LVM
Describe the solution you'd like
The assessment will be done in two phases.
- backward compatibility check
  We do this check because we need to know how btrfs is currently used in zos, for these reasons:
  - seeing which `btrfs` features are used in the current `zos` can help us design the `bcachefs` usage. Examples: how the subvolume limit & usage are used, how the `nocow` file attribute is currently used
  - I am also new to `zos`, and need to understand the full flow that relates to the use of `btrfs`
For things that are compatible: good.
For incompatible things:
- check if we really need it
- if needed, how we work around it
- plan/specs to use `bcachefs` on `zos`
  - it doesn't need to be backward compatible
  - employ the multi-device filesystem features of `bcachefs`
  - one idea is to create one partition for each VM; this way we won't have an issue with quota, but another trade-off might come up, as briefly mentioned by Azmy at https://github.com/threefoldtech/zos/issues/2074#issuecomment-2096058113
cc @delandtj
backward compatibility check
This check involves porting the current btrfs code to bcachefs; it is WIP in #2375
Deep diving into the code is expected to give more understanding, although the function call flow is not always obvious because of zbus usage. (zbus is a good thing, we only need to be more thorough when tracing the call flow.)
No support for subvolume limit/quota
what we really need:
- set the usage limit of the allocated subvolume/workload
how the subvolume limit is used:
a. set the limit on the zos cache: no issue here, we will keep it on btrfs
b. when creating a volume for a container
- created by calling `VolumeCreate` during container creation
- the volume will be used as an overlay mount on top of the provided flist https://github.com/threefoldtech/zos/blob/0ea61706e1a501d4e774a9195c139e2995bdd1cb/pkg/flist/flist.go#L457
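For context, the overlay mount described above can be sketched roughly as follows. This is a hedged illustration, not the actual zos code: the helper name `overlayData` and the paths are assumptions, and the `rw`/`work` subdirectory layout is just the conventional overlayfs split of the writable volume.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayData builds the mount data string for an overlayfs mount where the
// read-only flist is the lower layer and the created volume provides the
// writable upper and work directories. Hypothetical helper, not zos code.
func overlayData(flistRO, volume string) string {
	opts := []string{
		"lowerdir=" + flistRO,
		"upperdir=" + volume + "/rw",
		"workdir=" + volume + "/work",
	}
	return strings.Join(opts, ",")
}

func main() {
	// The kernel would receive this string via mount(2) with fstype "overlay".
	fmt.Println(overlayData("/var/cache/flist/ro", "/data/vol"))
	// prints lowerdir=/var/cache/flist/ro,upperdir=/data/vol/rw,workdir=/data/vol/work
}
```

The point for the assessment: the writable layer is where the quota/limit matters, regardless of whether btrfs or bcachefs backs it.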
possible solutions:
- use a disk image instead of a volume, but it will be slower
- `lvm` is not a choice for us -> ..... need explanation ....
- use one partition for each VM/container: it is quite hard to manage a lot of partitions
- use stratis to manage them, but it would need added support for `bcachefs`: possibly a lot of work?
- does `bcachefs` have a plan to support subvolume quota? (looks like not) `usrquota`, `prjquota`, `grpquota` in `bcachefs`
c. on VolumeUpdate https://github.com/threefoldtech/zos/blob/0ea61706e1a501d4e774a9195c139e2995bdd1cb/pkg/primitives/volume/volume.go#L98
it is used by:
d. on `pkg/flist` https://github.com/threefoldtech/zos/blob/0ea61706e1a501d4e774a9195c139e2995bdd1cb/pkg/flist/flist.go#L472
it is used by qsfsd when ....
No support for the `FS_NOCOW_FL` file attribute
what we really need:
- set the disk image file as `nocow`; it is supposed to give better performance
possible solution:
- leave it as `cow`; performance is lower, but not by much (need proof?)
No subvolume info command
what we really need:
- to know subvolume disk usage
possible solutions:
we don't really need it. Subvolume disk usage is only really needed when there is no limit on the subvolume, and the only occurrence of this is when we create the zdb cache.
The zdb cache disk usage is counted using its own method.
the current `lsblk` doesn't have bcachefs support
what we really need: Get disk label/fstype on startup
**solution**: Maxus will upgrade it
Specification
The new bcachefs based storage must provide all the features provided by the btrfs based storage.
Backward compatibility
Because all disks of the old nodes are already formatted with btrfs, we only support new nodes.
bcachefs only for the workloads
The root filesystem still uses btrfs with its /var/run/cache
multidevice filesystem strategy
bcachefs supports a real pool, where multiple devices can be formatted into a single filesystem:
- a filesystem is created from SSD(s) and HDD(s)
caching
writeback caching:
- write to the fast device (SSD)
- a background worker periodically moves data from the fast device to the slow device
- when reading, the data will be copied to the fast device if it is not already there
config

```
--foreground_target=ssd
--background_target=hdd
--promote_target=ssd
```
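To make the config above concrete, the full format invocation could be assembled as in the sketch below. The device paths and the `ssd.ssd1`/`hdd.hdd1` label names are illustrative assumptions; only the three target options come from the config above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// mkfsArgs assembles the mkfs.bcachefs argument list for a pool of SSDs and
// HDDs, with the SSD group as the foreground/promote target and the HDD
// group as the background target. Labels use bcachefs's group.name syntax.
func mkfsArgs(ssds, hdds []string) []string {
	args := []string{
		"--foreground_target=ssd",
		"--background_target=hdd",
		"--promote_target=ssd",
	}
	for i, dev := range ssds {
		args = append(args, fmt.Sprintf("--label=ssd.ssd%d", i+1), dev)
	}
	for i, dev := range hdds {
		args = append(args, fmt.Sprintf("--label=hdd.hdd%d", i+1), dev)
	}
	return args
}

func main() {
	cmd := exec.Command("mkfs.bcachefs", mkfsArgs([]string{"/dev/sda"}, []string{"/dev/sdb"})...)
	fmt.Println(cmd.String()) // print only; actually running this would format the disks
}
```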
quota management
- no limit/quota for now
the language (Rust or Go)
Rust is the way to go, but the prototype can be built using Go.
`mkfs.bcachefs` also has these options, worth checking:

```
--usrquota  Enable user quotas
--grpquota  Enable group quotas
--prjquota  Enable project quotas
```
There was drama on LKML about bcachefs: https://www.phoronix.com/news/Bcachefs-Fixes-Two-Choices.
Or "take your toy and go home", effectively alluding to taking it out of the mainline Linux kernel and going back to developing it out-of-tree.
The risk is that bcachefs could end up out of the mainline kernel.
So, we observe and wait for now.