
zos bcachefs assessment

Open iwanbk opened this issue 1 year ago • 4 comments

Assess how we can use bcachefs on zos

related issues:

  • #2074
  • #2229
  • #2374

Is your feature request related to a problem? Please describe

Why we need to move away from btrfs:

  • ...

Why bcachefs:

  • improve HDD performance by using an SSD as an internal cache
  • better HDD performance -> more HDD usage -> more economical
  • ....

scope:

  • bcachefs will only be used for the workload, because.....
  • not using LVM

Describe the solution you'd like

The assessment will be done in two phases

  1. backward compatibility check

We do this check because we need to know how btrfs is currently used in zos for these reasons:

  • seeing how btrfs features are used in current zos helps us design the bcachefs usage. Examples: how subvolume limit and usage are used, and how the nocow file attribute is currently used
  • I am also new to zos and need to understand the full flow related to btrfs usage

For things that are compatible: good.

For incompatible things:

  • check if we really need it
  • if needed, how we work around that
  2. plan/specs to use bcachefs on zos
  • it doesn't need to be backward compatible
  • employ multi device filesystem features of bcachefs
  • one idea is to create one partition for each VM. This way we won't have issues with quota, but it brings another trade-off, as briefly mentioned by Azmy at https://github.com/threefoldtech/zos/issues/2074#issuecomment-2096058113

cc @delandtj

iwanbk, Aug 14 '24 03:08

backward compatibility check

This check involves porting the current btrfs code to bcachefs; it is WIP in #2375. Deep diving into the code should give a better understanding, although function calls are not always obvious because of zbus usage. (zbus is a good thing; we only need to be more thorough when tracing the call flow.)

No support for subvolume limit/quota

what we really need:

  • set the usage limit of the allocated subvolume/workload

how the subvolume limit is used:

a. set limit on zos cache: no issue here, we will keep it on btrfs

b. when creating volume for a container

  • created by calling VolumeCreate during container creation
  • the volume will be used as an overlay mount on top of the provided flist https://github.com/threefoldtech/zos/blob/0ea61706e1a501d4e774a9195c139e2995bdd1cb/pkg/flist/flist.go#L457

possible solutions:

  • use disk image instead of volume, but it will be slower
  • lvm is not a choice for us -> ..... need explanation ....
  • use one partition for each VM/container: it is quite hard to manage a lot of partitions
  • use stratis to manage it, but we would need to add bcachefs support: possibly a lot of work?
  • does bcachefs have a plan to support subvolume quota? (looks like not)
  • usrquota, prjquota, grpquota in bcachefs
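
To make the first option above (a disk image instead of a volume) more concrete: a fixed-size sparse file caps what a workload can consume without relying on any filesystem quota. The following is only a minimal Go sketch of that idea. `createSparseImage`, the file name and the 1 MiB size are hypothetical and are not taken from the zos code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createSparseImage creates a file of exactly size bytes without allocating
// its blocks up front. The image size acts as the hard usage limit.
// Hypothetical helper, not part of zos.
func createSparseImage(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	if err := f.Truncate(size); err != nil {
		f.Close()
		os.Remove(path)
		return err
	}
	return f.Close()
}

// demoImageSize creates a 1 MiB image in a temporary directory and returns
// the apparent size reported by the filesystem.
func demoImageSize() (int64, error) {
	dir, err := os.MkdirTemp("", "img")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "disk.img")
	if err := createSparseImage(path, 1<<20); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := demoImageSize()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("image size in bytes:", size)
}
```

The image would still have to be formatted and attached to the workload, which is where the extra overhead mentioned in the list comes from.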

c. on VolumeUpdate https://github.com/threefoldtech/zos/blob/0ea61706e1a501d4e774a9195c139e2995bdd1cb/pkg/primitives/volume/volume.go#L98 it is used by:

d. on pkg/flist https://github.com/threefoldtech/zos/blob/0ea61706e1a501d4e774a9195c139e2995bdd1cb/pkg/flist/flist.go#L472 it is used by qsfsd when ....

No support for FS_NOCOW_FL file attribute

what we really need:

  • set the disk image file as nocow, which is supposed to give better performance

possible solution:

  • leave it as cow; performance is lower, but not by much (need proof?)
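
For reference, this is roughly what "try nocow, otherwise leave it as cow" could look like in Go. It is a sketch under assumptions, not the zos implementation: the constants are the 64-bit Linux values from `<linux/fs.h>`, and `withNoCow` and `trySetNoCow` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// ioctl request numbers and inode flag from <linux/fs.h> (64-bit Linux).
const (
	fsIocGetFlags = 0x80086601 // FS_IOC_GETFLAGS
	fsIocSetFlags = 0x40086602 // FS_IOC_SETFLAGS
	fsNoCowFl     = 0x00800000 // FS_NOCOW_FL
)

// withNoCow returns flags with the nocow bit added, keeping the other bits.
func withNoCow(flags uint32) uint32 {
	return flags | fsNoCowFl
}

// trySetNoCow reads the inode flags of f and writes them back with
// FS_NOCOW_FL set. It returns an error when the filesystem rejects either
// ioctl, so the caller can decide to keep the file as CoW.
// On btrfs the flag only takes effect on empty files.
func trySetNoCow(f *os.File) error {
	var flags uint32
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		fsIocGetFlags, uintptr(unsafe.Pointer(&flags))); errno != 0 {
		return errno
	}
	flags = withNoCow(flags)
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		fsIocSetFlags, uintptr(unsafe.Pointer(&flags))); errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Temporary file standing in for a disk image (hypothetical name).
	f, err := os.CreateTemp("", "disk-*.img")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err := trySetNoCow(f); err != nil {
		fmt.Println("nocow not applied, leaving the image as CoW:", err)
		return
	}
	fmt.Println("nocow set on", f.Name())
}
```

Treating the failure as non-fatal keeps the disk image creation path identical on both filesystems; only the performance characteristics differ.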

No subvolume info command

what we really need:

  • to know subvolume disk usage

possible solutions: we don't really need it. Subvolume disk usage is only really needed when there is no limit on the subvolume, and the only occurrence of this is when we create the zdb cache. zdb cache disk usage is counted using its own method.

current lsblk doesn't have bcachefs support

what we really need: Get disk label/fstype on startup

solution: Maxus will upgrade it
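
Once lsblk recognizes bcachefs, the startup check could be sketched as below. This assumes the data comes from `lsblk --json -o NAME,FSTYPE,LABEL`; how zos actually reads label/fstype is not described in this thread, and `devicesWithFS` and the sample output are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blockDevice mirrors the fields requested from lsblk. A JSON null
// (unformatted device) leaves the string field empty.
type blockDevice struct {
	Name     string        `json:"name"`
	FSType   string        `json:"fstype"`
	Label    string        `json:"label"`
	Children []blockDevice `json:"children"`
}

type lsblkOutput struct {
	BlockDevices []blockDevice `json:"blockdevices"`
}

// devicesWithFS returns the names of all devices and partitions whose
// fstype matches, walking children recursively.
func devicesWithFS(raw []byte, fstype string) ([]string, error) {
	var out lsblkOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	var names []string
	var walk func(devs []blockDevice)
	walk = func(devs []blockDevice) {
		for _, d := range devs {
			if d.FSType == fstype {
				names = append(names, d.Name)
			}
			walk(d.Children)
		}
	}
	walk(out.BlockDevices)
	return names, nil
}

// sampleLsblk is made-up output used only to exercise the parser.
const sampleLsblk = `{"blockdevices":[
  {"name":"sda","fstype":null,"label":null,
   "children":[{"name":"sda1","fstype":"btrfs","label":"cache"}]},
  {"name":"sdb","fstype":"bcachefs","label":null},
  {"name":"sdc","fstype":null,"label":null}
]}`

func main() {
	names, err := devicesWithFS([]byte(sampleLsblk), "bcachefs")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bcachefs devices:", names)
}
```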

iwanbk, Aug 14 '24 04:08

Specification

The new bcachefs based storage must provide all the features provided by the btrfs based storage.

Backward compatibility

Because all disks of the old nodes are already formatted with btrfs, we only support new nodes

bcachefs only for the workloads

The root filesystem still uses btrfs with its /var/run/cache

multidevice filesystem strategy

bcachefs supports a real pool, where multiple devices can be formatted into a single filesystem:

  • a filesystem is created from SSD(s) and HDD(s)

caching

writeback caching:

  • write to the fast device (SSD)
  • a background worker periodically moves data from the fast device to the slow device
  • when reading, the data will be copied to the fast device if it is not already there
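
The three behaviours listed above can be modelled with a small in-memory toy in Go. This is purely illustrative and says nothing about bcachefs internals; `tieredStore` and its methods are hypothetical names.

```go
package main

import "fmt"

// tieredStore is a toy model of writeback caching with two tiers.
type tieredStore struct {
	fast  map[string]string // stands in for the SSD
	slow  map[string]string // stands in for the HDD
	dirty map[string]bool   // written to fast, not yet moved to slow
}

func newTieredStore() *tieredStore {
	return &tieredStore{
		fast:  map[string]string{},
		slow:  map[string]string{},
		dirty: map[string]bool{},
	}
}

// Write lands on the fast tier only.
func (s *tieredStore) Write(key, val string) {
	s.fast[key] = val
	s.dirty[key] = true
}

// Flush plays the background worker: it moves dirty data from the fast
// tier to the slow tier.
func (s *tieredStore) Flush() {
	for key := range s.dirty {
		s.slow[key] = s.fast[key]
		delete(s.fast, key)
		delete(s.dirty, key)
	}
}

// Read serves from the fast tier when possible. On a miss it reads from the
// slow tier and copies the data to the fast tier (promotion).
// fromFast reports which tier served this particular read.
func (s *tieredStore) Read(key string) (val string, fromFast bool, ok bool) {
	if v, hit := s.fast[key]; hit {
		return v, true, true
	}
	v, hit := s.slow[key]
	if !hit {
		return "", false, false
	}
	s.fast[key] = v
	return v, false, true
}

func main() {
	s := newTieredStore()
	s.Write("block-1", "data")
	s.Flush()

	_, fromFast, _ := s.Read("block-1")
	fmt.Println("first read after flush served from fast tier:", fromFast)
	_, fromFast, _ = s.Read("block-1")
	fmt.Println("second read served from fast tier:", fromFast)
}
```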

config

--foreground_target=ssd
--background_target=hdd
--promote_target=ssd
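
As an illustration of how these options could be assembled for a multi-device format call, here is a hedged Go sketch. The `--label=ssd.ssdN` / `--label=hdd.hddN` grouping follows the convention shown in the bcachefs documentation and is an assumption here, as are the `formatArgs` helper and the device paths. The sketch only builds the argument list and does not run anything.

```go
package main

import (
	"fmt"
	"strings"
)

// formatArgs builds the arguments for a "bcachefs format" call that puts
// SSDs in an "ssd" label group and HDDs in an "hdd" label group, then
// points the three targets at those groups. Hypothetical helper.
func formatArgs(ssds, hdds []string) []string {
	args := []string{"format"}
	for i, dev := range ssds {
		args = append(args, fmt.Sprintf("--label=ssd.ssd%d", i+1), dev)
	}
	for i, dev := range hdds {
		args = append(args, fmt.Sprintf("--label=hdd.hdd%d", i+1), dev)
	}
	return append(args,
		"--foreground_target=ssd",
		"--background_target=hdd",
		"--promote_target=ssd",
	)
}

func main() {
	// Device paths are placeholders.
	args := formatArgs([]string{"/dev/sda"}, []string{"/dev/sdb", "/dev/sdc"})
	fmt.Println("bcachefs " + strings.Join(args, " "))
}
```

Keeping the argument construction separate from execution makes it easy to log or unit test the exact command before touching any disk.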

quota management

  • no limit/quota for now

the language (Rust or Go)

Rust is the way to go, but the prototype can be built using Go

iwanbk, Aug 15 '24 05:08

mkfs.bcachefs also has these options, worth checking

--usrquota              Enable user quotas
--grpquota              Enable group quotas
--prjquota              Enable project quotas

iwanbk, Aug 15 '24 11:08

There was drama on LKML about bcachefs https://www.phoronix.com/news/Bcachefs-Fixes-Two-Choices.

Or "take your toy and go home" effectively alluding to taking it out of the mainline Linux kernel and go back to developing it out-of-tree.

The risk is that bcachefs could be removed from the mainline kernel, so we wait and see for now.

iwanbk, Oct 28 '24 08:10