Switch default pool from LVM to BTRFS-Reflink
The problem you're addressing (if any)
In R4.0, the default install uses LVM thin pools. However, LVM appears to be optimized for servers, which results in several shortcomings:
- Space exhaustion is handled poorly, requiring manual recovery. This recovery may sometimes fail.
- It is not possible to shrink a thin pool.
- Thin pools slow down system startup and shutdown.
Additionally, LVM thin pools do not support checksums. Checksumming can be layered on via dm-integrity, but dm-integrity does not support TRIM.
Describe the solution you'd like
I propose that R4.3 use BTRFS+reflinks by default. This is a proposal ― it is by no means finalized.
Where is the value to a user, and who might that user be?
BTRFS has checksums by default, and has full support for TRIM. It is also possible to shrink a BTRFS pool without a full backup+restore. BTRFS does not slow down system startup and shutdown, and does not corrupt data if metadata space is exhausted.
When combined with LUKS, BTRFS checksumming provides authentication: it is not possible to tamper with the on-disk data (except by rolling back to a previous version) without invalidating the checksum. Therefore, this is a first step towards untrusted storage domains. Furthermore, BTRFS is the default in Fedora 33 and openSUSE.
Finally, with BTRFS, VM images are just ordinary disk files, and the storage pool is the same as the dom0 filesystem. This means that issues like #6297 are impossible.
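As a rough illustration (an untested sketch; the paths below are only example locations, not necessarily the real layout of a given install), a reflink-capable filesystem lets a VM image be cloned or renamed like any ordinary file:

```sh
# Copy-on-write clone of a VM image: completes near-instantly, no data duplicated up front.
cp --reflink=always /var/lib/qubes/appvms/work/private.img /tmp/private-clone.img

# Renaming a volume is just a file rename, no copy needed (cf. #3230 below).
mv /var/lib/qubes/appvms/work/private.img /var/lib/qubes/appvms/work/private-old.img
```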
Describe alternatives you've considered
None that are currently practical. bcachefs and ZFS are long-term potential alternatives, but the latter would need to be distributed as source and the former is not production-ready yet.
Additional context
I have had to recover manually from LVM thin pool problems (failure to activate, IIRC) on more than one occasion. Additionally, the only supported interface to LVM is the CLI, which is rather clumsy. The LVM pool requires nearly twice as much code as the BTRFS pool, for example.
Relevant documentation you've consulted
man lvm
Related, non-duplicate issues
- #5053
- #6297
- #6184
- #3244 (really a kernel bug)
- #5826
- #3230 ― since reflink files are ordinary disk files, we could just rename them without needing a copy
- #3964
- everything in https://github.com/QubesOS/qubes-issues/search?q=lvm+thin+pool&state=open&type=issues
Most recent benchmarks: https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1689640103
It might be a good idea to compare performance (seq read, rand read, allocation, overwrite, discard) between the three backends. See: #3639
With regard to VM boot time, the LVM storage pool was slightly faster than BTRFS, but this may still be within the margin of error (LVM: 7.43 s versus BTRFS: 8.15 s for starting a debian-10-minimal VM).
Marking as RFC because this is by no means finalized.
@DemiMarie Following your comment, I'm posting my deconstructed thoughts here.
I have no problem with Qubes OS searching for the best filesystem to switch to for the 4.1 release, or with questioning the partition scheme, but I'm a bit lost on the direction of Qubes OS 4.1 and the goals here (stability? performance? backups? portability? security?).
I was initially against giving dom0 a separate LVM pool because of the space constraints resulting from that change, but I agreed and accepted that pool metadata exhaustion was a real, tangible issue that had hit me many times before, one whose recovery is sketchy and still not correctly advertised in the widget to users who simply upgrade and then get hit by it.
The fix in new installs resolved the issue, since Qubes OS decided to split the dom0 pool out of the main pool, so fixing pool issues on the system would be easier for the end user, or unnecessary altogether.
I am just not sure why switching filesystems is the priority now, when LVM thin provisioning seems to fit the goal, but I am willing to hear more about the advantages.
I am interested in the reasoning for such a switch, and in the probability of it happening, since I am really interested in pushing wyng-backups further, inside/outside of Heads and inside/outside of Qubes OS, and in grant/self-funding the work so that Qubes OS metadata would be included in wyng-backups, permitting restore/verification/fresh deployment/revert from a local (OEM recovery VM) or remote source, just applying diffs where required from an SSH remote read-only mountpoint.
This filesystem choice seems less relevant than whatever lets those changes consume the dom0 LVM, which should be kept out of dom0 so that dm-verity can be set up under Heads/Safeboot. But this is irrelevant to this ticket.
I am just not sure why switching filesystems is the priority now, when LVM thin provisioning seems to fit the goal, but I am willing to hear more about the advantages.
The advantages are listed above. In short, a BTRFS pool is more flexible, and it offers possibilities (such as whole-system snapshots) that I do not believe are possible with LVM thin provisioning. BTRFS also offers flexible quotas, and can always recover from out of space conditions provided that a small amount of additional storage (such as a spare partition set aside for the purpose) is available. Furthermore, BTRFS checksumming and scrubbing appear to be useful. Finally, new storage can be added to and removed from a BTRFS pool at any time, and the pool can be shrunk as well.
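As a sketch of what that recovery and resizing look like in practice (the device name and sizes are placeholders, and this assumes the pool is the root filesystem mounted at /):

```sh
# Out-of-space recovery: temporarily add a spare partition to regain working room.
btrfs device add /dev/sdX2 /
btrfs balance start -dusage=50 /   # compact partially filled data chunks
btrfs device remove /dev/sdX2 /    # hand the spare partition back once space is freed

# Shrinking the pool is also possible online:
btrfs filesystem resize -20G /
```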
BTRFS also has disadvantages: its throughput is worse than LVM, and there are reports of bad performance on I/O heavy workloads such as QubesOS. Benchmarks and user feedback will be needed to determine which is better, which is why this is an RFC.
I am interested in the reasoning for such a switch, and in the probability of it happening, since I am really interested in pushing wyng-backups further, inside/outside of Heads and inside/outside of Qubes OS, and in grant/self-funding the work so that Qubes OS metadata would be included in wyng-backups, permitting restore/verification/fresh deployment/revert from a local (OEM recovery VM) or remote source, just applying diffs where required from an SSH remote read-only mountpoint.
I believe that btrfs send and btrfs receive offer the same functionality as wyng-backups, but I am not certain, as I have never used either. As for the probability: this is currently only a proposal, and I am honestly not sure if switching this close to the R4.1 release date is a good idea. In any case, LVM will continue to be fully supported ― this just flips the default in the installer.
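For reference, the send/receive workflow would look roughly like this (a sketch only; it assumes /var/lib/qubes is a Btrfs subvolume and /mnt/backup is another Btrfs filesystem):

```sh
# Full backup: take a read-only snapshot, then stream it to the backup filesystem.
btrfs subvolume snapshot -r /var/lib/qubes /var/lib/qubes-backup-1
btrfs send /var/lib/qubes-backup-1 | btrfs receive /mnt/backup/

# Incremental backup: send only the delta against the previous snapshot.
btrfs subvolume snapshot -r /var/lib/qubes /var/lib/qubes-backup-2
btrfs send -p /var/lib/qubes-backup-1 /var/lib/qubes-backup-2 | btrfs receive /mnt/backup/
```

The obvious catch, as noted in the next comment, is that the receiving end must itself be Btrfs with admin privileges.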
@DemiMarie There are many questions swirling around advanced storage on Linux, but I think the main ones applicable here are about reliability and performance. Btrfs and Thin LVM appear to offer trade-offs on those qualities, and I don't think it's necessarily a good move to switch the Qubes default to a slower storage scheme at this point; storage speed is critical for Qubes' usability, and large disk image files with random write patterns are Btrfs' weakest point.
Running out of space is probably Thin LVM's weakest point, although this can be pretty easily avoided. For one, dom0 root is moving to a dedicated pool in R4.1, which will keep admin working in most situations. Adding more protections to the domU pool can also be done with some pretty simple userland code. (For those who are skeptical, note that this is the general approach taken by Stratis.)
The above-mentioned Btrfs checksums are a nice-to-have feature against accidental damage, but they unfortunately do not come close to providing authentication. To my knowledge, no CRC mode can do that, even if it's encrypted. Any attacker able to induce some calculated change in an encrypted volume would probably find the malleability of encrypted CRCs to be little or no obstacle. IMHO, the authentication aspect of the proposal is a non-starter. (BTW, it looks like dm-integrity may be able to do this now, along with discard support, if its journal mode supports internal tags.)
As for backups, Wyng basically exists because tools like btrfs send are constrained to using the same back end (Btrfs with admin privileges) which severely narrows the user's options for backup destinations. Wyng can also be adapted to any storage source that can create snapshots and report their deltas (Btrfs included).
The storage field also continues to evolve in interesting ways: Red Hat is creating Stratis while hardware manufacturers implemented NVMe objects and enhanced parallelism. Stratis appears to be based on none other than Thin LVM's main components (dm-thin, etc) in addition to dm-integrity, with XFS on top; all the layers are tied together to respond cohesively from a single management interface. This is being developed to avoid Btrfs maintenance and performance pitfalls.
I think some examination of Btrfs development culture may also be in order, as it has driven Red Hat to exasperation and a decision to drop Btrfs. I'm not sure just what it is about accepting Btrfs patches that presents a problem, but it makes me concerned that too much trust has been eroded and that Btrfs may become a casualty in 'storage wars' between an IBM / Red Hat camp and what I'd call an Oracle-centric camp.
FWIW, I was one of the first users to show how Qubes could take advantage of Btrfs reflinks for cloning and to request specific reflink support. Back in 2014, it was easy to assume Btrfs shortcomings would be addressed fairly soon, since those issues were so obvious. Yet they are still unresolved today.
My advice at this point is to wait and see – and experiment. There is an unfortunate dearth of comparison tests configured in a way that makes sense; they usually compare Btrfs to bare Ext4, for example, and almost always overlook LVM thin pools. So it's mostly apples vs. oranges. However, what little benchmarking I've seen of thin LVM suggests a performance advantage over Btrfs that would be too large to ignore. There are also Btrfs modes of use we should explore, such as any performance gain from disabling CoW on disk images; if this were deemed desirable, then the Qubes Btrfs driver would have to be refactored to use subvolume snapshots instead of reflinks. An XFS reflink comparison on Qubes would also be very interesting!
@DemiMarie There are many questions swirling around advanced storage on Linux, but I think the main ones applicable here are about reliability and performance. Btrfs and Thin LVM appear to offer trade-offs on those qualities, and I don't think it's necessarily a good move to switch the Qubes default to a slower storage scheme at this point; storage speed is critical for Qubes' usability, and large disk image files with random write patterns are Btrfs' weakest point.
In retrospect, I agree. That said (as you yourself mention below) XFS also supports reflinks and lacks this problem.
Running out of space is probably Thin LVM's weakest point, although this can be pretty easily avoided. For one, dom0 root is moving to a dedicated pool in R4.1, which will keep admin working in most situations. Adding more protections to the domU pool can also be done with some pretty simple userland code. (For those who are skeptical, note that this is the general approach taken by Stratis.)
Will it be possible to reserve space for use by discards? A user needs to be able to free up space even if they make a mistake and let the pool fill up.
The above-mentioned Btrfs checksums are a nice-to-have feature against accidental damage, but they unfortunately do not come close to providing authentication. To my knowledge, no CRC mode can do that, even if it's encrypted. Any attacker able to induce some calculated change in an encrypted volume would probably find the malleability of encrypted CRCs to be little or no obstacle. IMHO, the authentication aspect of the proposal is a non-starter. (BTW, it looks like dm-integrity may be able to do this now, along with discard support, if its journal mode supports internal tags.)
The way XTS works is that any change (by an attacker who does not have the key) will completely scramble a 128-bit block; my understanding is that a CRC32 with a scrambled block will only pass with probability 2⁻³². That said, BTRFS also supports Blake2b and SHA256, which would be better choices.
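For what it's worth, the checksum algorithm is selected at mkfs time and cannot be changed in place afterwards. If I recall the btrfs-progs option correctly (added around version 5.5), it would look like this (the device path is a placeholder):

```sh
# Format with a cryptographic checksum instead of the default crc32c.
mkfs.btrfs --csum blake2 /dev/mapper/luks-example

# Verify all data and metadata checksums on the mounted filesystem.
btrfs scrub start -B /
```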
As for backups, Wyng basically exists because tools like btrfs send are constrained to using the same back end (Btrfs with admin privileges) which severely narrows the user's options for backup destinations. Wyng can also be adapted to any storage source that can create snapshots and report their deltas (Btrfs included).
Good to know, thanks!
The storage field also continues to evolve in interesting ways: Red Hat is creating Stratis while hardware manufacturers implemented NVMe objects and enhanced parallelism. Stratis appears to be based on none other than Thin LVM's main components (dm-thin, etc) in addition to dm-integrity, with XFS on top; all the layers are tied together to respond cohesively from a single management interface. This is being developed to avoid Btrfs maintenance and performance pitfalls.
I think some examination of Btrfs development culture may also be in order, as it has driven Red Hat to exasperation and a decision to drop Btrfs. I'm not sure just what it is about accepting Btrfs patches that presents a problem, but it makes me concerned that too much trust has been eroded and that Btrfs may become a casualty in 'storage wars' between an IBM / Red Hat camp and what I'd call an Oracle-centric camp.
My understanding (which admittedly comes from a comment on Y Combinator) is that BTRFS moves too fast to be used in RHEL. RHEL is stuck on one kernel for an entire release, and rebasing BTRFS every release became too difficult, especially since Red Hat has no BTRFS developers.
FWIW, I was one of the first users to show how Qubes could take advantage of Btrfs reflinks for cloning and to request specific reflink support. Back in 2014, it was easy to assume Btrfs shortcomings would be addressed fairly soon, since those issues were so obvious. Yet they are still unresolved today.
My advice at this point is to wait and see – and experiment. There is an unfortunate dearth of comparison tests configured in a way that makes sense; they usually compare Btrfs to bare Ext4, for example, and almost always overlook LVM thin pools. So it's mostly apples vs. oranges. However, what little benchmarking I've seen of thin LVM suggests a performance advantage over Btrfs that would be too large to ignore. There are also Btrfs modes of use we should explore, such as any performance gain from disabling CoW on disk images; if this were deemed desirable, then the Qubes Btrfs driver would have to be refactored to use subvolume snapshots instead of reflinks. An XFS reflink comparison on Qubes would also be very interesting!
That it would be, especially when combined with Stratis. The other major problem with LVM2 (and possibly dm-thin) seems to be snapshot and discard speeds; I expect XFS reflinks to mitigate most of those problems.
Ah, new Btrfs feature... Great! I'd consider enabling one of its hashing modes as being able to support authentication.
I'd still consider the Stratis concept to be more interesting for now, as Qubes' current volume management is pretty similar but potentially even better and simpler due to having a privileged VM environment.
Ah, new Btrfs feature... Great! I'd consider enabling one of its hashing modes as being able to support authentication.
Agreed. While I am not aware of any way to tamper with a LUKS partition without invalidating a CRC, Blake2b is by far the better choice.
I'd still consider the Stratis concept to be more interesting for now, as Qubes' current volume management is pretty similar but potentially even better and simpler due to having a privileged VM environment.
I agree, with one caveat: my understanding is that LUKS/AES-XTS-512 + BTRFS/Blake2b-256 is sufficient to protect against even malicious block devices, whereas dm-integrity is not. dm-integrity is vulnerable to a partial rollback attack: it is possible to roll back parts of the disk without dm-integrity detecting it. Therefore, dm-integrity is not (currently) sufficient for use with untrusted storage domains, which is a future goal of QubesOS.
@tasket: what are your thoughts on using loop devices? That’s my biggest worry regarding XFS+reflinks, which seems to otherwise be a very good choice for QubesOS. Other approaches exist, of course; for instance, we could modify blkback to handle regular files as well as block devices.
I really wish the FS's name wasn't a misogynistic slur. That aside, my only experience with it, under 4.0, ended with my Qubes installation becoming unbootable, and I found it very difficult to fix, relative to a system built on LVM. That does strike me as relevant to the question of whether Qubes should switch, and imo it is only partly addressable by improving the documentation (since the other part is the software we have to use to restore).
FS's name wasn't a misogynistic slur
@0spinboson would you mind clarifying which filesystem you are referring to?
Will it be possible to reserve space for use by discards? A user needs to be able to free up space even if they make a mistake and let the pool fill up.
Yes, it's simple to allocate some space in a pool using a non-zero thin LV. Just reserve the LV name in the system, make it inactive, and check that it exists on startup.
Further, it would be easy to use existing space-monitoring components to also pause any VMs associated with a nearly-full pool and then show an alert dialog to the user.
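Something like this, I imagine (an untested sketch; the VG/pool names are placeholders and differ between releases):

```sh
# Reserve space: create a small thin LV and force real extent allocation by writing to it.
lvcreate -T qubes_dom0/vm-pool -V 5G -n emergency-reserve
dd if=/dev/urandom of=/dev/qubes_dom0/emergency-reserve bs=1M status=progress
lvchange -an qubes_dom0/emergency-reserve   # keep it inactive until needed

# If the pool ever fills up, dropping the reserve hands its extents back:
lvremove qubes_dom0/emergency-reserve
```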
it is possible to roll back parts of the disk without dm-integrity detecting it.
I thought the journal mode would prevent that? I don't know it in detail, but something like a hash of the hashes of the last changed blocks, computed with the prior journal entry, would have to be in each journal entry.
what are your thoughts on using loop devices? That’s my biggest worry regarding XFS+reflinks
I forgot they were a factor... it's been so long since I've used Qubes in a file-backed mode. But this should be the same for Btrfs, I think.
FWIW, the XFS reflink suggestion was more speculative, along the lines of "What if we benchmark it for accessing disk images and it's almost as fast as thin LVM?". The regular XFS vs Ext4 benchmarks I'm seeing suggest it might be possible. It's also not aligned with the Stratis concept, as that is closer to thin LVM with XFS just providing the top layer. (Obviously we can't use Stratis itself unless it supports a mode that accounts for the top layer being controlled by domUs.)
Also FWIW: XFS historically supported a 'subvolume' feature for accessing disk image files instead of loopdevs. It requires that certain I/O scheduler conditions are met before it can be enabled.
FS's name wasn't a misogynistic slur
@0spinboson would you mind clarifying which filesystem you are referring to?
'Butterface' was intentional, afaik.
No, it was not. The file system is named btrfs because it means B-tree FS. That the name is often pronounced as a hilarious word may or may not be seen as a pun, but that is in the eye of the beholder.
Basic question: If I install R4.1 with BTRFS by selecting custom, and then using Anaconda to automatically create the Qubes partitions with BTRFS, is that sufficient for the default pool to use BTRFS-Reflink? Or do I have to do something extra for the "Reflink" part?
If I install R4.1 with BTRFS by selecting custom, and then using Anaconda to automatically create the Qubes partitions with BTRFS, is that sufficient for the default pool to use BTRFS-Reflink?
Yes
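If in doubt, you can check the active pool driver after installation (the exact invocation and output differ slightly between releases; on R4.0 the listing is `qvm-pool -l`):

```sh
qvm-pool                  # the varlibqubes pool should list its driver as file-reflink
qubes-prefs default_pool  # should name the reflink-backed pool
```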
I don't know if there's a separate issue for this, but the possible Btrfs + fscrypt integration in Fedora seems relevant here:
The system by default will be encrypted with an encryption key stored in the TPM and bound to the signatures used to sign the bootloader/kernel/initrd, https://lists.fedoraproject.org/archives/list/[email protected]/thread/LYUABL7F2DENO7YIYVD6TRYLWBMF2CFI/
Seconded!
I don't know if there's a separate issue for this, but the possible Btrfs + fscrypt integration in Fedora seems relevant here:
The system by default will be encrypted with an encryption key stored in the TPM and bound to the signatures used to sign the bootloader/kernel/initrd
[If a separate ticket/forum discussion is opened for this, I will move this comment there.]
FWIW, I would want to avoid requiring TPM-stored data encryption keys by default, as it ties the user's data to system hardware that can fail.
The approach does make some sense in an enterprise setting, primarily ensuring that data on storage devices separated from the machine with the TPM is provably unrecoverable during reuse/e-cycling. Business data is often backed up off enterprise endpoints by tools such as OneDrive (e.g. often subsuming the Documents folder on Windows in recent deployments), so the hardware failure risk is usually mitigated via backups by default.
For non-enterprise users, esp. the user audience for Qubes, there should be flexibility in how keys are handled. Storage-subsystem-detached keys are useful for some, but the user must make the choices on privacy/security vs data-loss risk.
May also explain why the default metadata volume sizes seem insufficient.
Aside from performance, I'd have to say my experience with Btrfs has been more stable than with tLVM. When something does go wrong with Btrfs, it's easier to diagnose and recover.
There is also the problem of extra wear from write-amplification. When I look at stats other people have posted for the nvme models I'm using, I'm seeing much higher rates of wear-out on my drives (that have had Qubes on tLVM). Compared to people reporting a similar amount of lifetime read-access GB, my own drives are seeing > 3X the wear-indicator values.
Edit: FWIW, any dynamic thin-provisioning system will have to do most of the allocation work that a full filesystem does. tLVM would have turned out better had they started with a filesystem model (like Ext4) and removed the bits that a volume manager didn't use.
As far as I understand, there is currently no code for BTRFS support? Or can we just use the file-image-based code and patch it to use cp --reflink?
@kalkin There's a newer file-reflink storage driver that's automatically used for the non-default Btrfs installation layout since R4.0.1.
EDIT: answer from Marek under https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1689640103 :
In any case, it's way too late for this for R4.2. We may re-test, and re-consider for R4.3.
@andrewdavidwong Does "Release TBD" mean not planned for 4.2? This ticket title should be updated for 4.3, and performance comparisons for default installs should be taken into consideration as well. It is worth noting that most OSes are moving away from TLVM toward XFS/BTRFS.
Some history:
- Fedora 33 switched to BTRFS partitioning by default for workstations, having swap and rootfs in different LUKS containers, and BTRFS is still the default as of Fedora 38.
- Ubuntu Lunar Lobster followed suit and decided to switch to BTRFS as well.
- openSUSE followed suit as well.
- Debian still relies on thin LVM.
Also, ext4 has a fixed number of inodes set when the partition is formatted, as opposed to XFS/BTRFS where inodes are dynamic; since Qubes extends the filesystem, this causes issues for some users.
I am interested in knowing what Heads should support in the future, for space-constraint reasons and to prepare for changes. Also, Qubes is a first-class citizen, but not the only OS deployed. I was wondering what the direction of Qubes OS is, considering that BTRFS has been the default since Fedora 33.
cross-posts linking to each other:
- https://forum.qubes-os.org/t/btrfs-and-qubes-os/6967/28?u=insurgo
- https://github.com/osresearch/heads/issues/1474
- Speed comparison of LUKS+TLVM+EXT4 vs LUKS+BTRFS: https://forum.qubes-os.org/t/ext4-vs-btrfs-performance-on-qubes-os-installs/13585
@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.
@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.
@DemiMarie Are those documented? They should be referenced here. From my experience, the benefits definitely outweigh TLVM's. Care to share some examples of I/O-intensive workloads?
@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.
@DemiMarie Are those documented? They should be referenced here. From my experience, the benefits definitely outweigh TLVM's. Care to share some examples of I/O-intensive workloads?
@marmarek did the benchmarks. IIRC he found that BTRFS and XFS were not any faster than thin LVM in a workload (Qubes OS openQA tests, IIRC) that should have favored them.