thin-provisioning-tools
Support defragmenting and shrinking thin pools
I would like to be able to defragment and shrink thin pools. This would be very useful for desktop use-cases such as Qubes OS. This should work even if the thin pool is overprovisioned, so long as there is enough space available.
There's an ongoing feature, thin_shrink, written in Rust, which remaps data blocks located beyond the new size into free blocks below it. Not sure if that's what you're looking for.
The tool produces an intermediate XML, so an extra thin_restore is required to produce the final metadata.
# thin_dump /dev/vg1/tp1_tmeta -o tmeta.xml
# thin_shrink --input tmeta.xml --data /dev/vg1/tp1_tdata --nr-blocks 1024 --output new.xml
# thin_restore -i new.xml -o /dev/vg1/new_tmeta
Further optimizations will be made, e.g., support shared mappings.
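For anyone following along, here is a rough sketch of how the shrunk metadata might then be swapped back into an LVM thin pool. The LV names are carried over from the example above, the pool must be inactive, and this relies on the metadata-swap form of lvconvert described in lvmthin(7); treat it as an outline rather than a tested recipe (it also does not cover reducing the data LV itself):
# lvchange -an vg1/tp1                                        # deactivate the pool before touching its metadata
# lvconvert --thinpool vg1/tp1 --poolmetadata vg1/new_tmeta   # swap the shrunk metadata into the pool
# lvchange -ay vg1/tp1                                        # reactivate and check that the pool comes up cleanly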
Will this also handle moving the data blocks to their new locations in a crash-safe way?
Yes, the tool works offline and the data is moved to unused blocks, so the original data is untouched. You'll need to start over again if the program is terminated unexpectedly.
Nice! A thin_defrag tool that actually made the original data contiguous would be nice, but I suspect making it safe would be much more work.
Defragmentation might involve multiple transactions if a destination block is occupied, which is beyond the scope of the tools. Also, defragmenting might not be that useful due to the copy-on-write nature of dm-thin. Would you want to defragment the volumes for performance reasons?
Defragmentation might involve multiple transactions if a destination block is occupied, which is beyond the scope of the tools.
This is somewhat disappointing. I was hoping for a tool that automated the entire process, including as many transactions as are needed. This is slightly riskier in that there is no way for a human to review the intermediate metadata, but in practice that ability is only useful for debugging. A fully automated process is a much better user experience and is more consistent with filesystem tools. It is also necessary for integration into lvm pvresize.
Also, defragmenting might not be that useful due to the copy-on-write nature of dm-thin. Would you want to defragment the volumes for performance reasons?
Yes, especially on spinning drives. Right now dm-thin on spinning rust has intolerably poor performance, even with a battery-backed RAID controller.
As an aside, one area where I would like to understand dm-thin is its performance in allocation-heavy workloads, such as VM image building. In Qubes OS, by default, almost every volume has at least one snapshot or is newly created, which means that the only writes that do not break CoW are those made to blocks that were already written at least once since boot. When building VM images, I believe this is a minority of writes. Therefore, being able to quickly allocate blocks is incredibly important, and I am not sure how well dm-thin currently handles this.
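(For concreteness, a hypothetical way to compare allocation-heavy writes against overwrites on a thin volume; the volume name, sizes and fio options are assumptions, and the test destroys the volume's contents:)
# lvcreate -V 10G -T vg1/tp1 -n benchvol                                                                             # fresh, fully unprovisioned thin volume
# fio --name=alloc --filename=/dev/vg1/benchvol --rw=write --bs=64k --direct=1 --ioengine=libaio --iodepth=16 --size=8G      # first pass: every write provisions a new block
# fio --name=overwrite --filename=/dev/vg1/benchvol --rw=write --bs=64k --direct=1 --ioengine=libaio --iodepth=16 --size=8G  # second pass: blocks already provisioned
The gap between the two runs gives a rough sense of the allocation overhead described above.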
The big weakness of dm-thin is its fixed block size. This leads to write amplification when provisioning new blocks if block zeroing is turned on, or when breaking sharing for snapshots when the I/O is smaller than a block.
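As a concrete illustration (pool and device names are assumptions), the fixed block size in question is the pool's chunk size, which can be inspected like this:
# lvs -o lv_name,chunk_size vg1/tp1    # lvm reports the thin-pool chunk size
# dmsetup table | grep thin-pool       # the thin-pool target line also carries the data block size, in 512-byte sectors
With zeroing enabled, a 4 KiB write into an unprovisioned region still costs a whole chunk of zeroing, which is the write amplification mentioned above.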
To that end we're currently working on a new iteration of thin that uses variable size blocks and will perform much better. I'm also not considering enhancements to the current kernel driver at this point. E.g., live defrag would require a way of punching in mappings from userspace.
Defrag is a difficult problem. If you have several snapshots sharing data blocks, then it's impossible to have them all contiguous. So a defrag tool will need more input from the admin, e.g. "prioritise this snapshot". Given the amount of copying that defrag involves, I do think that dm-archive will be able to provide a good solution. E.g., restoring to an empty thin will result in large contiguous runs, while restoring into a snapshot will result in data sharing where possible. By selecting your restoration order you can define your priorities.
To that end we're currently working on a new iteration of thin that uses variable size blocks and will perform much better.
Would it be possible to also fix the 2^24 limit on the total number of snapshots over the life of a pool?
Yep
Will it be possible to migrate old pools to the new format? Also, I must say that being able to use a 256-bit hash as the thin device ID would make userspace dev’s lives much simpler.
To that end we're currently working on a new iteration of thin that uses variable size blocks and will perform much better. I'm also not considering enhancements to the current kernel driver at this point. E.g., live defrag would require a way of punching in mappings from userspace.
Is the source to this available anywhere?
no
no
Do you have any idea when it will be made available (e.g. on dm-devel)?
It needs about 3 months more work I estimate. But other projects have priority atm, so I doubt I'll get back to it until the autumn.
Which of the following (if any) do you plan to support in dm-thin v2?
- User-provided per-thin metadata
- Being able to assign names to thin volumes, instead of numbers
- Migration of existing dm-thin v1 pools
- Defragmentation (can be userspace-driven; online is preferable)
- Deduplication (ditto)
- Shrinking
- Merging of external-origin snapshots
- Per-thin space reservation a la ZFS zvols
- This is extremely useful for desktop use-cases, such as Qubes OS, where expandable storage is a rarity. It is also a highly-requested feature on [email protected].
- Fast, reliable discards
- Automatic discard of deleted volumes
Some features that are admittedly a stretch, but would make dm-thin v2 more competitive against ZFS and bcachefs:
- Data checksums (stored in metadata)
- Per-thin encryption of newly written data
- Always-CoW mode (necessary for checksums and for zoned device support)
To be clear: this is a big list and I do not expect you to add support for every one of these features, especially in the initial patchset. That said, I would like to know for planning purposes.
Which of the following (if any) do you plan to support in dm-thin v2?
- User-provided per-thin metadata
No
- Being able to assign names to thin volumes, instead of numbers
No
- Migration of existing dm-thin v1 pools
Yes
- Defragmentation (can be userspace-driven; online is preferable)
Yes
- Deduplication (ditto)
No, live dedup is silly.
- Shrinking
Maybe. We already have offline shrinking. Solving defrag probably solves this too.
- Merging of external-origin snapshots
I think layering the dm-mirror target on top of the external origin and the thin snapshot, and marking the external origin as out of sync, would already solve this.
- Per-thin space reservation a la ZFS zvols
- This is extremely useful for desktop use-cases, such as Qubes OS, where expandable storage is a rarity. It is also a highly-requested feature on @.***
We could do this easily. I need to know exactly what your requirements are here though.
- Fast, reliable discards
We aren't using fixed block sizes any more, which means discards will always free up some space (whereas discards of partial blocks didn't).
- Automatic discard of deleted volumes
Why can't you discard the volume before you delete it? Discarding a whole volume is a long process, I'd much rather this was done by a userland process rather than a kernel task.
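A minimal sketch of that userland sequence, with made-up LV names (the volume must be active while it is discarded):
# blkdiscard /dev/vg1/somethinlv     # release the volume's blocks back to the pool from userspace
# lvremove vg1/somethinlv            # then delete the now-empty volume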
Some features that are admittedly a stretch, but would make dm-thin v2 more competitive against ZFS and bcachefs:
- Data checksums (stored in metadata)
Layer dm-veritas?
- Per-thin encryption of newly written data
Layer dm-crypt?
- Always-CoW mode (necessary for checksums and for zoned device support)
possibly, this has come up as a possible way to do defrag.
Which of the following (if any) do you plan to support in dm-thin v2?
- User-provided per-thin metadata
No
Understandable, if disappointing. Being able to add even just 32 bytes would make a thinsetup tool practical.
- Being able to assign names to thin volumes, instead of numbers
No
- Migration of existing dm-thin v1 pools
Yes
Nice!
- Defragmentation (can be userspace-driven; online is preferable)
Yes
Nice!
- Deduplication (ditto)
No, live dedup is silly.
To elaborate: I am mostly referring to deduplication of not-in-use volumes on an in-use pool. A user might have 10 different VM images with very similar content, and wish to deduplicate them without having to take the entire pool offline.
- Shrinking
Maybe. We already have offline shrinking. Solving defrag probably solves this too.
I agree; shrinking is basically a special case of defragmentation.
- Merging of external-origin snapshots
I think layering the dm-mirror target on top of the external origin and the thin snapshot, and marking the external origin as out of sync, would already solve this.
That makes this a documentation issue and possibly an lvm2 issue.
- Per-thin space reservation a la ZFS zvols
- This is extremely useful for desktop use-cases, such as Qubes OS, where expandable storage is a rarity. It is also a highly-requested feature on [email protected].
We could do this easily. I need to know exactly what your requirements are here though.
In ZFS, each zvol has an associated reservation, which represents the amount of space that the volume is guaranteed to be able to store. By default, a writable volume’s reservation is the entire size of the volume, while a read-only snapshot only reserves the space it is currently using. Volumes are not allowed to encroach on the reserved space of other volumes, and ZFS will fail writes with -ENOSPC rather than violate this guarantee.
This could be used to ensure that e.g. the root filesystem and the one with user home directories are thickly provisioned, while VM volumes on the same pool are thinly provisioned.
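To make the ZFS semantics above concrete (pool and volume names are made up):
# zfs create -V 20G tank/root          # ordinary zvol: refreservation defaults to the full 20G
# zfs create -s -V 20G tank/vm1        # sparse ("thin") zvol: no reservation, writes may fail with ENOSPC
# zfs set refreservation=5G tank/vm1   # guarantee at least 5G of space for this volume
Something equivalent for dm-thin would mean the pool refuses to hand blocks reserved for one thin device to any other device.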
- Fast, reliable discards
We aren't using fixed block sizes any more, which means discards will always free up some space (whereas discards of partial blocks didn't).
That is definitely a win! Will discards still block other I/O on the pool? Will they be faster than they are right now?
- Automatic discard of deleted volumes
Why can't you discard the volume before you delete it? Discarding a whole volume is a long process, I'd much rather this was done by a userland process rather than a kernel task.
Qubes OS already tries to do this, but it is slow and requires userspace to keep additional state (whether discard completed successfully).
Some features that are admittedly a stretch, but would make dm-thin v2 more competitive against ZFS and bcachefs:
- Data checksums (stored in metadata)
Layer dm-veritas?
If you mean dm-verity, that is read-only. dm-X is read/write, but it is still out of tree. dm-integrity is in-tree, but it is slow (due to write journaling; the bitmap mode provides weaker protection) and does not protect against e.g. replay attacks. Furthermore, all of these involve additional metadata that is redundant with dm-thin’s own metadata.
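For reference, the layering being discussed would look roughly like this (device name is an assumption; LUKS2 authenticated encryption stacks dm-crypt over dm-integrity):
# cryptsetup luksFormat --type luks2 --cipher aes-xts-plain64 --integrity hmac-sha256 /dev/vg1/thinvol
# cryptsetup open /dev/vg1/thinvol securevol
It works, but as noted it keeps its own per-sector metadata on top of the metadata dm-thin already maintains.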
- Per-thin encryption of newly written data
Layer dm-crypt?
In my use-case (Qubes OS disposable VMs), volume X is a snapshot of volume Y. I need to encrypt data written to volume X, but still be able to read data that was inherited from volume Y. Therefore, choosing the correct decryption key and algorithm requires knowing if the data was inherited from volume Y or newly written to volume X. dm-crypt does not have this information.
- Always-CoW mode (necessary for checksums and for zoned device support)
possibly, this has come up as a possible way to do defrag.
Nice!
To elaborate regarding discard: I would like blkdiscard on a thin device to take time proportional to the number of unshared blocks, not the total number of provisioned blocks.
- Deduplication (ditto)
No, live dedup is silly.
To elaborate: I am mostly referring to deduplication of not-in-use volumes on an in-use pool. A user might have 10 different VM images with very similar content, and wish to deduplicate them without having to take the entire pool offline.
@DemiMarie maybe you should take a look at vdo: https://github.com/lvmteam/lvm2/blob/master/doc/vdo.md and https://github.com/dm-vdo/vdo. You can layer things; I tried it some time ago and it "worked", but I certainly don't recommend it for production, let alone critical systems. It will get real messy real fast if things go wrong.
I don't know the state of it nowadays, but it seems it got merged into the main lvm code.
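For the record, a rough sketch of the lvm2 integration mentioned above, with made-up names and sizes (see the linked vdo.md for the authoritative steps):
# lvcreate --type vdo -n dedupvol -L 100G -V 1T vg1/vdopool0   # 100G of physical space exposed as a 1T deduplicated volume
# mkfs.ext4 -E nodiscard /dev/vg1/dedupvol                     # skip the initial discard, which is slow on VDO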
will get real messy real fast if things go wrong.
In what ways?
will get real messy real fast if things go wrong.
In what ways?
You already know about the problems with fragmentation and where the "real" data lies on disk with thin pools; now add another layer that scrambles your data in its own way. You also know that when thin pools fill up they fail and corrupt data, and VDO devices can likewise get full while the virtual disk says they're not (I don't know the current behaviour here).
If the VDO goes full, thin-pool writes start to fail; if you don't act fast enough and the situation continues, data gets corrupted and you don't even know which data.
When I tried it, VDO was still a bit green (you needed to compile it for your kernel and all that) and didn't support layering thin pools over it, so what I did was create another PV inside the LV (you need to modify lvm.conf to allow using it; see the sketch below), add it to the same VG, and put the thin pool on that PV. I also added some cache layers on each side. In the end you can get into a system as complicated as 8 layers of "things" from disk to filesystem, a failure in any of which will hurt your data.
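(For reference, the lvm.conf change being described is presumably something along these lines; treat it as an assumption about this particular setup:)
devices {
    scan_lvs = 1    # allow LVM to scan LVs so a PV can be created on top of one
}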
I did have some problems and ended up filling the VDO device and corrupting data.
I haven't tried to mess with VDO since then; I'm just waiting for it to be merged into the mainline source and to get tested and more mature.
I'm not saying it's not a viable option, since it is intended to be layered onto, but I don't know what state of development it is in nowadays.
TL;DR: adding more layers (especially early-development ones) under the filesystem complicates things, adds CPU overhead, and adds another point of failure that you must monitor.
For power users that's a choice they can make, but I wouldn't make it a default option anywhere until it's more tested.
Sorry for the long reply.
you also know that when thin pools fill up they fail and corrupt data
Thin pools should not corrupt data, even if they get full. They may lose some writes that were not properly flushed, but that should not cause filesystem corruption.
What can happen, if I understand correctly, is a situation where a thin pool fills up and there is no way to recover without growing the thin pool.
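One common mitigation, sketched here from the stock lvm.conf options rather than anything Qubes-specific, is to let lvm2 grow the pool automatically before it fills (this requires dmeventd monitoring to be enabled):
activation {
    thin_pool_autoextend_threshold = 70   # start growing once the pool is 70% full
    thin_pool_autoextend_percent = 20     # grow by 20% of the current size each time
}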
My bad, I meant when the metadata gets full; as far as I know there was no separation of data and metadata on the VDO backing LV.
But yes, eventually, if pending I/O operations keep piling up, memory will fill and the system will possibly freeze, which gets us into a nasty state to recover from.
I'm no expert in LVM; I just wanted to let you know that there is already work on deduplication in LVM, so you can do your own research instead of taking my surely-outdated info for granted.
I found this repo and issue looking for a way to defragment thin pools, mostly to regain a bit of performance on HDD-backed ones; shrinking them would be awesome too. But my coding skills right now are nowhere near the minimum required to help with such projects, sadly.
I hope you can make something useful of what little info I presented.
In addition, last time I checked, vdo does not and will not support passing trim/discard to lower levels.
https://github.com/dm-vdo/kvdo/issues/12