FAST-Tracking REFLINK and Offline Deduplication, first for LINUX only

Open jittygitty opened this issue 2 years ago • 23 comments

This current feature proposal, as opposed to similar ones over the years, is focused on fast-tracking only LINUX support first.

"I don't always cp but when I do, I prefer REFLINKs" ( --the most interesting man alive and https://lwn.net/Articles/789623/ )

Previous efforts either tried to devise methods that would serve more than just Linux, or the discussions predate the newer Linux kernels, which now seem to provide easier ways to achieve this, at least on kernels that are not too archaic.

Of course these are really two feature requests, the most cried for "cp --reflink" feature and the "Offline Deduplication", so they could be in separate issues when serious implementation work starts, but for now I thought it best to include both since they're not that unrelated and could make sense to implement them concurrently.

cp REFLINK support would allow one to simply make multiple copies/clones of FILES (as opposed to datasets) without duplicating the space taken. If the reflink copies are later edited, the common data stays shared and only the changed portion takes new space. This feature has already been in Oracle ZFS for some time, as well as I think OCFS2 and more importantly XFS. (I believe Mac OS X has cp reflink at least on its APFS file system.) Perhaps XFS reflink support on Linux can be a model for similar reflink support with openZFS?
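For readers unfamiliar with what `cp --reflink=auto` does on filesystems that already support it, here is a minimal Python sketch. It assumes the FICLONE ioctl from `<linux/fs.h>`; the numeric constant is valid for the common ioctl encoding (x86, ARM) and the function name `reflink_or_copy` is made up for illustration. It is not how OpenZFS would implement anything, only what a caller sees from user space.

```python
import fcntl
import shutil

# FICLONE = _IOW(0x94, 9, int) from <linux/fs.h>.
# Assumption: the common ioctl number encoding (x86, ARM, RISC-V).
FICLONE = 0x40049409


def reflink_or_copy(src: str, dst: str) -> bool:
    """Clone src to dst, sharing blocks when the filesystem supports it.

    Returns True if the clone ioctl succeeded, False if we fell back to an
    ordinary byte-for-byte copy (roughly what `cp --reflink=auto` does).
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the filesystem to make dst share all of src's data blocks.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # Not supported here (or any other failure): plain copy.
            # dst is still empty and src is still at offset 0.
            shutil.copyfileobj(fsrc, fdst)
            return False
```

On a filesystem without clone support the call simply returns False and the destination is a normal, fully separate copy.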

https://blogs.oracle.com/linux/post/upcoming-xfs-work-in-linux-v48-v49-and-v410-by-darrick-wong https://blogs.oracle.com/linux/post/xfs-data-block-sharing-reflink

Seems BCACHEFS is working on this too: https://www.patreon.com/posts/status-update-23029978 (If anyone cares, also seems Microsoft's ReFS supports reflinks, and its offline deduplication is reported to beat the others in savings.)

These two features, REFLINK and Offline Dedupe have been described quite well already, one can look at previous discussions:

https://github.com/openzfs/zfs/search?q=reflink&type=issues https://github.com/openzfs/zfs/search?q=offline+deduplication&type=issues https://www.google.com/search?q=reflink+site%3Azfsonlinux.topicbox.com https://zfs-discuss.zfsonlinux.narkive.com/lbGriLhS/offline-dedup https://github.com/openzfs/zfs/issues/3020 https://github.com/openzfs/zfs/issues/3013 https://github.com/openzfs/zfs/issues/12555 https://github.com/openzfs/zfs/issues/10552 https://github.com/openzfs/zfs/issues/405 https://www.google.com/search?q=zfs+reflink+site%3Aphoronix.com https://www.google.com/search?q=zfs+reflink https://www.google.com/search?q=zfs+offline+dedup

How will these two features, cp --reflink and Offline Deduplication, improve OpenZFS?

Again, for discussions on how these features improve OpenZFS there are plenty of other threads all over the place, from mailing lists to discussions on here. I think at this point we are (or should be) past the point of proving their worthiness to be implemented. In fact, it's almost guaranteed that even those opposed or unconvinced of the need/usefulness "will" use these features when they finally arrive.

ORACLE ZFS REFLINK: https://blogs.oracle.com/solaris/post/reflink3c-what-is-it-why-do-i-care-and-how-can-i-use-it


Although unfortunately my C programming is nowhere close to the sophistication needed to implement this, I have been contemplating funding BOUNTIES and/or possibly looking to hire some programmers for at least initial PROOF of Concept.

Now what is the BEST way to go about this on LINUX? Here I would appreciate some feedback on:

Support for FIDEDUPERANGE https://github.com/openzfs/zfs/issues/11065 https://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html https://github.com/torvalds/linux/search?q=fideduperange

FICLONE https://github.com/torvalds/linux/search?q=ficlone https://manpages.debian.org/testing/manpages-dev/ioctl_ficlone.2.en.html

Clonefile: Use FICLONE instead of BTRFS_IOC_CLONE on Linux. https://github.com/git-lfs/git-lfs/issues/3792

FICLONERANGE https://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html https://github.com/torvalds/linux/search?q=ficlonerange

remap_file_range https://www.kernel.org/doc/html/latest/filesystems/vfs.html#struct-file-operations https://github.com/torvalds/linux/search?q=remap_file_range&type=code https://github.com/torvalds/linux/search?q=remap_file&type=commits

sys_reflink/vfs_reflink https://github.com/torvalds/linux/search?q=sys_reflink https://lwn.net/Articles/331808/ https://lwn.net/Articles/332802/ https://lwn.net/Articles/331576/

https://stackoverflow.com/questions/65505765/difference-of-ficlone-vs-ficlonerange-vs-copy-file-range-for-copy-on-write-supp

COPY_FILE_RANGE https://git.savannah.gnu.org/cgit/coreutils.git/commit/src/copy.c?id=4b04a0c3b792d27909670a81d21f2a3b3e0ea563 http://manpages.ubuntu.com/manpages/bionic/man2/copy_file_range.2.html https://man7.org/linux/man-pages/man2/copy_file_range.2.html https://github.com/torvalds/linux/search?q=copy_file_range SYNCTHING example API use: https://docs.syncthing.net/advanced/folder-copyrangemethod.html

ZFS support copy_file_range() syscall #4237 https://github.com/openzfs/zfs/discussions/4237

XFS work towards REFLINK support: https://lore.kernel.org/linux-fsdevel/[email protected]/T/ https://blogs.oracle.com/linux/post/upcoming-xfs-work-in-linux-v48-v49-and-v410-by-darrick-wong https://blogs.oracle.com/linux/post/xfs-data-block-sharing-reflink

https://lwn.net/Articles/789623/ "For example, Btrfs could support copy_file_range(); there are cases where Btrfs knows how to copy faster and, if it doesn't, user space can fall back to what it does today. There are five or so filesystems in the kernel that support copy_file_range() and Btrfs could do a better job with copies if this copy API is invoked;"
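The "try the fast path, fall back to what user space does today" pattern from the LWN quote can be sketched as follows. This assumes Python 3.8+ on Linux, where `os.copy_file_range` is available; the helper name `copy_with_fallback` and the chunk size are illustrative only.

```python
import os


def copy_with_fallback(src: str, dst: str, chunk: int = 1 << 20) -> str:
    """Copy src to dst, preferring copy_file_range(); return the method used."""
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        try:
            # Fast path: the kernel (and possibly the filesystem) does the copy.
            while True:
                if os.copy_file_range(src_fd, dst_fd, chunk) == 0:
                    return "copy_file_range"
        except (AttributeError, OSError):
            # Syscall missing or refused: start over with read()/write().
            os.lseek(src_fd, 0, os.SEEK_SET)
            os.lseek(dst_fd, 0, os.SEEK_SET)
            os.ftruncate(dst_fd, 0)
            while True:
                buf = os.read(src_fd, chunk)
                if not buf:
                    return "read/write"
                view = memoryview(buf)
                while view:
                    # os.write may write fewer bytes than requested.
                    view = view[os.write(dst_fd, view):]
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

Whether the fast path actually shares blocks or just copies inside the kernel is up to the filesystem; the caller gets the same file contents either way.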

Now in terms of implementation feedback, guidance and recommendations, I would appreciate some things clarified:

https://www.ctrl.blog/entry/file-cloning.html "OpenZFS isn’t part of the Linux kernel because of licensing issues, and that is unlikely to change. OpenZFS doesn’t support any of the relevant Linux syscalls for cloning files or blocks. It doesn’t offer a replacement for these syscalls on FreeBSD or Linux. (This is why there are no out-of-band deduplication tools for OpenZFS.) Bcachefs isn’t in the kernel yet either, but it’s developed under a Linux-kernel compatible license with the ultimate goal of being merged into the kernel. It supports all the relevant Linux-specific syscalls for file cloning."

I'm a bit confused because, looking through the Linux kernel code, I did NOT see any "EXPORT_SYMBOL_GPL" for any of the kernel APIs or system calls related to the above-mentioned ficlone, ficlonerange, copy_file_range, fidedupe, fideduperange, etc.
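As a side note on what these interfaces look like from user space: FICLONE, FICLONERANGE and FIDEDUPERANGE are ioctl request numbers defined in the kernel's uapi header `include/uapi/linux/fs.h`. The sketch below recomputes them in Python, assuming the common `_IOC` bit layout (x86, ARM) and the struct layouts from the man pages linked above. It says nothing about which in-kernel helpers a filesystem module would need; it only illustrates the user-facing ABI.

```python
import struct

# Direction bits of the common _IOC encoding (assumption: x86/ARM layout).
_IOC_WRITE, _IOC_READ = 1, 2


def _ioc(direction: int, type_: int, nr: int, size: int) -> int:
    """Build an ioctl request number: dir(2 bits) | size(14) | type(8) | nr(8)."""
    return (direction << 30) | (size << 16) | (type_ << 8) | nr


# struct file_clone_range { s64 src_fd; u64 src_offset, src_length, dest_offset; }
FILE_CLONE_RANGE = struct.Struct("=qQQQ")
# struct file_dedupe_range header:
#   u64 src_offset, src_length; u16 dest_count, reserved1; u32 reserved2;
FILE_DEDUPE_RANGE = struct.Struct("=QQHHI")

FICLONE = _ioc(_IOC_WRITE, 0x94, 9, struct.calcsize("=i"))
FICLONERANGE = _ioc(_IOC_WRITE, 0x94, 13, FILE_CLONE_RANGE.size)
FIDEDUPERANGE = _ioc(_IOC_WRITE | _IOC_READ, 0x94, 54, FILE_DEDUPE_RANGE.size)

if __name__ == "__main__":
    for name in ("FICLONE", "FICLONERANGE", "FIDEDUPERANGE"):
        print(f"{name} = {globals()[name]:#010x}")
```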

I think this should be addressed because it seems many people are under the "wrong" impression that OpenZFS has avoided trying to implement REFLINKS by making use of any deep Linux kernel APIs because of some concern with licensing issues and GPL exports etc.

The confusion in regards to this is only growing, see this very INTERESTING 7-page thread about Wine: https://www.phoronix.com/forums/forum/software/linux-gaming/1269446-proposed-reflink-support-would-provide-big-space-savings-for-wine/page7

"...There is a completely different problem here. ZoL developers don't want to say what the problem is. Hooking up Linux kernel reflink interfaces that are under GPLv2 only to CDDL only ZFS deduplication code where the ZFS reflink code is happened to be very legally problematic. Telling users to use other features or there is no gains to the reflink feature means they don't have to say they are license stuck."

"ZFS by design has reflink. If file system driver exposes that to user space that a different matter. Of course when you understand the license problem here you understand why the ZoL developers are attempting to weasel their out of having to implement reflink as userspace usable feature even that they are using reflink feature internally."

Now there are other interesting technical points made in those 7 pages of discussion on phoronix, but I think someone should address the misconception quoted above, because I believe it's not true and may be misleading people into giving up on ZFS thinking that somehow there is some "huge license issue" which will "never ever" allow OpenZFS to grow to its "full potential" on "LINUX".

Anyone know any of the supposed "reflink interfaces" that are "under GPLv2 ONLY"? I didn't see any limited/exported as "GPL ONLY".

Again, I have looked and I did not see anything preventing OpenZFS from using any of those APIs and system calls that XFS uses or that OCFS2 uses or used or BCACHEFS or btrfs etc. I saw no licensing issues, not even export_symbol_gpl, to hinder OpenZFS usage.

If there was any reluctance on the part of OpenZFS maintainers to use some of the Linux APIs above to implement reflinks, I'm guessing it had to do with them being LINUX "specific" and likely their desire to have something that also works cross-platform.

So this is the main reason for this FAST-TRACK FEATURE REQUEST for LINUX ONLY. Too many have cried for this feature since 2010.

Those who have the knowledge and skills, please provide any technical feedback and suggestions on the best approach to implement CP Reflink and Offline Deduplication that works on LINUX only, ideally by leveraging existing LINUX system API calls to facilitate it all.

BEST approach should prioritize:

  1. Easiest/Fastest implementation that works as expected to the end user. ( Meaning hopefully a PROOF-of-CONCEPT in "weeks"? )
  2. AVOID forcing many code changes, if any, on other ZFS internal code, since those might have consequences for others and result in push-back from them. This is yet another reason to make use of existing LINUX APIs as much as possible instead, or to keep the ZFS code separate so it can be easily bypassed or only enabled at build-time by those who wish to do so. If we 'must', we can have a new "FORK".
  3. If the best path entails using some Linux calls that are GPL-only, i.e. exported with "EXPORT_SYMBOL_GPL", please document it, but it may not be a deal breaker since many/most who want these features are "end users" for whom "Distribution" and "GPL" issues are "not relevant at all".

If you are able to commit to help pay/fund/reward willing talent to work on this feature it may help. If you are someone who says: "I don't have money. But what I do have are a very particular set of skills, skills I have acquired over a very long career, skills that..."

Even better! Please let us know if you're willing to give these features a try, or help/guide others willing to, and if some money can help motivate some, I see no problem with that, in fact as I said I was looking for ways to help via bounties or some contract work.

(I'm not sure what the rules are on here, but I "assumed" that if companies can offer ZFS developers funds to work on features for them, then I would think it's ok for regular people to offer bounties/rewards/funds to those willing to tackle features for the community.)

Hopefully some of the ZFS gurus and/or leadership on here can provide some feedback and guidance so this "Fast-Track" can start! :)

(But if someone can't wait to rain on a parade, please do give detailed technical reasons why these features are still "impossibly difficult" compared to XFS and other implementations, regardless of Linux APIs available, showing exactly "why" that is the case.)

If you also would love to see these features "FAST-TRACKED" just for Linux, please make your voice heard :) Thanks!

jittygitty avatar Apr 20 '22 09:04 jittygitty

I think that this is a use case covered by "Block Reference Tracking", or BRT, which @pjd is working on. For details please see the several times this has been discussed at the OpenZFS Leadership Meeting, by searching this doc for "block reference". Last I recall, he was looking for someone to help implement the ZPL layer changes for linux, i.e. hooking it up to reflink.

ahrens avatar Apr 21 '22 04:04 ahrens

@ahrens Thanks for the reply, I was actually about to start on contracting someone to strace/ftrace (or dtrace :) ) through btrfs, ocfs2, and xfs during dedupe, then analyze their sources and just come up with a similar path for implementation on OpenZFS.

What was fascinating about the XFS implementation is that XFS wasn't even copy-on-write to begin with, and does CoW only when needed for reflink.

http://ftp.ntu.edu.tw/linux/utils/fs/xfs/docs/xfs_filesystem_structure.pdf "To support the sharing of file data blocks (reflink), each allocation group has its own reference count B+tree, which grows in the allocated space like the inode B+trees. This data could be collected by performing an interval query of the reverse-mapping B+tree, but doing so would come at a huge performance penalty. Therefore, this data structure is a cache of computable information."
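To make the reference-count idea in that quote concrete, here is a toy in-memory model: a per-block counter, a clone that only bumps counters, and a write that copies a block only when it is shared. All names (`ToyStore`, `reflink`, and so on) are invented for illustration. This is not the XFS refcount B+tree and not the BRT on-disk format, just the bookkeeping principle they share.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    blocks: dict = field(default_factory=dict)    # block id -> data
    refcount: dict = field(default_factory=dict)  # block id -> number of users
    files: dict = field(default_factory=dict)     # file name -> list of block ids
    next_id: int = 0

    def _alloc(self, data: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refcount[bid] = 1
        return bid

    def create(self, name: str, chunks: list) -> None:
        self.files[name] = [self._alloc(c) for c in chunks]

    def reflink(self, src: str, dst: str) -> None:
        """Clone a file: no data is copied, only reference counts go up."""
        ids = list(self.files[src])
        for bid in ids:
            self.refcount[bid] += 1
        self.files[dst] = ids

    def write(self, name: str, index: int, data: bytes) -> None:
        """Overwrite one block; copy-on-write only if the block is shared."""
        old = self.files[name][index]
        if self.refcount[old] > 1:
            self.refcount[old] -= 1
            self.files[name][index] = self._alloc(data)
        else:
            self.blocks[old] = data

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[b] for b in self.files[name])
```

After `reflink("a", "b")` the store still holds the same number of blocks; a later `write("b", 1, ...)` allocates exactly one new block and leaves `a` untouched.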

I didn't know BRT had progressed to almost done now. I had read of it a few years ago; it seemed a wonderful idea, useful for several dedupe scenarios, but it looked far off. I'm just wondering how big is that "2%" left to go?

Do you have any technical info about the "Panzura" temporal dedupe code that they were going to open source to OpenZFS?

I guess that as with BRT, you plan implementing both reflinks and offline deduplication in a more "cross-platform" manner by not tying in too deeply into Linux-only system calls etc? Because the reason I opened this ticket is I was tired of waiting so long and simply wanted it achieved with no regard to avoiding use of any gpl-only hooks or cheats via fuse or user-space Linux helper utils/code interfacing with ZFS, all within rights though, that's why the point to keep it externalized and "optional" etc.

So if BRT is actually very close to being able to provide Reflinks and Offline Dedupe, sure that would be some awesome news!

Years ago I had more cash-flow and would have gladly put thousands towards something like this. Currently, since I'm still not up to par myself for even the ZPL layer and Linux VFS layer and the reflink/remap/ficlone/fidedupe hookup, I'd need to search for an affordable overseas coder to fund to give it a shot. Does he have a page or issue/feature open for such ZPL work etc?

I've always said these issues needed a sort of crowd-funding button built-in, kind of kick-starter style: funds accumulate and he that implements gets the key. Some don't like it, but it's not that much different than having corporate sponsors from my view.

Anyway on a related note, and I'm ok with some private message if certain etiquette behooves it on project maintainers, I would appreciate some feedback to my recent building questions here: https://github.com/openzfs/zfs/issues/11357

thx!

jittygitty avatar Apr 21 '22 10:04 jittygitty

@pjd Thanks for your work on BRT. I was hoping you could help offer some guidance and also clarify a couple things.

After watching your video "File Cloning with Block Reference Table by Pawel Dawidek" from the 2020 OpenZFS Developers Summit, I was hoping you could clarify the technical "reasons" as to why cp --reflink implementation must be much more difficult "compared to BTRFS or XFS" support for "cp --reflink".

Because in the 1st minute of the video, you described our question exactly, ie ZFS is CoW copy-on-write and already has "snapshots", something that XFS did not have before it implemented "cp --reflink" and maybe still doesn't "fully have". So could you or perhaps @ahrens or @behlendorf please try to explain a bit to us the technical reasons why it's much more difficult for ZFS to do reflinks "versus XFS" or btrfs? (Is the problem blocks versus extents? How? Or B+trees vs Merkle?)

My other question: what do XFS and btrfs use as the functional equivalent of the BRT (Block Reference Table) for cp --reflink and offline deduplication?

Also @pjd in the previous post @ahrens said that you were "looking for someone to help implement the ZPL layer changes for linux, i.e. hooking it up to reflink." (which I think uses FICLONE ioctl?) Could you please offer some guidance as to which files in the ZFS tree should probably be modified and/or what directory to add equivalent of btrfs reflink_copy reflink.c etc new code?

jittygitty avatar Apr 25 '22 11:04 jittygitty

After watching the video ( https://www.youtube.com/watch?v=hYBgoaQC-vo ) on possible reflinks implementation via new BRT feature (see also https://www.youtube.com/watch?v=AkWVDs5VZIY ), I was bummed out to hear that via that method it would be impossible to preserve reflinks over zfs send/receive. So I started to dig a bit deeper...

I know that Oracle Solaris ZFS supports reflinks, though I don't know if they break during zfs send/receive.

But I've read that btrfs can indeed preserve reflinks over btrfs send/receive and so I was wondering if we could implement an "extents" based abstraction layer so-to-speak that could leverage existing reflinks code or methods already inside the linux kernel which are being used by btrfs/xfs/ocfs2 etc.

Would that not save us some work and even help make preserving reflinks over zfs send/receive possible?

(As a side-point are there any additional constraints on preserving reflinks over zfs send/receive if encryption is enabled? Since I'm reminded of this dedupe issue https://github.com/openzfs/zfs/discussions/9423 )

Searching for extents<->blocks abstraction methods I found this interesting class assignment: "In this project, you'll be changing the existing xv6 file system to use extents rather than pointers." https://pages.cs.wisc.edu/~remzi/Classes/537/Fall2011/Projects/p5.html

So again, the whole point of this FAST-Tracking REFLINK support "first" for LINUX only is that I think we can do this faster/better if we postpone supporting BSD or illumos kernels which don't already have existing "reflinks" related code in kernel.

Perhaps BRT can have a place in facilitating the blocks<->extents tracking/remapping, but after bit of online research I "discovered" the existence of Linux "FIEMAP"!

https://www.kernel.org/doc/html/latest/filesystems/fiemap.html Fiemap, an extent mapping ioctl (2008) https://lwn.net/Articles/283771/ https://github.com/torvalds/linux/blob/master/Documentation/filesystems/fiemap.rst

And it is not a gpl-only export: https://github.com/torvalds/linux/search?q=fiemap+AND+export_symbol https://github.com/torvalds/linux/blob/88e6c0207623874922712e162e25d9dafd39661e/fs/ioctl.c https://github.com/torvalds/linux/blob/master/include/uapi/linux/fiemap.h https://github.com/torvalds/linux/blob/master/include/linux/fiemap.h

(Though iomap_fiemap and iomap_bmap are both gpl-only exports that shouldn't be a problem. If you think it is, explain.) https://github.com/torvalds/linux/blob/master/fs/iomap/fiemap.c

I found a very interesting article on clonefiles/reflinks and fiemap here: http://www.wolczko.com/nvm-blog/Clonefiles.pdf

https://stackoverflow.com/questions/46417747/apple-file-system-apfs-check-if-file-is-a-clone-on-terminal-shell https://github.com/dyorgio/apfs-clone-checker https://github.com/torvalds/linux/blob/master/include/uapi/linux/fiemap.h https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

Unfortunately for some reason his extents code repository: https://github.com/mwolczko/extents is no longer available.

Also INTERESTING read: "We have built a prototype implementation of MapFS based on Btrfs. We make use of an existing Btrfs ioctl, BTRFS IOC CLONE RANGE, to perform much of the hard work of creating new mappings. MapFS adds around 200 lines of code to the Btrfs sources and presently supports only the fremap function." https://www.usenix.org/legacy/event/hotstorage11/tech/final_files/Wires.pdf

See FREMAP linux code: https://github.com/torvalds/linux/search?q=fremap&type=commits https://docs.huihoo.com/doxygen/linux/kernel/3.7/fremap_8c_source.html https://github.com/torvalds/linux/search?q=rmap https://github.com/torvalds/linux/blob/master/fs/remap_range.c https://github.com/torvalds/linux/blob/master/mm/rmap.c

Anyway it seems that with an extents based approach there's plenty of existing Linux kernel reflinks enabling code that can be leveraged directly or indirectly, but if I'm missing something please do explain. I would appreciate some feedback.

Perhaps implementing FIEMAP should be a priority, since it would be helpful in many ways, from reflinks implementation to other features such as filefrag support?

@ryao @behlendorf @tonyhutter @ahrens @lundman @happyaron @pjd I would really appreciate your thoughts on what I wrote about above, in terms of leveraging Linux FIEMAP and existing extents based reflink code etc.

I searched openzfs tree: https://github.com/openzfs/zfs/search?q=fiemap&type=issues https://github.com/openzfs/zfs/issues/264 https://github.com/openzfs/zfs/issues/7110 Expose the number of hole blocks in a file https://github.com/openzfs/zfs/pull/7392 https://github.com/openzfs/zfs/issues/11900

It looks like FIEMAP support was started already but just stalled? Did it stall because it needs to be redone or the existing code is fine and just needs bit of work to complete? https://github.com/openzfs/zfs/pull/7545 https://github.com/openzfs/zfs/pull/9554 https://github.com/openzfs/zfs/issues/9552 https://github.com/openzfs/zfs/pull/9553

It seems running filefrag on zfs fails with a FIBMAP "invalid argument" error, apparently because fiemap isn't implemented, so filefrag tries fibmap instead, which also fails? https://man7.org/linux/man-pages/man8/filefrag.8.html https://www.linux.org/threads/intro-to-extents.8625/
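A quick way to check from user space whether a filesystem answers the FIEMAP ioctl at all is sketched below. It assumes the `struct fiemap` header layout from `<linux/fiemap.h>` and the request number `_IOWR('f', 11, struct fiemap)` in the common ioctl encoding; `count_extents` is a made-up helper name. With `fm_extent_count` set to 0 the kernel only reports how many extents it would return.

```python
import fcntl
import struct

# _IOWR('f', 11, struct fiemap); assumes the common ioctl encoding (x86, ARM).
FS_IOC_FIEMAP = 0xC020660B
# struct fiemap header: u64 fm_start, fm_length;
#                       u32 fm_flags, fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HEADER = struct.Struct("=QQIIII")
FIEMAP_MAX_OFFSET = 0xFFFFFFFFFFFFFFFF


def count_extents(path: str):
    """Return the number of extents FIEMAP reports, or None if unsupported."""
    # fm_extent_count = 0: ask only for the count, not the extent records.
    buf = bytearray(FIEMAP_HEADER.pack(0, FIEMAP_MAX_OFFSET, 0, 0, 0, 0))
    with open(path, "rb") as f:
        try:
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf, True)
        except OSError:
            return None
    return FIEMAP_HEADER.unpack(buf)[3]  # fm_mapped_extents
```

A `None` result corresponds to the situation described above, where a tool such as filefrag has to fall back to another method.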

Maybe fiemap may help in future attempts and new methods at tackling long term fragmentation? ZFS fragmentation and BPR block pointer rewrite https://github.com/openzfs/zfs/issues/3582 https://groups.google.com/access-error?continue=https://groups.google.com/a/zfsonlinux.org/g/zfs-discuss/c/-/m/rsuNhybgf7IJ

For reference (todo move more links to file attachment?) https://github.com/torvalds/linux/search?q=reflink https://github.com/torvalds/linux/blob/master/fs/btrfs/reflink.h https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_reflink.c https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_reflink.h OCFS2 reflinks support https://lwn.net/Articles/402287/ https://github.com/torvalds/linux/search?q=ocfs2_reflink_remap_blocks&type=code https://github.com/torvalds/linux/blob/a48b0872e69428d3d02994dcfad3519f01def7fa/fs/ocfs2/refcounttree.h https://github.com/torvalds/linux/blob/3bf03b9a0839c9fb06927ae53ebd0f960b19d408/fs/ocfs2/file.c XFS: "Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure."" https://github.com/torvalds/linux/commit/c2aa1a444cab2c673650ada80a7dffc4345ce2e6

Interesting we have a B-TREE in ZFS already? https://github.com/openzfs/zfs/blob/1c41d8941cb5a76d71930d2af976c376c05ed318/module/zfs/btree.c

Interesting FIEMAP links in relation to reflinks on Linux: https://www.spinics.net/lists/linux-btrfs/msg110130.html copy_file_range and fiemap: https://mail.gnu.org/archive/html/bug-cpio/2021-03/msg00007.html New ->fiemap infrastructure and ->bmap removal: https://lwn.net/Articles/795013/ XFS_IO fiemap ficlone dedupe etc: https://man7.org/linux/man-pages/man8/xfs_io.8.html LUSTRE LREFLINK Reflink support: https://www.eofs.eu/_media/events/devsummit19/lustre_reflink.pdf https://wiki.lustre.org/Lreflink_High_Level_Design https://unix.stackexchange.com/questions/263309/how-to-verify-a-file-copy-is-reflink-cow OCFS2/Reflink-Illustrated: https://oss.oracle.com/osswiki/OCFS2/Reflink-Illustrated.html https://github.com/torvalds/linux/find/master

"potentially" related or useful: Add TRIM support #8419 https://github.com/openzfs/zfs/pull/8419 SEEK_DATA fails upon a file just opened after being rewritten by mmap #11697 https://github.com/openzfs/zfs/issues/11697 RFC: using DAX as a workaround to access ZVOLs (and possibly ZPL entries) #9986 https://github.com/openzfs/zfs/issues/9986 OpenZFS - 6363 Add UNMAP/TRIM functionality #5925 https://github.com/openzfs/zfs/pull/5925 How to compare two datasets for verification? One of them is replicated from another one. #8888 https://github.com/openzfs/zfs/issues/8888 TRIM/UNMAP/DISCARD support for vdevs (2) #1016 https://github.com/openzfs/zfs/pull/1016 SATA trim for vdev members #598 https://github.com/openzfs/zfs/issues/598 Add interface for file hole punching #168 https://github.com/openzfs/spl/pull/168 Implement fallocate FALLOC_FL_PUNCH_HOLE #2619 https://github.com/openzfs/zfs/pull/2619 https://github.com/openzfs/zfs/search?q=extent&type=code https://github.com/openzfs/zfs/search?q=extents&type=code

Apologies for the long post. I realize it's a bit of a brainstorm, but hopefully it is helpful to those able/willing to contribute code towards fast-tracking reflink and offline-dedupe support on LINUX. I'll try to clean it up later, maybe move some links into a file attachment.

Now I realize that @pjd who works on FreeBSD maybe can't contribute code to this "FAST Tracking REFLINK support" that would only benefit LINUX first for now. Same likely for those using opensolaris/illumos kernel. So what I'd like to ask is this:

@ahren @wca @behlendorf @pjd @scsiguy @allanjude @delphij @lundman @mmatuska @grwilson @happyaron @ryao Please feel free to notify others I may have missed! (I can't find Chris Siden handle, all taken from: https://openzfs.org/wiki/Contributors )

IMPORTANT QUESTION below for the OpenZFS maintainers, project leaders, and top contributors:

Would you "ACCEPT" a feature implementation Pull-Request that: A. Required running a script to modify the Linux Kernel sources or at least Kernel Header sources, before building ZFS?

Would you "ACCEPT" a feature implementation Pull-Request that: B. Required running a script that "modified" the ZFS/module sources? (Similarly to A above, the reason would probably be to avoid any 'potential' license warriors having any avenue to complain about code licensing incompatibilities. Doing away with any "distribution" and giving the ultimate "end user" the option of running a "script" that patches, disarms such hardliners.)

Of course, code "refactoring" and re-organizing could do away with the need for either A or B altogether, but I'm really yearning for cp --reflink support as fast as possible, hence this ticket "FAST Tracking REFLINKS". So I'm looking for the easiest/fastest route to implementation even if it needs to be "cleaned" up a bit for official acceptance into the ZFS tree. Initially, we could have a separate zfs reflink utility instead of using the gnu linux coreutils cp command; shortcuts can be refactored/fixed later.

Can we do a POLL here to see what BOUNTY platform we should use, where all interested in reflink/offline-dedupe can fund?

I was looking at BountySource https://bountysource.com but heard lately that after some time you can't get your unclaimed funds back from them. I just found out about https://www.openbugbounty.org. It would be good to have a poll and decide on one since there are so many: ( https://gitpay.me , https://gitcoin.co , https://bounty0x.io/bnty , https://issuehunt.io , https://tip4commit.com , https://liberapay.com , https://flattr.com , http://en.goteo.org )

Would be great to get those would-be ZFS patron funders together to help fast-track this issue and others. So if you're able and willing to contribute funds or if you're a bounty hunter, would be great to get feedback and preferred "platform" to use.

This is a mission for someone with "very particular set of skills, skills"..."acquired over a very long career, skills that..." Anyone out there that would choose to accept it? ZFS_reflink_offline_dedupe_ResearchLinks_etc_jittygitty.txt

jittygitty avatar Apr 30 '22 03:04 jittygitty

I think that this is a use case covered by "Block Reference Tracking", or BRT, which @pjd is working on. For details please see the several times this has been discussed at the OpenZFS Leadership Meeting, by searching this doc for "block reference". Last I recall, he was looking for someone to help implement the ZPL layer changes for linux, i.e. hooking it up to reflink.

@ahrens Are we sure BRT is needed for file-level reflink? It seems like something akin to a "file-level snapshot", right?

shodanshok avatar May 09 '22 07:05 shodanshok

FIEMAP has the same issue as bmap, which assumes 1 filesystem = 1 block device. That is inherently incompatible with multiple disk pools (unless the additional disks are for L2ARC, SLOG or a special device that is only for metadata). When it is used to write directly to the disk (effectively bypassing the filesystem), it is incompatible with checksums. It also should not interact well with compression or encryption. If it were implemented for the single disk case, I can imagine software that uses it being confused by the copies property too.

ryao avatar Sep 14 '22 18:09 ryao

I think that this is a use case covered by "Block Reference Tracking", or BRT, which @pjd is working on. For details please see the several times this has been discussed at the OpenZFS Leadership Meeting, by searching this doc for "block reference". Last I recall, he was looking for someone to help implement the ZPL layer changes for linux, i.e. hooking it up to reflink.

@ahrens Are we sure BRT is needed for file-level reflink? It seems something as a "file-level snapshot", right?

I do not see another way of doing it.

ryao avatar Sep 14 '22 18:09 ryao

@ryao I would naively expect that the basic requirements for reflink are the same as those for snapshot/clone, but at the file level rather than the dataset level.

I am sure to be wrong (otherwise reflink would already be implemented), but I would just like to understand why. Thanks.

shodanshok avatar Sep 14 '22 19:09 shodanshok

Off the top of my head, datasets have their own dedicated objsets in which the bookkeeping needed for operations on snapshots and clones is kept. Individual files do not.

ryao avatar Sep 14 '22 20:09 ryao

@ryao If zfs architecture is different in that it requires dealing with pools/datasets not directly with block devices to map file extents, does fiemap have to "know" any of that if we have some abstraction layer? https://www.kernel.org/doc/html/latest/filesystems/fiemap.html

jittygitty avatar Sep 17 '22 21:09 jittygitty

@jittygitty I do not understand what you mean by an abstraction layer.

The fiemap ioctl assumes 1 filesystem = 1 block device. It expects 64-bit LBAs. Internally, ZFS does not use 64-bit LBAs. Instead, it uses 128-bit DVAs, which primarily contain a vdev number, an offset and a size. This is inherently incompatible with fiemap. There is no way to translate this into something useful to fiemap.
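The mismatch can be illustrated with a small sketch. It assumes the `struct fiemap_extent` layout from `<linux/fiemap.h>` (a single 64-bit `fe_physical` field per extent) and models a DVA only by the three parts named above. `ToyDVA` and `naive_fiemap_extent` are hypothetical names, and the real DVA bit layout is deliberately not reproduced here.

```python
import struct
from dataclasses import dataclass

# struct fiemap_extent: u64 fe_logical, fe_physical, fe_length, fe_reserved64[2];
#                       u32 fe_flags, fe_reserved[3]
FIEMAP_EXTENT = struct.Struct("=QQQ2QI3I")


@dataclass(frozen=True)
class ToyDVA:
    """Simplified stand-in for a ZFS DVA: which vdev, where on it, how big."""
    vdev: int
    offset: int
    size: int


def naive_fiemap_extent(logical: int, dva: ToyDVA) -> bytes:
    """Hypothetical lossy mapping: fe_physical has room for the offset only."""
    return FIEMAP_EXTENT.pack(logical, dva.offset, dva.size, 0, 0, 0, 0, 0, 0)


if __name__ == "__main__":
    on_disk0 = naive_fiemap_extent(0, ToyDVA(vdev=0, offset=4096, size=8192))
    on_disk1 = naive_fiemap_extent(0, ToyDVA(vdev=1, offset=4096, size=8192))
    # Two different physical locations become indistinguishable.
    print(on_disk0 == on_disk1)
```

Blocks at the same offset on two different vdevs produce byte-identical extent records, so a tool reading them could not tell which device the data lives on.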

I do not understand why you even are asking about fiemap. It is not useful for reflink support.

ryao avatar Sep 17 '22 21:09 ryao

The confusion in regards to this is only growing, see this very INTERESTING 7-page thread about Wine: https://www.phoronix.com/forums/forum/software/linux-gaming/1269446-proposed-reflink-support-would-provide-big-space-savings-for-wine/page7

"...There is a completely different problem here. ZoL developers don't want to say what the problem is. Hooking up Linux kernel reflink interfaces that are under GPLv2 only to CDDL only ZFS deduplication code where the ZFS reflink code is happened to be very legally problematic. Telling users to use other features or there is no gains to the reflink feature means they don't have to say they are license stuck."

"ZFS by design has reflink. If file system driver exposes that to user space that a different matter. Of course when you understand the license problem here you understand why the ZoL developers are attempting to weasel their out of having to implement reflink as userspace usable feature even that they are using reflink feature internally."

Now there are other interesting technical points made in those 7 pages of discussion on phoronix, but I think someone should address the misconception quoted above, because I believe it's not true and may be misleading people into giving up on ZFS, thinking that there is some "huge license issue" which will "never ever" allow OpenZFS to grow to its "full potential" on "LINUX".

Done:

https://www.phoronix.com/forums/forum/software/linux-gaming/1269446-proposed-reflink-support-would-provide-big-space-savings-for-wine?p=1346541#post1346541

I do not like to speak ill of people, but that guy is well known for spreading FUD about non-GPL software. He is not a reliable source of information.

ryao avatar Sep 17 '22 21:09 ryao

@ryao Many thanks for posting there and clarifying the misconceptions. I knew the guy was wrong, but I didn't have the technical expertise to prove him wrong. I was upset that people keep dismissing zfs based on some perceived huge "License" incompatibility; as you have seen in my Licensing posts on here, I had been trying hard to show that the ZFS license is not a problem for Linux at all.

(see my attempt to address LICENSE incompatibility naysayers here: #13415 )

As far as the usefulness of fiemap, I thought (but perhaps I was wrong, so you can correct me) that it's useful to be able to use regular userland tools to leverage reflinks for such things as deduplication, or filefrag for example. I'm guessing we could have separate zfs-specific userland utilities, but I thought it may be useful to be able to use existing userland tools that can already use fiemap etc.

When I posted this issue, I had asked those with superior knowledge (I'm not one), to please enlighten the rest of us a bit, it took a while, but I'm glad you've posted :)

Anyway, when I discovered that @pjd had made unexpectedly fast progress with a more cross-platform approach instead of just fast-tracking Linux, I thought that was great.

Thanks again for taking the time to address those "LICENSE incompatibility" conspiracy theories as being "the reason" why ZFS didn't have reflinks yet! (ie at: https://www.phoronix.com/forums/forum/software/linux-gaming/1269446-proposed-reflink-support-would-provide-big-space-savings-for-wine?p=1346541#post1346541 )

Also, I think that perhaps many of us initially thought the existing CoW architecture and existing snapshot/clones functionality should have made reflinks support easy.

But it's interesting that you pointed out: "People had proposed reusing this logic in the past to implement reflinks, but it never went anywhere. The main problem with it is that it could not allow you to make reflinks across datasets."

https://www.phoronix.com/forums/forum/software/linux-gaming/1269446-proposed-reflink-support-would-provide-big-space-savings-for-wine?p=1346549#post1346549

I'm curious, going that route, and being limited to reflinks only "within" datasets not across them, what are the "advantages" of such a method, if any? Easier than current BRT effort or not really, how about implication for zfs send/receive? Thanks!

(I for one would have been very HAPPY to have reflinks even just inside a dataset! )

jittygitty avatar Sep 18 '22 23:09 jittygitty

I should clarify the "reason" that path @ryao mentioned referencing snapshot/clone intrigues me, is because last I heard @pjd said that preserving reflinks over ZFS SEND/RECEIVE would "not be possible" via his BRT method.

And given the two options below, I personally would prefer Option (B):

A. You can get pool-wide (across datasets) offline deduplication and reflinks, BUT they will break over ZFS send/receive and will take double space on the other side.

B. You can get offline deduplication and cp reflinks which are preserved fine over zfs send/receive, BUT all this will be ONLY "within" the same dataset, not across them.

I'm curious, given the above, am I the ONLY one who'd prefer option B?

jittygitty avatar Sep 20 '22 00:09 jittygitty

@jittygitty I am not sure if the earlier reflink idea would have allowed them to be sent over send/recv. Honestly, any extensions to maintain reflinks over send/recv would break backward compatibility, so I would rather not attempt that. There was a plan to implement ZFS version 1000 (as in zfs upgrade -v) and use feature flags so that extensions like these could be done, but that never materialized. We would want to do that as a prerequisite for any such functionality.

ryao avatar Sep 20 '22 00:09 ryao

Can I ask something? From PR 13392:

Interaction between Deduplication and Block Cloning.

... To avoid this dilemma BRT cooperates with DDT - if a given block is being cloned using BRT and the BP has the D (dedup) bit set, BRT will lookup DDT entry and increase the counter there. No BRT entry will be created for a block that resides on a dataset with deduplication turned on. ...

If you changed this process to force the dedup bit to be set and to add or increase the counter in the DDT, wouldn't that work for reflinks without too many breaking changes? Or am I missing something? I think BRT is needed for performance reasons only.
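To make the quoted interaction concrete, here is a toy model of the two reference-count tables. `ToyPool`, `write_block` and `clone_block` are hypothetical names for illustration only; the real DDT and BRT are on-disk structures keyed very differently, so this only mirrors the rule stated in the PR text.

```python
from collections import Counter


class ToyPool:
    """Toy refcount bookkeeping; NOT the real DDT/BRT implementation."""

    def __init__(self):
        self.ddt = Counter()  # refcounts for blocks written with dedup on
        self.brt = Counter()  # extra references for cloned non-dedup blocks

    def write_block(self, block_id, dedup):
        # A block written on a dedup-enabled dataset gets a DDT entry right away.
        if dedup:
            self.ddt[block_id] += 1

    def clone_block(self, block_id, dedup_bit):
        # Rule from the PR text: if the D bit is set, bump the existing DDT
        # entry and create no BRT entry; otherwise track the clone in the BRT.
        if dedup_bit:
            self.ddt[block_id] += 1
        else:
            self.brt[block_id] += 1


pool = ToyPool()
pool.write_block("blk-dedup", dedup=True)
pool.write_block("blk-plain", dedup=False)
pool.clone_block("blk-dedup", dedup_bit=True)
pool.clone_block("blk-plain", dedup_bit=False)
print(dict(pool.ddt), dict(pool.brt))
```

In this model a plain block costs nothing until it is cloned, while a dedup block carries a table entry from the moment it is written.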

mix5003 avatar Sep 20 '22 02:09 mix5003

Unfortunately, that bit is in every block pointer everywhere. If you change it in just one, then the checksum in the pointer pointing to that pointer is invalid, requiring you to make a new checksum, making yet another checksum invalid. This continues until you get to the root of the merkle tree. You also need to do this for every block that is missing the set bit. There are no return pointers either, so you need to remember how you got there, and also consider pointers coming from old txgs in case you need to do rollback… that idea is called block pointer rewrite and it is basically a unicorn. Sun had it in the lab when Oracle acquired them, but it did not perform well.
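The checksum cascade described above can be shown with a toy Merkle tree. This is only a sketch: `Node` is a hypothetical class, the tree has three levels, and real ZFS stores the child's checksum inside the parent's block pointer rather than hashing payloads like this.

```python
import hashlib


class Node:
    """Toy tree node whose digest covers its payload and all child digests."""

    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = list(children)

    def digest(self):
        h = hashlib.sha256(self.payload)
        for child in self.children:
            h.update(child.digest().encode())
        return h.hexdigest()


leaf_a = Node(b"bp: dedup=0")
leaf_b = Node(b"bp: dedup=0 (other block)")
indirect = Node(b"indirect block", [leaf_a, leaf_b])
root = Node(b"root", [indirect])

nodes = {"leaf_a": leaf_a, "leaf_b": leaf_b, "indirect": indirect, "root": root}
before = {name: node.digest() for name, node in nodes.items()}

# Flip the "dedup bit" in a single pointer.
leaf_a.payload = b"bp: dedup=1"

changed = [name for name, node in nodes.items() if node.digest() != before[name]]
print(changed)  # ['leaf_a', 'indirect', 'root']
```

Only the sibling `leaf_b` keeps its checksum; everything on the path from the modified pointer to the root has to be rewritten, and that is for a single block.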

Minor disk format changes would be needed to reuse the DDT for reflinks in general and the end result would not be as performant as BRT.

ryao avatar Sep 20 '22 02:09 ryao

While I am looking forward to seeing this implemented, and thank everyone involved, breaking reflinks on send/recv is a huge red flag to me. This basically means that any source dataset with heavy reflink usage would be non-transferable to a similar destination pool due to out-of-space concerns.

shodanshok avatar Sep 20 '22 06:09 shodanshok

@ryao I had read (ashamedly I haven't tested it yet myself) that btrfs can preserve reflinks and deduplication over btrfs send/receive (maybe via -c or -p flag ).

It seems their send/receive streams tons of instructions to write/link/clone, which get replayed on the other side. If zfs send/receive needs to be extended to allow a similar feature, wouldn't that be doable, and unobjectionable if the on-disk format won't be touched?

(I'm thinking there may be ways to avoid breaking backward compatibility since older zfs won't have reflinks etc? And omitting newer flag/arguments to older zfs. Seems doable, but it's late and I'm tired so apologies if I miss some big elephant!)
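As a rough illustration of the "stream of instructions that gets replayed" idea, here is a toy receiver. The `("write", ...)` and `("clone", ...)` tuples are a made-up format for this sketch, not the actual btrfs or zfs send stream; the point is only that a clone instruction lets the receiver share blocks instead of storing the data twice.

```python
def replay(stream):
    """Replay a toy send stream; returns (files, blocks)."""
    blocks = {}  # block id -> data actually stored on the receiver
    files = {}   # file name -> list of block ids
    next_id = 0
    for op in stream:
        if op[0] == "write":
            _, name, data = op
            blocks[next_id] = data
            files.setdefault(name, []).append(next_id)
            next_id += 1
        elif op[0] == "clone":
            _, src, dst = op
            # Share the source's blocks instead of receiving the data again.
            files[dst] = list(files[src])
        else:
            raise ValueError(f"unknown op: {op[0]!r}")
    return files, blocks


stream = [
    ("write", "a.img", b"x" * 4096),
    ("write", "a.img", b"y" * 4096),
    ("clone", "a.img", "b.img"),
]
files, blocks = replay(stream)
print(files, len(blocks))
```

Two files end up on the receiving side, but only two blocks are stored, because the clone is replayed as a reference rather than as data.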

jittygitty avatar Sep 20 '22 07:09 jittygitty

@mix5003 Yes, I think you are right, it seems BRT is like a "lite" version of the DDT. I'm no expert, but I think if you force dedupe on to do the counters, then you have to keep it on all the time, and dedupe has been problematic due to RAM usage and performance, so just about everyone runs away from turning it on.

jittygitty avatar Sep 20 '22 07:09 jittygitty

We could hide a backward incompatible extension via a flag to zfs send and a bit could be set on the stream when that is done that would break recv backward compatibility, but the first thing that needs to be done is to merge BRT into master. An extension to send/recv would be a follow up improvement.
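A minimal sketch of that gating idea, under stated assumptions: `FEATURE_REFLINKS` and both function names are hypothetical, not real zfs stream flags or APIs. The sender sets a bit only when the new flag is passed, and a receiver refuses any stream carrying a bit it does not understand.

```python
FEATURE_REFLINKS = 1 << 0  # hypothetical stream feature bit


def make_stream_features(preserve_reflinks):
    """Sender side: set the bit only when the opt-in flag is given."""
    return FEATURE_REFLINKS if preserve_reflinks else 0


def can_receive(stream_features, supported_features):
    """Receiver side: reject streams that carry any unknown feature bit."""
    unknown = stream_features & ~supported_features
    return unknown == 0


OLD_RECEIVER = 0                 # knows no extensions
NEW_RECEIVER = FEATURE_REFLINKS  # understands the extension

plain = make_stream_features(preserve_reflinks=False)
extended = make_stream_features(preserve_reflinks=True)

print(can_receive(plain, OLD_RECEIVER))     # default streams stay compatible
print(can_receive(extended, OLD_RECEIVER))  # opt-in streams are refused by old receivers
print(can_receive(extended, NEW_RECEIVER))
```

Streams sent without the flag stay receivable everywhere, so compatibility only breaks for users who explicitly opt in.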

ryao avatar Sep 20 '22 12:09 ryao

@ryao As far as using fiemap for userland utils like filefrag and offline dedupe tools etc., what I meant by abstraction was some mechanism like the one I see @adilger proposed at: https://github.com/openzfs/zfs/issues/264#issuecomment-2636343

"Lustre uses a patched version of filefrag which allows returning the underlying device for each extent (fe_device), because the file is not located on a single LUN. For Lustre this is an index value (0-N). For ZFS one might either return the Linux block device (major << 16 | minor) or 32 bits of the VDEV GUID or similar."

Also: https://github.com/openzfs/zfs/issues/264#issuecomment-346186484

If you have time to take a look sometime, your input could be useful in the fiemap PRs etc. I've learned a lot from all of you guys on here, and I'm still learning, so thanks!

In fact, I still wish there was a way for all of us little guys to contribute in some sort of crowd-funding way towards features or specific code contributors, ie my #13397

And thanks again for confirming you think an extension to zfs send/recv is doable, since that's a huge deal for a lot of us I think, or at least for me and @shodanshok !

jittygitty avatar Sep 25 '22 01:09 jittygitty

Sorry if this is the wrong place for this post...

@pjd - First of all, thank you for working on this feature, which is very much welcomed by many OpenZFS users. I watched your explanation of BRT in the video of the January (or February?) 2022 OpenZFS Leadership Team meeting. It was very useful to understand the high level principles behind the implementation.

I see that your typical use case is relatively large (at least for me) files, of around 1GB. I would like to ask you to please not restrict yourself to this case when making design decisions about BRT. There are many applications where the duplicated files would be much smaller. One example (that Matt already made in the video) is camera images. Another one is executables and libraries in LXD root containers - if you have many users using their own container(s) (even if you start from cloning the same image) after a while they will update their systems, install files, and many (most?) of those new files will be duplicated.

Allowing for an offline tool that can search for duplicates and replace them with reflinks (like dduper or bees for BTRFS) could lead to huge savings. So a BRT that works efficiently (in time and space) also for smaller files would be very important for many of us. Thanks!
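The "search for duplicates" half of such an offline tool can be sketched in a few lines. This is a whole-file, hash-based sketch with hypothetical function names; existing tools are more sophisticated, and the step that would replace duplicates with reflinks is deliberately left out.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are not loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_files(root):
    """Return groups of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


# Small demo: two identical "libraries" and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("a.so", b"lib" * 1000), ("b.so", b"lib" * 1000), ("c.txt", b"other")]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
    groups = find_duplicate_files(tmp)
    names = [[os.path.basename(p) for p in g] for g in groups]

print(names)  # [['a.so', 'b.so']]
```

With many small files, as in the container example, every group found this way would become one shared copy plus references, which is why per-reference overhead in the BRT matters for this use case.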

lciti avatar Dec 02 '22 22:12 lciti

Hi,

Does the current implementation of reflinks (from git/master) allow send/receive? Does it have any limitations, or is it unstable?

isopix avatar Mar 27 '23 20:03 isopix

@isopix Did you test on Linux or on BSD? I thought they said it wasn't hooked up to cp --reflink on Linux 'yet'?

( @pjd 's PR for Block Cloning; ie dedupe/cp --reflink via BRT Block Reference Table: https://github.com/openzfs/zfs/pull/13392 )

Although it seems they've been working on replay operations which means maybe they're trying to get receive side to replay the deduplication? I only skimmed quick.
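For context on what being "hooked up to cp --reflink" involves on Linux: as far as I understand, the clone request reaches the filesystem through the `FICLONE` ioctl. A minimal sketch of that call follows, with a fallback to a regular copy; `clone_or_copy` is a hypothetical helper, and whether the clone succeeds depends entirely on the filesystem the files live on.

```python
import fcntl
import os
import shutil
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int), as defined in linux/fs.h


def clone_or_copy(src, dst):
    """Try to reflink src to dst; fall back to a byte copy. True if cloned."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Ask the filesystem to make dst share src's blocks.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            # Filesystem does not support cloning: copy the data instead.
            shutil.copyfileobj(s, d)
            return False


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    with open(src, "wb") as f:
        f.write(b"reflink me" * 100)
    cloned = clone_or_copy(src, dst)
    with open(dst, "rb") as f:
        copied = f.read()

print("cloned" if cloned else "copied", len(copied))
```

Either way the destination ends up with the same bytes; the difference is only whether the blocks are shared.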

jittygitty avatar Mar 28 '23 00:03 jittygitty

I was thinking that reflinks and ZSTD Early Abort are done in git's master, and that's why so many people use it, instead of stable releases. But maybe I'm wrong.

I also forgot to put question mark in my previous post ;-)

isopix avatar Mar 28 '23 01:03 isopix

"Does the current implementation of reflinks (from git/master) allow send/receive?"

They do not persist across send/receive. The implementation is done at a very low level in the disk format that is not part of what is sent.

I had an idea for a way to partially persist the information in theory, but it sadly does not cover one case (reflinks that reference already-sent data in incremental streams).

"Does it have any limitations, or is it unstable?"

It is missing test cases in the ZTS. Also, NULL pointer dereference bugs were found in it shortly after merging it. I made a comment on the PR fixing them that I did not think they went far enough, although the PR was enough to handle the issues that were initially found. We currently have a verify statement that will trip if that case is hit, necessitating a reboot.

It could be that it is solid, but we need more time to gather information.

Linux support is not yet implemented. I am hopeful that test cases will precede Linux support, since we run the tests far more often on Linux than on FreeBSD, and having them run on Linux will do a good job of helping shake out bugs.

ryao avatar Mar 28 '23 03:03 ryao

I'm currently working on Linux support and have some kind of working version right now. But I'm fairly new to the ZFS code base. Thanks a lot to all the people who provided so much information about reflinks and more. This helped me a lot to get into the VFS in Linux and the BRT in ZFS.

oromenahar avatar Jun 18 '23 10:06 oromenahar

Hi @oromenahar I started this thread but have been away for a while. I just wanted to say a big well-earned CONGRATS to you!

For being so new to the ZFS codebase you have made some really great contributions! Many thanks to you and @robn for your recent work in getting Linux hooked up into BRT. If we keep gaining people like you, ZFS will advance even faster.

jittygitty avatar Aug 02 '23 22:08 jittygitty

@oromenahar a huge thank you from all of us ZFS users!!

diekhans avatar Aug 03 '23 15:08 diekhans