Fiemap[Still needs to address comments from original PR7545]
Motivation and Context
Description
How Has This Been Tested?
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [ ] My code follows the ZFS on Linux code style requirements.
- [ ] I have updated the documentation accordingly.
- [ ] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
- [ ] All commit messages are properly formatted and contain
Signed-off-by.
Codecov Report
Merging #9554 into master will decrease coverage by
30.17%. The diff coverage is1.88%.
@@ Coverage Diff @@
## master #9554 +/- ##
===========================================
- Coverage 79.02% 48.85% -30.18%
===========================================
Files 418 276 -142
Lines 123686 95488 -28198
===========================================
- Hits 97748 46646 -51102
- Misses 25938 48842 +22904
| Flag | Coverage Δ | |
|---|---|---|
| #kernel | ? |
|
| #user | 48.85% <1.88%> (-17.97%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 6c7023a...cc9d510. Read the comment docs.
Hi Pavel,
Yes sure, sounds good.
Thanks, Rohan
On Sat, Nov 9, 2019 at 5:51 AM Pavel Snajdr [email protected] wrote:
RHEL7 .fiemap dir inode operation is in struct inode_operations_wrapper, just as .renameat2 - actually these two are the only ones; I have a commit testing for renameat2 being in the wrapper struct, you can see what I'm talking about here: 164ec05#diff-8171081b574f4a45da3be9970a68ce68R684 https://github.com/zfsonlinux/zfs/commit/164ec05f00f27eb1bed716f9427c8e84b4a0d18f#diff-8171081b574f4a45da3be9970a68ce68R684
I could make the test more generic and @rohan-puri https://github.com/rohan-puri you could use it in your patch then too.
@behlendorf https://github.com/behlendorf what do you think?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zfsonlinux/zfs/pull/9554?email_source=notifications&email_token=AADHJKXNGZBTO5X3FTPOH73QSX7APA5CNFSM4JJMERC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDTXVIQ#issuecomment-552041122, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADHJKXI4X6PNHJVDXTWA3LQSX7APANCNFSM4JJMERCQ .
@rohan-puri never mind, I should have verified things before writing anything, it's .tmpfile that would be the other call in the wrapper ops struct, not .fiemap, thus completely irrelevant to this issue. It was a bit late and I've mistaken the two.
@snajpa : ok :)
@rohan-puri Good work taking this on! Could you list the changes you made to the original PR?
Could you also fillin the details of your PR, instead of relying on people looking up the old one? It think mostly a link to the old one and a shameless copy-paste would suffice.
Any update on this? This feature would be incredibly useful.
@rohan-puri @behlendorf @mmaybee Were you using Linux kernel's "generic_block_fiemap" anywhere? Just wondering if you saw no need for it since you're doing something different than I thought. Though I did just notice @adilger's post at https://github.com/openzfs/zfs/issues/264#issuecomment-2636343
And yes I know our friends removed the export free-for all generic_block_fiemap and replaced it with the GPL-only iomap_fiemap On July 27th 2021. (And they're just getting warmed up, hence some of my other recent issues I posted...)
fs: REMOVE generic_block_fiemap https://github.com/torvalds/linux/commit/9acb9c48b9408bbb6ade90e3f3192ee38e2589b3
"FIEMAP cleanups from Christoph transitioning all remaining filesystems supporting FIEMAP (ext2, hpfs) to iomap API and removing the old helper"
https://github.com/torvalds/linux/commit/63b0c403394d4e2644751d090db8a5da80272e50
@jittygitty thanks for taking an interest in this PR, but to be clear the kernel changes you referenced don't have any impact on the ability to implement this feature. What's needed is for someone to address the unresolved feedback provided in the original PR review. This still involves a significant amount of development work which I haven't had the time to pursue. If someone would like to pick up this work, I'm happy to help point them at what still needs to be done.
@behlendorf Yes it would be beneficial to explain the path of the implementation, I did not look at code much since I have only very basic C knowledge so I did not have full grasp if you avoided iomap usage due to exports or if there's no help in using. But either way yes if you could explain what needs to be done, I can try to hire someone to give a crack at it and would be good to point them here and read yours or other knowledgeable contributors explanation, instead as I've done before and try to bang my head and understand some basics to try and gain myself enough info on for example fiemap path for reflink implementing.
I'd rather not torture my head and potentially get things wrong. I had also asked for @pjd to provide what he needs for someone to finish implementing reflinks using "his BRT" method for Linux, I'm not against having two methods compete if one method/BRT is not going to waste anyway since will stay with BSD. Plus I'm not even "sure" if via fiemap I can bypass using BRT for reflinks, and use what, some kind of clone of xfs b+ refcounttree etc, not sure how BRT differs or if its actually same thing since I didn't "study the code of either" and prefer someone just spell it out. I want the feature for linux for myself, and simply wanted to gauge interest to help "crowd-fund" issues such as reflinks and offline-deduplication to get it done faster on linux.
@Ornias1993 I agree with you that there shouldn't be so much fear and avoidance of export_symbol_gpl, I would simply use anything we need. I've given some preliminary reasons why in https://github.com/openzfs/zfs/issues/13415
But I should add that I still prefer being reserved for politeness sake, there's no reason openzfs can't be good friends with Linux Kernel guys, their real hatred is for CLOSED SOURCE binaries, OpenZFS complies with basically "all" of the "REASONS" why they went with GPL when they did, ie their intent. OpenZFS is truly open source.
(Though it is unfortunate with DDN and purchase of Nexenta..., had some questions and complaints to let out, but I'll refrain in case there's openzfs contributors on here from Nexenta that might get upset, plus this is Fiemap issue tracker after all.)
The next steps for FIEMAP I believe are pretty straight forward. They code needs to be rebased against the current master branch and the outstanding feedback from comment in the PR needs to be resolved. In particular, there was one throrny design issue remaining regarding efficiently handling in-flight I/O which still needs to be resolve and that will take some careful work. Beyond that it of course needs to be tested and verified to work correctly. As always, if someone if willing to tackle this remaining work that would be great.
I'm not against having two methods compete if one method/BRT is not going to waste anyway since will stay with BSD.
These are complementary but different features, so developer time/interest permitting, I'd like to see both implemented. The vast majority of the BRT work is platform independent, this is the hard part of implementing the functionality for ZFS. Wiring it up for reflink on Linux should be relatively straight forward and is something we'll want to sort out in that PR.
@behlendorf Ok thanks, I think you missed one of my questions though, and I agree with you in general that cross-platform is better all else being equal. When I started looking into reflinks + offline-dedupe, I came to conclusion to do it via "FIEMAP".
But I thought implementing reflink via FIEMAP was "different" than how @pjd was doing it with his implementation via "BRT". So I wanted your clarification if I was correct on that part, but yea regardless of the answer, I'm all for "both" BRT and Fiemap.
Sorry, I know its kind of a messy brainstorm but if you read my comment in #13349 especially the clonefiles.pdf link (see https://github.com/openzfs/zfs/issues/13349#issuecomment-1113905164 ), I think you'll see my thoughts and can correct me.
Anyway I will contact someone I had in mind to see about taking a shot at finishing this 9554 FIEMAP, but you are saying that the original issue #7545 needs to be looked at, yet is all the code from that one already in this one, ie work in which one now?
What I was looking for though was a short "onboarding" for would be coder on this who does not have prior ZFS knowledge but is simply a good C/C++ coder perhaps with some kernel experience. I wanted to know if there's any "primer" that would be very helpful for them to look at before jumping in on this FIEMAP, since again, imagine its someone who never saw ZFS code.
But I thought implementing reflink via FIEMAP was "different" than how @pjd was doing it with his implementation via "BRT". So I wanted your clarification if I was correct on that part.
I see what you're getting at. I think the confusion here is that FIEMAP is solely about reporting logical/physical file extents. Now user space applications like cp can do some interesting things with that information, but FIEMAP does not provide any reasonable mechanism to implement something like reflink. That's a significantly more challenging problem, which is of course where the BRT work fits in. Or looked at another way, the BRT work enables a performant reflink implementation and the FIEMAP work provides a read-only reporting mechanism to see exactly how the data was laid out.
but you are saying that the original issue https://github.com/openzfs/zfs/pull/7545 needs to be looked at.
The original #7575 has all the relevant review comments and latest fully working version of the code. I'd suggest starting there. Unfortunately, one of the reasons this stalled out is there are some challenging design and implementation details to work out. In particular around reconciling file extents for uncommitted dirty data blocks and pending frees which have been modified in the not-yet-synced-to-disk transaction groups. While I wouldn't want to discourage any one from taking on this work, getting it right is going to be a little bit tricky and require some solid knowledge of the internals.
I wish I could point you at a primer, but I'm not aware of anything along those lines. There is a fair amount of high level design documentation available, but only the code is really going to have the level of detail needed.
@behlendorf Thanks ok I understand someone new to zfs will need read through the code carefully and learn it that way. I will try get someone to take a look, wish I hadn't ditched class to hack gopher systems in library...so I wouldn't have to 'hire' coders.
As far as FIEMAP I knew it provided extents interface to underlying blocks but I thought if we're going to use the Linux VFS ioctl FICLONE, FICLONERANGE and FIDEDUPERANGE, then I thought we had to speak "extents" through those ioctls since I thought they didn't talk blocks and we/zfs weren't extent based already since I saw this FIEMAP issue outstanding. I thought we could leverage existing reflink, dedupe, and copy_file_range code in the Linux kernel via those ficlone/ficlonerange/fideduperange and lessen the work on zfs side. I thought that @pjd had created some "custom ioctl" for zfs maybe and wasn't using ficlone.
BRT I had thought was like the XFS reference count B+tree for reflinks but maybe I'm wrong? At least for zfs I hear it can't be as straight-forward as it seems on XFS? But I can't figure out exactly why, as to what complicates things for us compared to xfs etc.
What was really depressing me a bit much was hearing it would basically be impossible for us to preserve reflinks over zfs send/receive and when I read that supposedly BTRFS "can" preserve their reflinks over btrfs send/receive, I was like "how"?
On a related note, will FIEMAP (and filefrag support) help us get better tools to "defrag" files via maybe background low 'scrub' priority rewriting? (I guess this could be done now with zdb zfs frag check utils?) I'm also looking for a good write-up as to why Block Pointer Rewrite is so difficult and why its really the "only way" some have said, to deal with long-term fragmentation.
Also know of anything about the Panzura ZFS dedupe tech that was supposed to be coming to ZFS? (Apologies for the long post and questions, you can ignore if busy, no hurry. thx)
@behlendorf, if one of the main obstacles for landing FIEMAP support is the "inflight writes" handling (FIEMAP_EXTENT_DELALLOC), a relatively straight forward solution would be to always force FIEMAP_FLAG_SYNC for all files. Not the greatest for performance, but this should be implemented as a no-op for files that don't have any dirty pages (i.e. most files, and should not trigger a needless TXG sync/wait) then it would still be very useful for the majority of cases. For Lustre we ended up doing this because getting pending in-memory write information from all clients is nigh impossible. If there are concurrent write and FIEMAP calls on a single file then there will always be a race between what FIEMAP reports and the current state of the file that cannot be closed from userspace, so I don't think that should be a reasonable goal for this patch.
FIEMAP isn't used so often on still-dirty files, and in the cases that it is most users probably care about the on-disk allocations and not that X MB of unwritten data is still pending in memory. For those few that care about unwritten data, the complete lack of FIEMAP is equally non-functional, and that could be added incrementally as needed after the core on-disk allocation reporting is available.
is there any update on this?
Indeed, any update on this?
rebased (badly, but compiles and works) on commit d98973dbdd5a85b6c8a8556d5bd5c9903e2d2ee6 if anyone wants to clean it up, add back the tests and make a proper PR: fiemap-diff.txt
Note that there is development again on the upstream kernel to add compressed extent support for FIEMAP:
https://lore.kernel.org/linux-fsdevel/[email protected]/ https://lore.kernel.org/linux-fsdevel/2befe2c13065bdf3ca74cb8b701727940310fd2a.1712126039.git.sweettea-kernel@dorminy.me/
The patches so far are using the fe_reserved[0] field for fe_physical_length, which matches this patch usage. It will also set the flag FIEMAP_EXTENT_DATA_COMPRESSED on compressed extents, in addition to FIEMAP_EXTENT_ENCODED:
/* Data is compressed by fs. Sets EXTENT_ENCODED. */
#define FIEMAP_EXTENT_DATA_COMPRESSED 0x00000040
Note that this is all still under discussion and subject to change.