
WIP - ZAP Shrinking

Open sdimitro opened this issue 3 years ago • 2 comments

Ported from:

  • https://illumos.topicbox-beta.com/groups/zfs/Tdc09a8d439d09c9f
  • https://code.illumos.org/c/illumos-gate/+/1580/11/usr/src/uts/common/fs/zfs/zap.c
/*
 * Shrinking Algorithm:
 * 1. Check if a sibling leaf exists.
 * 2. Check if the sibling leaf is empty.
 * 3. If the sibling bit (sl_bit) of the initial leaf is not 0, release it.
 * To avoid deadlock, leaves must always be dereferenced in the same
 * order: the leaf with sl_bit 0 first, then the leaf with sl_bit 1.
 * 4. Upgrade zapdir lock to WRITER (once).
 * 5. Deref leaves if needed.
 * 6. Recheck both leaves if required.
 * 7. Update the ptrtbl pointers of the sibling leaf (sl_bit 1) to point to
 * the initial leaf (sl_bit 0).
 * 8. Free disk space of the sibling leaf (dmu_free_range).
 * 9. Update the leaf prefix and prefix_len.
 * 10. Repeat the procedure from the beginning on the updated leaf.
 *
 *		+---------------+
 *		| fzap_remove() |
 *		+---------------+
 *			|
 *			v
 *		+---------------+
 *		| zap_shrink()  |
 *		+---------------+
 *		        |
 *		        v
 *		+================+
 *	        < is leaf empty? >---(no)---> OUT
 *		+================+
 *			|
 *		      (yes)
 *		        |
 *	+------->-------+
 *	|	        |
 *	|	        v
 *	|	+---------------------------+
 *	|	| check_sibling_by_ptrtbl() |
 *	|	+---------------------------+
 *	|	        |
 *	|	        v
 *	|	+====================+
 *	|       < sibl. leaf exists? >---(no)---> OUT
 *	|	+====================+
 *	|	      (yes)
 *	|	        |
 *	|	        v
 *	|	+---------------------------+
 *	|	| deref sibl. leaf (READER) |
 *	|	+---------------------------+
 *	|	        |
 *	|	        v
 *	|	+===================+
 *	|       < is sibling empty? >---(no)---> OUT
 *	|	+===================+
 *	|		|
 *	|	      (yes)
 *	|	        |
 *	|	        v
 *	|	+-------------------------------------+
 *	|	| put sibl. leaf cause we need writer |
 *	|	+-------------------------------------+
 *	|		|
 *	|		v
 *	|	+============================+		+-----------------+
 *	|        < do we hold zap as WRITER? >--(no)--> | tryupgradedir() |
 *	|	+============================+		+-----------------+
 *	|		|				       |
 *	|		|				       v
 *	|		|				+==============+
 *	|		|<--------(yes)-----------------<   success?   >
 *	|		|				+==============+
 *	|		|				       |
 *	|		|				      (no)
 *	|		|				       |
 *	|		|				       v
 *	|		|				+--------------+
 *	|		|				| upgrade dir  |
 *	|		|				+--------------+
 *	|		|				       |
 *	|		|				       v
 *	|		|<-------------------------------------+
 *	|		|
 *	|	+-----------------------------------------------+
 *	|	| swap leaf hashes if initial leaf had slbit==1 |
 *	|	| make sure: l (slbit==0), sl (slbit==1)        |
 *	|	+-----------------------------------------------+
 *	|		|
 *	|		v
 *	|	+--------------------------------------+
 *	|	| deref init. leaf (WRITER) if required|
 *	|	+--------------------------------------+
 *	|		|
 *	|		v
 *	|	+---------------------------+
 *	|	| deref sibl. leaf (WRITER) |
 *	|	+---------------------------+
 *	|		|
 *	|		v
 *	|	+===========================================+
 *	|	< (recheck) both leaves are empty siblings? >--(no)--> OUT
 *	|	+===========================================+
 *	|		|
 *	|	      (yes)
 *	|	        |
 *	|	        v
 *	|	+----------------------------------+
 *	|	| update sibling leaf ptrtbl range |
 *	|	| to point to initial leaf	   |
 *	|	+----------------------------------+
 *	|		|
 *	|		v
 *	|	+--------------------------------------+
 *	|	| free disk space for the sibling leaf |
 *	|	+--------------------------------------+
 *	|		|
 *	|		v
 *	|	+---------------------------------------+
 *	|	| update initial leaf prefix/prefix_len |
 *	|	| (now this leaf goes to another level,	|
 *	|	|  and it may have another sibling)	|
 *	|	+---------------------------------------+
 *	|		|
 *	+---------------+
 */
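
For readers new to the pointer-table layout, steps 1, 2, 7, 9 and 10 of the comment above can be sketched as a toy in-memory model. This is an illustration only, not code from the patch: all `toy_*` names, the 8-slot table and the `nentries` field are assumptions made for the example, and locking, dereferencing and `dmu_free_range()` are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every name here is hypothetical, not the zap.c API. */
#define	TOY_SHIFT	3			/* pointer table has 2^3 slots */
#define	TOY_SLOTS	(1u << TOY_SHIFT)

typedef struct toy_leaf {
	uint64_t prefix;	/* top prefix_len bits of the hash */
	unsigned prefix_len;	/* number of hash bits this leaf owns */
	unsigned nentries;	/* 0 means the leaf is empty */
} toy_leaf_t;

typedef struct toy_zap {
	toy_leaf_t *ptrtbl[TOY_SLOTS];
} toy_zap_t;

/* First pointer-table slot covered by (prefix, prefix_len). */
static unsigned
toy_first_slot(uint64_t prefix, unsigned prefix_len)
{
	return ((unsigned)(prefix << (TOY_SHIFT - prefix_len)));
}

/* Number of pointer-table slots covered by a leaf of this depth. */
static unsigned
toy_nslots(unsigned prefix_len)
{
	return (1u << (TOY_SHIFT - prefix_len));
}

/* Step 1: the sibling differs only in the last prefix bit. */
static toy_leaf_t *
toy_sibling(toy_zap_t *zap, const toy_leaf_t *l)
{
	if (l->prefix_len == 0)
		return (NULL);	/* a single leaf owns the whole table */
	toy_leaf_t *sl =
	    zap->ptrtbl[toy_first_slot(l->prefix ^ 1, l->prefix_len)];
	/* A deeper leaf in that range means there is no sibling leaf. */
	return (sl->prefix_len == l->prefix_len ? sl : NULL);
}

/* Steps 2, 7 and 9: one merge pass; returns the survivor or NULL. */
static toy_leaf_t *
toy_merge_once(toy_zap_t *zap, toy_leaf_t *l)
{
	toy_leaf_t *sl = toy_sibling(zap, l);
	if (l->nentries != 0 || sl == NULL || sl->nentries != 0)
		return (NULL);
	if (l->prefix & 1) {	/* make sure l has sibling bit 0 */
		toy_leaf_t *tmp = l;
		l = sl;
		sl = tmp;
	}
	unsigned first = toy_first_slot(sl->prefix, sl->prefix_len);
	for (unsigned i = 0; i < toy_nslots(sl->prefix_len); i++)
		zap->ptrtbl[first + i] = l;	/* repoint sibling range */
	/* The real code would free the sibling's disk block here. */
	l->prefix >>= 1;
	l->prefix_len--;
	return (l);
}

/* Step 10: keep merging while the surviving leaf has an empty sibling. */
static toy_leaf_t *
toy_shrink(toy_zap_t *zap, toy_leaf_t *l)
{
	toy_leaf_t *merged;
	while ((merged = toy_merge_once(zap, l)) != NULL)
		l = merged;
	return (l);
}
```

In this model, calling `toy_shrink()` on an empty leaf collapses it with its empty sibling one level at a time and stops at the first level where the sibling is missing or still holds entries.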

sdimitro avatar Oct 25 '22 23:10 sdimitro

This is awesome! If I understood @ahrens correctly at the OpenZFS DevSummit Hackathon 2022, this could finally fix the long waits for a simple "ls" on directories that contained many files in the past. If this gets merged, will it only work for newly created directories, or also for existing directories after upgrading to a ZFS release that contains this feature?

Can't wait to get this upstreamed! :) Thanks in advance to all contributors!

Referencing a github discussion regarding zap shrinking: https://github.com/openzfs/zfs/discussions/8420

jumbi77 avatar Oct 30 '22 08:10 jumbi77

@sdimitro Sorry to bother you, but could you address the feedback and rebase this? I am really looking forward to it. @behlendorf Could you ping some additional reviewers if required (or yourself)?

Many thanks in advance to all participants!

jumbi77 avatar Nov 12 '22 20:11 jumbi77

Politely pinging @amotin, given the recent work on the ZAP code, to maybe bring more attention and review to this, just in case iX is interested.

Getting this integrated would be awesome. Anyway, many thanks!

jumbi77 avatar Jan 19 '23 19:01 jumbi77

@jumbi77 It is interesting, but so far I've worked with MicroZAPs, not so much with the FatZAPs handled here, as far as I can see. I'd need to dig deeper into it. @sdimitro Is this PR abandoned, or do you plan to return to it?

amotin avatar Jan 19 '23 19:01 amotin

@amotin Feel free to pick this up!

BTW I was planning on trying out a few other designs, as this PR is not really my code (I uncovered it from an old illumos PR). Maybe something along the lines of recreating the whole ZAP once too many entries are gone (1/4?), potentially converting it to a microZAP too.

sdimitro avatar Jan 19 '23 20:01 sdimitro

@allanjude I saw your recent presentation at the June 2023 OpenZFS leadership meeting about the dedup rework. There, as far as I understood, you also mentioned some ZAP optimizations, including ZAP shrinking. Do you plan to use this PR, or would you consider finishing it before, or separately from, the dedup work? Getting ZAP shrinking and the other optimizations would be great. In any case, many thanks.

jumbi77 avatar Jul 04 '23 20:07 jumbi77

We expect to post an updated version of ZAP shrinking in the next week or two.

allanjude avatar Jul 05 '23 00:07 allanjude

> We expect to post an updated version of ZAP shrinking in the next week or two.

Hello @allanjude, may I ask how this is progressing? Could you give an update?

In any case, many thanks for working on ZFS!

jumbi77 avatar Sep 16 '23 12:09 jumbi77

I'd just like to say here, that this code seems solid so far. We've been running it (after a few testing rounds) in production for a while now, no problems to chase, just works. I think there are quite a number of users, who would benefit from this being upstreamed. Not sure it's worth the wait for a newer version... That can eventually still land in master even after this gets merged, can't it?

(also sorry for the spam, I'll remove the reference to this PR from the commit)

snajpa avatar Jan 23 '24 22:01 snajpa

@snajpa it's good to know this is holding up well in your testing. There's still some outstanding feedback to tackle, that work just needs to be picked up by someone and a fresh PR opened.

behlendorf avatar Jan 26 '24 22:01 behlendorf

> @snajpa it's good to know this is holding up well in your testing. There's still some outstanding feedback to tackle, that work just needs to be picked up by someone and a fresh PR opened.

Klara's improved version of ZAP shrinking should get a pull request before the end of February.

allanjude avatar Jan 27 '24 13:01 allanjude

Replaced by #15888

behlendorf avatar Feb 15 '24 19:02 behlendorf