
Dbuf hash revert

Open · behlendorf opened this pull request 1 year ago • 5 comments

Motivation and Context

On systems capable of sustaining a very-high dbuf insertion/removal rate a peak performance regression was observed for some workloads.

Description

This reverts commit 34dbc618f50cfcd392f90af80c140398c38cbcd1. While that change resolved the lock contention observed for certain workloads, it inadvertently reduced the maximum hash inserts/removes per second. This appears to be due to the slightly higher acquisition cost of an rwlock compared to a mutex.

Incorrectly sizing the array of hash locks used to protect the dbuf hash table can lead to contention and reduce performance. We could unconditionally allocate a larger array for the locks, but it's wasteful, particularly for a low-memory system. Instead, dynamically allocate the array of locks and scale it based on total system memory (on x86_64, 64K is used per 1GB of memory).

Additionally, add a new dbuf_mutex_hash_shift module option which can be used to override the hash lock array size. This is disabled by default (dbuf_mutex_hash_shift=0) and can only be set at module load time. The minimum target array size is set to 8192; this matches the current constant value, which means the memory footprint will only be slightly increased on larger memory systems.
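For illustration, here is a minimal userland sketch of the sizing policy described above: a memory-scaled default, an 8192-entry floor, and a dbuf_mutex_hash_shift style override. The constants, helper names, and assumed mutex size are illustrative assumptions, not the actual dbuf.c code.

```c
/*
 * Illustrative sketch of hash-lock array sizing: memory-scaled default,
 * clamped to a minimum of 8192 entries, with a module-option style
 * override.  Constants and helpers are assumptions, not dbuf.c code.
 */
#include <stdint.h>
#include <stdio.h>

#define	DBUF_MIN_HASH_LOCKS	8192		/* matches the old constant */
#define	HASH_LOCK_BYTES_PER_GB	(64 * 1024)	/* ~64K of locks per 1GB (x86_64) */
#define	ASSUMED_MUTEX_SIZE	64		/* illustrative sizeof (kmutex_t) */

/* Round up to the next power of two (illustrative helper). */
static uint64_t
next_pow2(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return (p);
}

static uint64_t
dbuf_hash_locks_count(uint64_t physmem_bytes, unsigned int hash_shift_override)
{
	/* An explicit dbuf_mutex_hash_shift style override wins. */
	if (hash_shift_override != 0)
		return (1ULL << hash_shift_override);

	/* Otherwise scale with total system memory, with an 8192 floor. */
	uint64_t gb = physmem_bytes >> 30;
	uint64_t target = (gb * HASH_LOCK_BYTES_PER_GB) / ASSUMED_MUTEX_SIZE;

	if (target < DBUF_MIN_HASH_LOCKS)
		target = DBUF_MIN_HASH_LOCKS;

	return (next_pow2(target));
}

int
main(void)
{
	/* e.g. a 256 GiB node with no override */
	printf("%llu locks\n", (unsigned long long)
	    dbuf_hash_locks_count(256ULL << 30, 0));
	return (0);
}
```

With these assumed constants, a 256GB node would get 262144 hash locks, while anything at or below roughly 8GB stays at the 8192-entry floor.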

Note that the dbuf hash table count and the mutex array count were added to the /proc/spl/kstat/zfs/dbufstats kstat.

Finally, this change removes the _KERNEL conditional checks. These were not required since, for the user space build, there is no difference between the kmem and vmem interfaces.

How Has This Been Tested?

Locally with ztest and basic sanity I/O testing.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [x] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

behlendorf avatar Sep 19 '22 22:09 behlendorf

minor typos

-    but is wasteful particularly for a low-memory system.  Instead,
+    but it's wasteful particularly for a low-memory system.  Instead,

-    which can be used override the hash lock array size.  This is
+    which can be used to override the hash lock array size.  This is

tonyhutter avatar Sep 20 '22 00:09 tonyhutter

Can you provide a flamegraph before and after? Chances are the rw locks do provide a speedup on their own, but they increase contention elsewhere, and that's where the overall slowdown is coming from.

Is there a one-liner to repro?

mjguzik avatar Sep 20 '22 22:09 mjguzik

Bonus question: is the hashing func any good for this data?

mjguzik avatar Sep 20 '22 23:09 mjguzik

Can you provide a flamegraph before and after? Chances are the rw locks do provide a speedup on their own, but they increase contention elsewhere, and that's where the overall slowdown is coming from.

I'd like to, but unfortunately I don't have access to the system, and the test workload involved running mdtest over Lustre, which is a fairly involved thing to set up. We were able to verify that reverting the original patch resolves the performance regression, so I'd like to move forward with this.

Bonus question: is the hashing func any good for this data?

Now that's an excellent question. In general, yes, but there are some surprising hot spots in the distribution. Even with a very large hash table (256GB of memory on the node) we see 77 or so hash chains of length 2 or more, and a worst case of 63! Again, this is with a create/unlink heavy workload through Lustre. I've also seen unfortunate hash distributions when creating a pool with only a small number of zvols (or just one), since only the block id will vary. When accessed via the ZPL layer I've seen much fairer distributions.

hash_chains                     4    77
hash_chain_max                  4    63

behlendorf avatar Sep 22 '22 00:09 behlendorf

@amotin @freqlabs any remaining concerns? I would like to profile this more carefully, but I'd also like to get this merged to master so we can get it pulled into the 2.1.6 PR.

behlendorf avatar Sep 22 '22 00:09 behlendorf

@amotin thanks, my mistake, I fixed both of those items locally but somehow seem to have pushed the wrong version. Updated.

behlendorf avatar Sep 22 '22 16:09 behlendorf

Can you provide a flamegraph before and after? Chances are the rw locks do provide a speedup on their own, but they increase contention elsewhere, and that's where the overall slowdown is coming from.

I'd like to, but unfortunately I don't have access to the system, and the test workload involved running mdtest over Lustre, which is a fairly involved thing to set up. We were able to verify that reverting the original patch resolves the performance regression, so I'd like to move forward with this.

They should be able to provide flamegraphs on their own then.

Bonus question: is the hashing func any good for this data?

Now that's an excellent question. In general, yes, but there are some surprising hot spots in the distribution. Even with a very large hash table (256GB of memory on the node) we see 77 or so hash chains of length 2 or more, and a worst case of 63! Again, this is with a create/unlink heavy workload through Lustre. I've also seen unfortunate hash distributions when creating a pool with only a small number of zvols (or just one), since only the block id will vary. When accessed via the ZPL layer I've seen much fairer distributions.

hash_chains                     4    77
hash_chain_max                  4    63

Similarly, it should be possible to dump their data so that it can be hashed by other algos for comparison.

mjguzik avatar Sep 22 '22 17:09 mjguzik
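
For anyone wanting to try that, here is a minimal sketch of such an offline comparison, assuming the dumped keys are the dbuf hash inputs (objset pointer, object, level, blkid) written one per line; the input format, table size, and candidate mixer below are illustrative assumptions rather than the actual DBUF_HASH code.

```c
/*
 * Illustrative sketch of the offline comparison suggested above: read
 * dumped dbuf keys (assumed format: "objset object level blkid" per line,
 * objset in hex), bucket them with a candidate hash, and report the same
 * chain statistics dbufstats exposes (chains of length >= 2, max chain).
 */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#define	TABLE_SHIFT	20			/* 1M buckets, illustrative */
#define	TABLE_SIZE	(1ULL << TABLE_SHIFT)

/* Candidate hash: a generic 64-bit mixer (splitmix64-style), not DBUF_HASH. */
static uint64_t
mix64(uint64_t x)
{
	x += 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return (x ^ (x >> 31));
}

int
main(void)
{
	uint32_t *buckets = calloc(TABLE_SIZE, sizeof (uint32_t));
	uint64_t os, obj, level, blkid;
	uint64_t chains = 0, max = 0;

	if (buckets == NULL)
		return (1);

	/* Bucket each key exactly as the in-kernel table would. */
	while (scanf("%" SCNx64 " %" SCNu64 " %" SCNu64 " %" SCNu64,
	    &os, &obj, &level, &blkid) == 4) {
		uint64_t h = mix64(os ^ mix64(obj ^ mix64(level ^ blkid)));
		buckets[h & (TABLE_SIZE - 1)]++;
	}

	/* Count chains of length >= 2 and the longest chain. */
	for (uint64_t i = 0; i < TABLE_SIZE; i++) {
		if (buckets[i] >= 2)
			chains++;
		if (buckets[i] > max)
			max = buckets[i];
	}
	printf("hash_chains     %" PRIu64 "\nhash_chain_max  %" PRIu64 "\n",
	    chains, max);
	free(buckets);
	return (0);
}
```

Running several candidate hashes over the same dump and comparing the reported hash_chains / hash_chain_max values would give a direct apples-to-apples comparison.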

No argument here, we plan to profile this further and will have access to a similar setup in the coming months.

behlendorf avatar Sep 22 '22 18:09 behlendorf