
DAOS-11146 client: reduce lock contention on hash lock

Open wangshilong opened this issue 3 years ago • 46 comments

1. Switch the single global lock to per-bucket locks, to reduce lock contention across multiple threads.

2. Switch the spinlock to a read/write lock on the client side, with the following adjustments to make the read/write lock work:

 2.1: Use an atomic counter to implement rl_op_addref()/rl_op_decref(), because getref/putref switch to the read lock.
 2.2: Remove the D_HASH_FT_LRU feat; otherwise we could not hold the read lock for the find operation.

3. Refine d_hhash_link_insert() - for the PTR-type handle hash table it need not take any extra lock. For the non-PTR type, refine d_hash_rec_insert_anonym() and add a feat bit, D_HASH_FT_NO_KEYINIT_LOCK, set for d_hhash, to avoid taking all buckets' locks. Also add a simple test of obj_open perf; in testing on boro, obj_open operations per second increased from 427.36 OPs/s to 313038.03 OPs/s.
     (./daos_test -i -u 46, only one engine and client is enough) (From Xuezhao)
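
In rough outline, the new locking scheme looks like the following minimal sketch (hypothetical names and simplified structures, not the actual src/gurt/hash.c code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS (64 * 1024)  /* 64K buckets, as benchmarked below */

struct hash_record {
        uint64_t            key;
        atomic_uint         refcount;  /* 2.1: refs via atomics, not the lock */
        struct hash_record *next;
};

struct hash_bucket {
        pthread_rwlock_t    lock;      /* 1: per-bucket rwlock, not one global spinlock */
        struct hash_record *head;
};

struct hash_table {
        /* each bucket's lock is assumed initialized with pthread_rwlock_init() */
        struct hash_bucket buckets[NUM_BUCKETS];
};

/* Lookup takes only its bucket's READ lock (2), so lookups from many
 * threads proceed in parallel instead of serializing on one spinlock. */
static struct hash_record *
hash_lookup_addref(struct hash_table *ht, uint64_t key)
{
        struct hash_bucket *bkt = &ht->buckets[key % NUM_BUCKETS];
        struct hash_record *rec;

        pthread_rwlock_rdlock(&bkt->lock);
        for (rec = bkt->head; rec != NULL; rec = rec->next) {
                if (rec->key == key) {
                        /* safe under the read lock because refcount is atomic */
                        atomic_fetch_add(&rec->refcount, 1);
                        break;
                }
        }
        pthread_rwlock_unlock(&bkt->lock);
        return rec;
}
```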

Single-client fio 4k random write testing (4 DAOS servers):

              before        after
64 threads    26.5k IOPS    173k IOPS
64 processes  2179k IOPS    2374k IOPS

Signed-off-by: Wang Shilong [email protected]
Signed-off-by: Xuezhao Liu [email protected]

wangshilong avatar Apr 13 '22 01:04 wangshilong

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/2/execution/node/376/log

daosbuild1 avatar Apr 15 '22 15:04 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/2/execution/node/344/log

daosbuild1 avatar Apr 15 '22 15:04 daosbuild1

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/2/execution/node/394/log

daosbuild1 avatar Apr 15 '22 15:04 daosbuild1

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/2/execution/node/391/log

daosbuild1 avatar Apr 15 '22 15:04 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8704/3/display/redirect

daosbuild1 avatar Apr 18 '22 04:04 daosbuild1

Test stage Functional Hardware Small completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8704/3/display/redirect

daosbuild1 avatar Apr 18 '22 04:04 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8704/3/display/redirect

daosbuild1 avatar Apr 18 '22 05:04 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/5/execution/node/122/log

daosbuild1 avatar Apr 28 '22 09:04 daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/6/execution/node/404/log

daosbuild1 avatar Apr 28 '22 15:04 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/6/execution/node/358/log

daosbuild1 avatar Apr 28 '22 15:04 daosbuild1

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/6/execution/node/389/log

daosbuild1 avatar Apr 28 '22 15:04 daosbuild1

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/6/execution/node/386/log

daosbuild1 avatar Apr 28 '22 15:04 daosbuild1

Bug-tracker data: Ticket title is 'Single DAOS client IOPS performance benchmarking'. Status is 'In Review'. Job should run at elevated priority (2). https://daosio.atlassian.net/browse/DAOS-11146

github-actions[bot] avatar Jul 15 '22 14:07 github-actions[bot]

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/7/execution/node/371/log

daosbuild1 avatar Jul 15 '22 14:07 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/7/execution/node/368/log

daosbuild1 avatar Jul 15 '22 15:07 daosbuild1

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/7/execution/node/364/log

daosbuild1 avatar Jul 15 '22 15:07 daosbuild1

@wangshilong You need to address the compilation issues on several OSes.

wiliamhuang avatar Jul 18 '22 03:07 wiliamhuang

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/9/execution/node/119/log

daosbuild1 avatar Aug 04 '22 13:08 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/11/execution/node/144/log

daosbuild1 avatar Aug 05 '22 08:08 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/12/execution/node/138/log

daosbuild1 avatar Aug 08 '22 01:08 daosbuild1

> The atomics side looks good, but I'm trying to work through the other implications.
>
> This will mean the "client" hash table (I'm not sure where that's used) will now use RW locks, so lookups will happen in parallel and hence need atomics, but decref will potentially need to modify, so those all happen serially?

The main change here is the client handle hash lock, which is used heavily by the event queue. Event queue lookup and polling are very hot operations during I/O, so holding a spinlock really hurts performance, and the problem is very obvious with multiple threads because the handle hash table is per process.

"Atomics" might be better than a spinlock? I am not sure, but benchmarking still showed a performance jump from 26.5k to 173k IOPS, so I guess this is a good direction for DFS.

> I've been looking at something similar for dfuse in #8855, trying to move to RW locks, but the downside is that there's no LRU. For dfuse we also have some extra data: in most cases the kernel holds a reference, so any decref that's not from the kernel cannot cause a free. In terms of RW locking, we can do most decrefs with R locks except on one or two code paths; this could change things dramatically, but it means changes to the htable API as well as dropping the reference before replying to kernel requests.

Agreed, dfuse might be a different story, and I think this PR doesn't touch the dfuse parts.
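
Back to the decref question above, a minimal sketch (continuing the hypothetical structures from the description; a real implementation needs more care around the revive/free race): the common decref is a plain atomic decrement with no bucket lock, and only a final release that unlinks the record takes the bucket's write lock:

```c
#include <stdlib.h>  /* free() */

/* Sketch only: drop one reference; only a caller releasing the last
 * reference takes the bucket's WRITE lock, to unlink the record. */
static void
hash_decref(struct hash_table *ht, struct hash_record *rec)
{
        struct hash_bucket *bkt = &ht->buckets[rec->key % NUM_BUCKETS];
        struct hash_record **pp;

        /* fast path: not the last reference, no bucket lock at all */
        if (atomic_fetch_sub(&rec->refcount, 1) > 1)
                return;

        pthread_rwlock_wrlock(&bkt->lock);
        /* re-check: a concurrent lookup may have revived the record
         * while we were waiting for the write lock */
        if (atomic_load(&rec->refcount) != 0) {
                pthread_rwlock_unlock(&bkt->lock);
                return;
        }
        for (pp = &bkt->head; *pp != NULL; pp = &(*pp)->next) {
                if (*pp == rec) {
                        *pp = rec->next;
                        break;
                }
        }
        pthread_rwlock_unlock(&bkt->lock);
        free(rec);
}
```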

wangshilong avatar Aug 08 '22 01:08 wangshilong

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/14/execution/node/145/log

daosbuild1 avatar Aug 08 '22 09:08 daosbuild1

Switching the spinlock to a read/write lock on the client side and using ATOMIC for addref/decref looks fine to me. For "switch one global lock to per bucket lock", one concern is that daos_hhash_link_insert() will be affected heavily: it changes from taking 1 global lock to taking 64K bucket locks for every daos_hhash_link_insert() operation, which is a pretty high number. I am not sure how deeply it will affect performance, for example obj_open() perf. If we still used a global lock there, would it affect performance much?

On the other hand, I'm not very sure why d_hhash_link_insert() needs to hold all buckets' locks. Just because it does not know hlink->hl_key? For the handle hash table (struct d_hhash), the key is just the 64-bit ch_cookie (see hh_op_key_init()); if ch_cookie is changed to be ATOMIC, then d_hhash's ch_key_init() need not be protected by a lock, and it seems we can avoid taking all buckets' locks in d_hhash_link_insert(). But to differentiate d_hhash from other hash tables, we may need to define another feat bit like "D_HASH_FT_NO_KEYINIT_LOCK" or something else.

wangshilong avatar Aug 09 '22 08:08 wangshilong

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/17/execution/node/138/log

daosbuild1 avatar Aug 09 '22 08:08 daosbuild1

> Switching the spinlock to a read/write lock on the client side and using ATOMIC for addref/decref looks fine to me. For "switch one global lock to per bucket lock", one concern is that daos_hhash_link_insert() will be affected heavily: it changes from taking 1 global lock to taking 64K bucket locks for every daos_hhash_link_insert() operation, which is a pretty high number. I am not sure how deeply it will affect performance, for example obj_open() perf. If we still used a global lock there, would it affect performance much?

Yup, I tried to reduce the number of bucket locks to 64/128 rather than 64K, because it looked like too many to me too, but testing showed that 64K buckets perform best. For this specific fio test case, without bucket locks fio dropped to 50%-70% of the maximum performance.

Do you have tools/ways to benchmark obj_open() performance?

> On the other hand, I'm not very sure why d_hhash_link_insert() needs to hold all buckets' locks. Just because it does not know hlink->hl_key? For the handle hash table (struct d_hhash), the key is just the 64-bit ch_cookie (see hh_op_key_init()); if ch_cookie is changed to be ATOMIC, then d_hhash's ch_key_init() need not be protected by a lock, and it seems we can avoid taking all buckets' locks in d_hhash_link_insert(). But to differentiate d_hhash from other hash tables, we may need to define another feat bit like "D_HASH_FT_NO_KEYINIT_LOCK" or something else.

Yup, this is a good point, I can try it, but I guess it is not super helpful for the fio cases, since write locks are now held in only a few cases.
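
A minimal sketch of that suggestion (hypothetical names, continuing the simplified structures sketched in the description above): generate the ch_cookie-style key with an atomic fetch-and-add, so key init is lock-free and an insert only locks its one destination bucket:

```c
/* lock-free cookie generation: key init needs no lock at all */
static atomic_uint_fast64_t next_cookie = 1;

static uint64_t
cookie_key_init(void)
{
        return (uint64_t)atomic_fetch_add(&next_cookie, 1);
}

/* Insert locks only the destination bucket, never all 64K buckets. */
static void
hash_insert(struct hash_table *ht, struct hash_record *rec)
{
        rec->key = cookie_key_init();
        atomic_init(&rec->refcount, 1);  /* caller starts with one reference */

        struct hash_bucket *bkt = &ht->buckets[rec->key % NUM_BUCKETS];

        pthread_rwlock_wrlock(&bkt->lock);
        rec->next = bkt->head;
        bkt->head = rec;
        pthread_rwlock_unlock(&bkt->lock);
}
```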

wangshilong avatar Aug 09 '22 08:08 wangshilong

> Yup, this is a good point, I can try it, but I guess it is not super helpful for the fio cases, since write locks are now held in only a few cases.

I am fine to land it as-is for now, but I think that if each "obj_open() -> daos_hhash_link_insert()" needs to take 64K bucket locks, that looks not very good and unnecessary. It can be refined in a following patch; I can work on a patch as I suggested later if needed. To be clear, I mean it is fine to use per-bucket locks, just that we may try to avoid taking all buckets' locks for hhash's link insert if possible.

liuxuezhao avatar Aug 09 '22 08:08 liuxuezhao

> Yup, this is a good point, I can try it, but I guess it is not super helpful for the fio cases, since write locks are now held in only a few cases.

> I am fine to land it as-is for now, but I think that if each "obj_open() -> daos_hhash_link_insert()" needs to take 64K bucket locks, that looks not very good and unnecessary. It can be refined in a following patch; I can work on a patch as I suggested later if needed.

I indeed have a local patch to do this; I could push it.

wangshilong avatar Aug 09 '22 08:08 wangshilong

> I indeed have a local patch to do this; I could push it.

That's great, thanks.

liuxuezhao avatar Aug 09 '22 08:08 liuxuezhao

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8704/18/execution/node/139/log

daosbuild1 avatar Aug 09 '22 09:08 daosbuild1

> Yup, this is a good point, I can try it, but I guess it is not super helpful for the fio cases, since write locks are now held in only a few cases.

> I am fine to land it as-is for now, but I think that if each "obj_open() -> daos_hhash_link_insert()" needs to take 64K bucket locks, that looks not very good and unnecessary. It can be refined in a following patch; I can work on a patch as I suggested later if needed. To be clear, I mean it is fine to use per-bucket locks, just that we may try to avoid taking all buckets' locks for hhash's link insert if possible.

@liuxuezhao I pushed a PR to limit the number of bucket locks. I wanted to re-benchmark it, but unfortunately I did not have the environment again; as far as I remember, with the limiting PR performance drops a bit compared to the maximum I could get. But on rethinking, I agree 64K doesn't make too much sense.

Let's pursue your follow-up idea if it really becomes a performance issue, as I am not in favor of adding more feats now unless we confirm there is a performance issue there.

wangshilong avatar Aug 09 '22 09:08 wangshilong