
DAOS-11227 pool: Retry map_refresh on more errors

Open liw opened this issue 3 years ago • 1 comments

POOL_TGT_QUERY_MAP RPCs may encounter a large number of different remote errors. For instance, from a restarted-but-not-yet-reintegrated engine, we may get -DER_NO_HDL; Zhao Zhen has also observed -DER_OOG. For those errors that are not normally retryable, this patch lets map_refresh retry a limited number of times and fall back to dc_pool_query:

  • To use dc_pool_query, dc_pool_create_map_refresh_task has to take a pool handle instead of a dc_pool object.

  • Tune the backoff sequence of a map_refresh task a bit for hopefully better scalability.

  • Add a new daos_test case for the new fallback mechanism, and bump the corresponding test timeout a little, since the new test involves a rebuild/reintegration cycle.
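The retry-then-fall-back behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: every `sk_`/`SK_` name, the error values, the doubling backoff policy, and the 1000 us cap are assumptions made for the example. `sk_tgt_query` stands in for a POOL_TGT_QUERY_MAP RPC and `sk_pool_query` for the dc_pool_query fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical error codes for illustration only; not the real DAOS values. */
enum { SK_OK = 0, SK_TIMEDOUT = -1, SK_NO_HDL = -2 };

/* Fake state standing in for a remote engine and the pool service. */
struct sk_state {
	int fail_rc;        /* error the target query returns while failing */
	int fails_left;     /* number of target queries that still fail */
	int tgt_calls;      /* target queries issued so far */
	int fallback_calls; /* fallback queries issued so far */
};

/* Stands in for one POOL_TGT_QUERY_MAP RPC. */
static int sk_tgt_query(struct sk_state *s)
{
	s->tgt_calls++;
	if (s->fails_left > 0) {
		s->fails_left--;
		return s->fail_rc;
	}
	return SK_OK;
}

/* Stands in for the dc_pool_query fallback; always succeeds in this sketch. */
static int sk_pool_query(struct sk_state *s)
{
	s->fallback_calls++;
	return SK_OK;
}

/* Double the delay up to a cap (assumed policy; the patch's sequence may differ). */
static uint32_t sk_next_backoff(uint32_t cur_us, uint32_t max_us)
{
	return cur_us > max_us / 2 ? max_us : cur_us * 2;
}

/*
 * Retry the target query. Errors that are normally retryable (only
 * SK_TIMEDOUT in this sketch) are retried without counting; any other
 * error is retried at most max_unusual times before falling back.
 * max_attempts bounds the loop so the sketch always terminates; if it
 * is exhausted, SK_TIMEDOUT is returned.
 */
static int sk_map_refresh(struct sk_state *s, int max_unusual, int max_attempts,
			  uint32_t *backoff_us)
{
	int unusual = 0;
	int i;

	assert(max_unusual >= 0 && max_attempts > 0);
	*backoff_us = 1;
	for (i = 0; i < max_attempts; i++) {
		int rc = sk_tgt_query(s);

		if (rc == SK_OK)
			return SK_OK;
		if (rc != SK_TIMEDOUT && ++unusual > max_unusual)
			return sk_pool_query(s);
		/* A real task would reschedule itself after *backoff_us. */
		*backoff_us = sk_next_backoff(*backoff_us, 1000);
	}
	return SK_TIMEDOUT;
}
```

With `max_unusual` set to 3, an engine that keeps answering with the stand-in `SK_NO_HDL` is queried four times before the fallback runs once, while a transient failure that clears within the limit never reaches the fallback.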

Signed-off-by: Li Wei [email protected]
Required-githooks: true

liw, Aug 09 '22 11:08

Bug-tracker data:
Ticket title is 'dfuse got error DER_NO_PERM after kill one engine'
Status is 'In Progress'
Labels: 'daily_test,triaged'
https://daosio.atlassian.net/browse/DAOS-11227

github-actions[bot], Aug 09 '22 11:08