DAOS-15998 pool: Let all pool_svc rest on ds_pool

Open liw opened this issue 1 year ago • 2 comments

Currently, only leader pool_svc objects depend on ds_pool objects. This unintentionally allows a pool service (PS) replica to be created even while the local ds_pool object is being stopped, which leads to the hang described in the Jira ticket. This patch makes every follower pool_svc object depend on the ds_pool object too, so that if the ds_pool object is stopping or doesn't exist, creating the pool_svc object fails.

  • Initialize pool_svc.ps_pool when allocating pool_svc, rather than when stepping up.

    • Move ps_pool up in pool_svc, as it's no longer a leader-only field.
    • Modify init_svc_pool accordingly and rename it to update_svc_pool.
  • Move ds_pool_svc_stop into ds_pool_stop, because we want to set ds_pool.sp_stopping before stopping the PS (if any).

    • Move the PS start code into ds_pool_start for symmetry.
    • Clean up the error handling in and for ds_pool_svc_stop.
  • Change ds_pool_stop_all to stop all pools concurrently.

    • Remove ds_rsvc_stop_all, as it's no longer used anywhere.
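
The dependency and stop ordering described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the DAOS implementation: every type and function name below (`model_pool`, `model_svc`, `model_pool_get`, `model_svc_alloc`, `model_svc_free`, `model_pool_stop`) is hypothetical, and the error codes are ordinary errno values rather than DAOS return codes. Only the field names `ps_pool` and the idea of a stopping flag (like `ds_pool.sp_stopping`) come from the description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ds_pool: just the state this sketch needs. */
struct model_pool {
	bool started;  /* the pool object exists and has been started */
	bool stopping; /* plays the role of ds_pool.sp_stopping */
	int  refs;     /* references held by dependents such as a pool_svc */
};

/* Hypothetical stand-in for pool_svc; ps_pool is set at allocation time. */
struct model_svc {
	struct model_pool *ps_pool;
};

/* Take a reference on the pool unless it is missing or already stopping. */
static int model_pool_get(struct model_pool *pool, struct model_pool **out)
{
	if (pool == NULL || !pool->started)
		return -ENOENT;
	if (pool->stopping)
		return -EBUSY;
	pool->refs++;
	*out = pool;
	return 0;
}

/*
 * Allocate a replica (leader or follower). Because the pool reference is
 * taken here rather than at step-up, allocation fails when the pool is
 * stopping or absent.
 */
static int model_svc_alloc(struct model_pool *pool, struct model_svc *svc)
{
	int rc;

	assert(svc != NULL);
	rc = model_pool_get(pool, &svc->ps_pool);
	if (rc != 0)
		svc->ps_pool = NULL;
	return rc;
}

/* Drop the replica's reference on its pool, if it holds one. */
static void model_svc_free(struct model_svc *svc)
{
	if (svc->ps_pool != NULL) {
		svc->ps_pool->refs--;
		svc->ps_pool = NULL;
	}
}

/* Stop a pool: set the stopping flag first, then stop the replica (if any). */
static void model_pool_stop(struct model_pool *pool, struct model_svc *svc)
{
	pool->stopping = true; /* from here on, no new replica can attach */
	if (svc != NULL)
		model_svc_free(svc);
	pool->started = false;
}
```

In this model, a replica allocated against a started pool holds one reference, `model_pool_stop` releases it after setting the flag, and any later `model_svc_alloc` against that pool returns an error instead of creating a replica that would outlive the pool.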

Features: pool

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

liw avatar Jul 02 '24 07:07 liw

Ticket title is '2-./rebuild/cascading_failures.py:RbldCascadingFailures.test_sequential_failures test fails during system stop with *control.SystemStopReq request timed out after 5m0s"' Status is 'In Progress' Labels: 'ci_impact,scrubbed_2.8,triaged,weekly_test' https://daosio.atlassian.net/browse/DAOS-15998

github-actions[bot] avatar Jul 02 '24 07:07 github-actions[bot]

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14681/1/testReport/

[liw] https://daosio.atlassian.net/browse/DAOS-16153

daosbuild1 avatar Jul 02 '24 08:07 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14681/1/execution/node/1557/log

[liw] https://daosio.atlassian.net/browse/DAOS-15608

daosbuild1 avatar Jul 07 '24 18:07 daosbuild1

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14681/1/testReport/

[liw] https://daosio.atlassian.net/browse/DAOS-16153

@liw That doesn't look like the same issue to me. The ticket is for an NLT timeout, but this failure was caused by valgrind errors.

daltonbohning avatar Jul 16 '24 14:07 daltonbohning

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14681/1/testReport/ [liw] https://daosio.atlassian.net/browse/DAOS-16153

@liw That doesn't look like the same issue to me. The ticket is for an NLT timeout, but this failure was caused by valgrind errors.

@daltonbohning, you're right. I think I've figured out how to reach the errors, and they look like https://daosio.atlassian.net/browse/DAOS-16228. In any case, they are very unlikely to be related to this PR, which only changes engine-side code.

  <unique>0xe2</unique>
  <tid>5</tid>
  <kind>InvalidWrite</kind>
  <what>Invalid write of size 4</what>
  <stack>
    <frame>
      <ip>0xC97F65</ip>
      <obj>/opt/daos/bin/daos</obj>
      <fn>__tsan_go_atomic32_store</fn>
    </frame>
    <frame>
      <ip>0x494679</ip>
      <obj>/opt/daos/bin/daos</obj>
      <fn>racecall</fn>
      <dir>/usr/src/runtime</dir>
      <file>race_amd64.s</file>
      <line>396</line>
    </frame>
  </stack>
  <auxwhat>Address 0xc00035a6c8 is in a rw- anonymous segment</auxwhat>
  <suppression>
    <sname>insert_a_suppression_name_here</sname>
    <skind>Memcheck:Addr4</skind>
    <sframe> <fun>__tsan_go_atomic32_store</fun> </sframe>
    <sframe> <fun>racecall</fun> </sframe>
    <rawtext>
<![CDATA[
{
   <insert_a_suppression_name_here>
   Memcheck:Addr4
   fun:__tsan_go_atomic32_store
   fun:racecall
}
]]>
    </rawtext>
  </suppression>

  <unique>0x51a</unique>
  <tid>5</tid>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 4</what>
  <stack>
    <frame>
      <ip>0xC98A93</ip>
      <obj>/opt/daos/bin/daos</obj>
      <fn>__tsan_go_atomic32_load</fn>
    </frame>
    <frame>
      <ip>0x494679</ip>
      <obj>/opt/daos/bin/daos</obj>
      <fn>racecall</fn>
      <dir>/usr/src/runtime</dir>
      <file>race_amd64.s</file>
      <line>396</line>
    </frame>
  </stack>
  <auxwhat>Address 0xc00035a6c8 is in a rw- anonymous segment</auxwhat>
  <suppression>
    <sname>insert_a_suppression_name_here</sname>
    <skind>Memcheck:Addr4</skind>
    <sframe> <fun>__tsan_go_atomic32_load</fun> </sframe>
    <sframe> <fun>racecall</fun> </sframe>
    <rawtext>
<![CDATA[
{
   <insert_a_suppression_name_here>
   Memcheck:Addr4
   fun:__tsan_go_atomic32_load
   fun:racecall
}
]]>
    </rawtext>
  </suppression>

  <unique>0x51b</unique>
  <tid>5</tid>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 4</what>
  <stack>
    <frame>
      <ip>0xC98B2E</ip>
      <obj>/opt/daos/bin/daos</obj>
      <fn>__tsan_go_atomic32_load</fn>
    </frame>
    <frame>
      <ip>0x494679</ip>
      <obj>/opt/daos/bin/daos</obj>
      <fn>racecall</fn>
      <dir>/usr/src/runtime</dir>
      <file>race_amd64.s</file>
      <line>396</line>
    </frame>
  </stack>
  <auxwhat>Address 0xc00035a6c8 is in a rw- anonymous segment</auxwhat>
  <suppression>
    <sname>insert_a_suppression_name_here</sname>
    <skind>Memcheck:Addr4</skind>
    <sframe> <fun>__tsan_go_atomic32_load</fun> </sframe>
    <sframe> <fun>racecall</fun> </sframe>
    <rawtext>
<![CDATA[
{
   <insert_a_suppression_name_here>
   Memcheck:Addr4
   fun:__tsan_go_atomic32_load
   fun:racecall
}
]]>
    </rawtext>
  </suppression>

liw avatar Jul 17 '24 00:07 liw