
DAOS-17598 vos: misc enhancements for handling DTX conflict - b26

Open Nasf-Fan opened this issue 7 months ago • 22 comments

  1. Use IT_FOR_CHECK when iterating OBJ for DDB dv_path_verify

From the DDB perspective, a target with a non-aborted DTX should be visible even if the related DTX is only prepared locally and not yet globally committed. It is the caller's duty to decide how to handle a non-committed target in subsequent logic; otherwise DTX resync will handle such a DTX later.

The DDB logic will use DAOS_INTENT_CHECK instead of DAOS_INTENT_DEFAULT to indicate this when iterating OBJ during dv_path_verify().

  2. Do not create the VOS LRU metrics entry when initializing standalone TLS. This avoids a confusing error message from the DDB utilities.

  3. Dump DTX information when a conflict is detected.

  4. Introduce a VOS diagnose mode that allows a pool/container to be opened even if there is some corruption, so that the related issue may be fixed or handled via subsequent operations. It is controlled via the server-side environment variable "DAOS_DIAG_MODE", which is zero by default. To handle potential VOS corruption when opening the pool/container, set it explicitly to 1 for "check" or 2 for "repair" when starting the engine. If some corruption is detected in "check" mode, the related inconsistency or corruption is dumped, but opening the related pool/container will ultimately fail to avoid further damage to the system.

  5. Verify the validity of active DTX entries when reindexing them, and filter out invalid ones. If DAOS_DIAG_MODE is set to "repair" mode, assign a new local ID when a conflicting DTX is hit, so that subsequent DTX resync can handle such entries globally.

  6. Do not clear the "dae_need_release" flag after a partial commit, to avoid the DTX entry being evicted from DRAM but left on disk.
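
For illustration only, the DAOS_DIAG_MODE behavior described in this message could be sketched as below. This is not the code in the patch: the enum, the function names, and the return convention are all hypothetical. Only the environment variable name and the values 0, 1 ("check") and 2 ("repair") come from the description. The sketch also assumes that an unset or invalid value falls back to 0, and that a pool/container open with detected corruption fails when the mode is 0.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names; only the numeric values come from the description. */
enum diag_mode {
	DIAG_MODE_OFF    = 0, /* default */
	DIAG_MODE_CHECK  = 1, /* dump corruption, then fail the open */
	DIAG_MODE_REPAIR = 2, /* allow the open so later steps can fix things */
};

/* Parse a DAOS_DIAG_MODE string. Assumption: unset or invalid means off. */
static enum diag_mode
diag_mode_parse(const char *val)
{
	char *end;
	long  num;

	if (val == NULL || *val == '\0')
		return DIAG_MODE_OFF;

	num = strtol(val, &end, 10);
	if (*end != '\0' || num < DIAG_MODE_OFF || num > DIAG_MODE_REPAIR)
		return DIAG_MODE_OFF;

	return (enum diag_mode)num;
}

/* Read the mode from the engine's environment. */
static enum diag_mode
diag_mode_from_env(void)
{
	return diag_mode_parse(getenv("DAOS_DIAG_MODE"));
}

/*
 * Decide whether a pool/container open may proceed.
 * Returns 0 to continue the open, -1 to fail it.
 */
static int
diag_open_result(enum diag_mode mode, int corrupted)
{
	assert(mode >= DIAG_MODE_OFF && mode <= DIAG_MODE_REPAIR);

	if (!corrupted)
		return 0;

	if (mode == DIAG_MODE_REPAIR)
		return 0; /* keep going; repair happens in subsequent operations */

	if (mode == DIAG_MODE_CHECK)
		fprintf(stderr, "diag: corruption detected, failing open\n");

	return -1; /* "check" (per the description) and off (assumed) both fail */
}
```

A caller would typically evaluate `diag_mode_from_env()` once at engine start and pass the result to `diag_open_result()` at open time, so the mode cannot change under a running engine.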

Signed-off-by: Fan Yong [email protected]

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

Nasf-Fan avatar May 28 '25 07:05 Nasf-Fan

Errors are Unable to load ticket data https://daosio.atlassian.net/browse/DAOS-17598

github-actions[bot] avatar May 28 '25 07:05 github-actions[bot]

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/2/testReport/

daosbuild3 avatar May 28 '25 08:05 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/2/testReport/

daosbuild3 avatar May 28 '25 08:05 daosbuild3

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16445/3/display/redirect

daosbuild3 avatar May 28 '25 11:05 daosbuild3

Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16445/3/display/redirect

daosbuild3 avatar May 28 '25 11:05 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/4/testReport/

daosbuild3 avatar May 28 '25 12:05 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/5/testReport/

daosbuild3 avatar May 28 '25 15:05 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/7/testReport/

daosbuild3 avatar May 28 '25 18:05 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/8/testReport/

daosbuild3 avatar May 29 '25 04:05 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/9/testReport/

daosbuild3 avatar May 29 '25 13:05 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/10/testReport/

daosbuild3 avatar May 29 '25 14:05 daosbuild3

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/11/testReport/

daosbuild3 avatar May 30 '25 03:05 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/11/testReport/

daosbuild3 avatar May 30 '25 04:05 daosbuild3

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/12/testReport/

daosbuild3 avatar Jun 01 '25 05:06 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/12/testReport/

daosbuild3 avatar Jun 01 '25 05:06 daosbuild3

Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16445/12/display/redirect

daosbuild3 avatar Jun 01 '25 08:06 daosbuild3

Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16445/12/execution/node/1393/log

daosbuild3 avatar Jun 01 '25 15:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16445/12/execution/node/1483/log

daosbuild3 avatar Jun 01 '25 23:06 daosbuild3

Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16445/13/display/redirect

daosbuild3 avatar Jun 02 '25 05:06 daosbuild3

Test stage Functional Hardware Medium completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/13/testReport/

daosbuild3 avatar Jun 02 '25 08:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16445/13/testReport/

daosbuild3 avatar Jun 02 '25 12:06 daosbuild3

test_ior_small failed because of DAOS-17573, which is not related to this patch.

Nasf-Fan avatar Jun 03 '25 01:06 Nasf-Fan

This is no longer needed.

Nasf-Fan avatar Aug 08 '25 11:08 Nasf-Fan