DAOS-17639 test: Detect all server fabric_ifaces

Open phender opened this issue 3 months ago • 38 comments

Launch.py will detect all of the fastest interfaces common to all the specified server hosts and use them to populate the engine fabric_iface entries if no overrides are provided in the test yaml.
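
The selection described above can be sketched as follows. This is a minimal illustration, not the code in launch.py: the function name `common_fastest_interfaces`, the input shape (host name mapped to `{interface: speed}`), and the speed values are all assumptions made for the example.

```python
from typing import Dict, List


def common_fastest_interfaces(host_ifaces: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the fastest interfaces that every host reports.

    host_ifaces maps a host name to {interface name: speed}.
    This input shape is hypothetical, chosen for illustration only.
    """
    if not host_ifaces:
        return []
    per_host = list(host_ifaces.values())

    # Keep only the interface names present on every host.
    common = set(per_host[0])
    for ifaces in per_host[1:]:
        common &= set(ifaces)
    if not common:
        return []

    # Rate each common interface by the lowest speed any host reports for it.
    speed = {name: min(ifaces[name] for ifaces in per_host) for name in common}

    # Return every interface that ties for the highest speed, in a stable order.
    fastest = max(speed.values())
    return sorted(name for name, value in speed.items() if value == fastest)


# Illustrative data only; the speeds are made up.
hosts = {
    "hdr-232": {"ib_cpu0_0": 200000, "ib_cpu1_0": 200000, "eth0": 1000},
    "hdr-233": {"ib_cpu0_0": 200000, "ib_cpu1_0": 200000, "eth0": 1000, "eth1": 1000},
}
print(common_fastest_interfaces(hosts))  # ['ib_cpu0_0', 'ib_cpu1_0']
```

Returning all interfaces that tie for the top speed, rather than a single one, is what allows one engine per interface to be configured.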

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: IorSmall

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

phender avatar Sep 25 '25 23:09 phender

Ticket title is 'Support newly named ib devices for functional tests'
Status is 'In Review'
Labels: 'testp1'
https://daosio.atlassian.net/browse/DAOS-17639

github-actions[bot] avatar Sep 25 '25 23:09 github-actions[bot]

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/4/execution/node/557/log

daosbuild3 avatar Sep 25 '25 23:09 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/3/execution/node/805/log

daosbuild3 avatar Sep 26 '25 00:09 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/6/execution/node/747/log

daosbuild3 avatar Sep 26 '25 04:09 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/5/execution/node/805/log

daosbuild3 avatar Sep 26 '25 06:09 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/7/execution/node/805/log

daosbuild3 avatar Sep 26 '25 15:09 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/10/execution/node/665/log

daosbuild3 avatar Sep 27 '25 07:09 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/11/execution/node/557/log

daosbuild3 avatar Sep 27 '25 08:09 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/895/log

daosbuild3 avatar Oct 03 '25 03:10 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/954/log

daosbuild3 avatar Oct 03 '25 14:10 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/909/log

daosbuild3 avatar Oct 03 '25 20:10 daosbuild3

Failures seen in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/14/testReport/ are known issues or should not be related to PR changes - in all cases the servers started successfully:

  • 2-./container/boundary.py:BoundaryTest.test_container_boundary - https://daosio.atlassian.net/browse/DAOS-18040
  • 1-./erasurecode/multiple_rank_failure.py:EcodOnlineMultiRankFail.test_ec_multiple_rank_failure - https://daosio.atlassian.net/browse/DAOS-16339
  • 1-./soak/smoke.py:SoakSmoke.test_soak_smoke - https://daosio.atlassian.net/browse/DAOS-18043
  • 6-./nvme/enospace.py:NvmeEnospace.test_enospace_no_aggregation - This test is supposed to fail with DER_NOSPACE, but it passed
  • 4-./recovery/pool_list_consolidation.py:PoolListConsolidationTest.test_lost_majority_ps_replicas -
  • 19-./daos_test/suite.py:DaosCoreTest.test_daos_rebuild_simple_interactive - timeout waiting for rebuild
  • 24-./daos_test/suite.py:DaosCoreTest.test_daos_rebuild_ec - https://daosio.atlassian.net/browse/DAOS-17657
  • 1-./recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - https://daosio.atlassian.net/browse/DAOS-17977

phender avatar Oct 06 '25 13:10 phender

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/15/execution/node/553/log

daosbuild3 avatar Oct 06 '25 15:10 daosbuild3

We still need to resolve the IOMMU issue on hdr-233, e.g.:

2025-10-06 15:04:11,977 server_utils     L0592 INFO | Resetting DAOS server storage: /usr/bin/daos_server nvme reset --ignore-config
2025-10-06 15:04:11,977 run_utils        L0481 DEBUG| Running on hdr-[232-233] with a 120 second timeout: export COVFILE=/tmp/test.cov; /usr/bin/daos_server nvme reset --ignore-config
2025-10-06 15:04:16,895 run_utils        L0343 DEBUG|   hdr-232 (rc=0): <no output>
2025-10-06 15:04:16,895 run_utils        L0347 DEBUG|   hdr-233 (rc=1):
2025-10-06 15:04:16,895 run_utils        L0352 DEBUG|     ERROR: processing request parameters: storage: code = 311 description = "IOMMU capability is required to access NVMe devices but no IOMMU capability detected"
2025-10-06 15:04:16,895 run_utils        L0352 DEBUG|     ERROR: storage: code = 311 resolution = "enable IOMMU per the DAOS Admin Guide"

We were able to get a few tests to pass on the hdr-23 cluster, such as 1-./container/snapshot_aggregation.py:SnapshotAggregation.test_snapshot_aggregation, which used a server config containing:

engines:
- fabric_iface: ib_cpu0_0
  fabric_iface_port: 31317
- fabric_iface: ib_cpu1_0
  fabric_iface_port: 31417
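
As a rough sketch of how detected interface names could be turned into engine entries like the ones above: the helper `build_engine_entries` and its `base_port` and `port_step` parameters are hypothetical, and the port spacing of 100 is only inferred from the two values shown in this config.

```python
from typing import Dict, List, Union


def build_engine_entries(
    interfaces: List[str], base_port: int = 31317, port_step: int = 100
) -> List[Dict[str, Union[str, int]]]:
    """Build one engine entry per interface.

    base_port and port_step are assumptions inferred from the example
    config (31317, 31417); they are not taken from the launch.py source.
    """
    return [
        {"fabric_iface": name, "fabric_iface_port": base_port + index * port_step}
        for index, name in enumerate(interfaces)
    ]


engines = build_engine_entries(["ib_cpu0_0", "ib_cpu1_0"])
for entry in engines:
    print(entry)
```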

phender avatar Oct 06 '25 21:10 phender

hdr-233 has been fixed so that VT/d is now actually enabled.

JohnMalmberg avatar Oct 06 '25 22:10 JohnMalmberg

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/18/execution/node/553/log

daosbuild3 avatar Oct 07 '25 08:10 daosbuild3

> hdr-233 has been fixed so that VT/d is now actually enabled.

Now hdr-234 is reporting a problem:

2025-10-07 06:58:54,617 run_utils        L0481 DEBUG| Running on hdr-[232-234] with a 120 second timeout: export COVFILE=/tmp/test.cov; /usr/bin/daos_server nvme reset --ignore-config
2025-10-07 06:58:59,539 run_utils        L0343 DEBUG|   hdr-[232-233] (rc=0): <no output>
2025-10-07 06:58:59,539 run_utils        L0347 DEBUG|   hdr-234 (rc=1):
2025-10-07 06:58:59,539 run_utils        L0352 DEBUG|     ERROR: processing request parameters: storage: code = 311 description = "IOMMU capability is required to access NVMe devices but no IOMMU capability detected"
2025-10-07 06:58:59,539 run_utils        L0352 DEBUG|     ERROR: storage: code = 311 resolution = "enable IOMMU per the DAOS Admin Guide"

phender avatar Oct 07 '25 13:10 phender

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/17/execution/node/943/log

daosbuild3 avatar Oct 07 '25 13:10 daosbuild3

Found more nodes with "vt/d" disabled and made sure it is enabled on all hdr-23x systems.

JohnMalmberg avatar Oct 07 '25 14:10 JohnMalmberg

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/17/execution/node/971/log

daosbuild3 avatar Oct 08 '25 00:10 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/17/execution/node/957/log

daosbuild3 avatar Oct 08 '25 08:10 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/19/execution/node/911/log

daosbuild3 avatar Oct 09 '25 04:10 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/19/execution/node/970/log

daosbuild3 avatar Oct 09 '25 15:10 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/19/execution/node/925/log

daosbuild3 avatar Oct 09 '25 21:10 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/20/execution/node/553/log

daosbuild3 avatar Oct 10 '25 06:10 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/22/execution/node/954/log

daosbuild3 avatar Oct 10 '25 14:10 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/22/execution/node/909/log

daosbuild3 avatar Oct 10 '25 22:10 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/22/execution/node/864/log

daosbuild3 avatar Oct 11 '25 05:10 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/23/execution/node/519/log

daosbuild3 avatar Oct 14 '25 23:10 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/23/execution/node/429/log

daosbuild3 avatar Oct 15 '25 07:10 daosbuild3