DAOS-10002 test: Add rank failure test
Rank Failure Test
Run the following test steps with data protection (rf:1) using the RP_3GX and EC_4P2GX object classes (a CLI-level sketch of the recovery steps follows the list).
- Create a pool, then create a container with the redundancy factor property set.
- Run IOR with the given object class and let it run through step 7.
- While IOR is running, kill all daos_engine processes on a non-access-point node.
- Wait for IOR to complete.
- Verify that IOR failed.
- Wait for rebuild to finish.
- Restart daos_servers.
- Verify the system status by calling dmg system query.
- Call dmg pool query -b to find the disabled ranks.
- Call dmg pool reintegrate one rank at a time to enable all ranks.
- Verify that the container Health is HEALTHY.
- Run IOR and verify that it works.
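For illustration only (not the test code in this PR), the recovery steps above map roughly onto the following CLI calls, sketched in Python. The pool/container labels, rank numbers, and exact flag spellings are assumptions and may differ by DAOS version; the JSON parsing follows the get-prop output shown in the log further down.

import json
import subprocess
import time

POOL = "pool_label"   # hypothetical pool label
CONT = "cont_label"   # hypothetical container label


def run(cmd):
    """Run a command, echo it, and return its stdout."""
    print("Running:", " ".join(cmd))
    return subprocess.check_output(cmd, universal_newlines=True)


# Verify the system state after restarting daos_server.
print(run(["dmg", "system", "query"]))

# List the disabled ranks (-b shows disabled ranks, as in the steps above).
print(run(["dmg", "pool", "query", "-b", POOL]))

# Reintegrate one rank at a time; a real test waits for rebuild between calls.
for rank in (2, 3):   # example rank numbers
    run(["dmg", "pool", "reintegrate", POOL, "--rank", str(rank)])
    time.sleep(30)    # placeholder for a proper rebuild wait

# Check that the container reports HEALTHY (same JSON shape as the log below).
out = run(["daos", "-j", "container", "get-prop", POOL, CONT, "--properties=status"])
status = json.loads(out)["response"][0]["value"]
assert status == "HEALTHY", status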
Rank Failure Isolation Test
Stop daos_engine processes on a node where the pool is not created.
- Determine the two ranks to create the pool and a node to kill the engines.
- Create a pool across two ranks on the same node.
- Create a container without redundancy factor.
- Run IOR with oclass SX.
- While IOR is running, kill the daos_engine processes on two of the ranks where the pool isn't created (see the sketch after these steps). This simulates a node failure that doesn't affect the user, because their pool isn't created on the failed node (assuming that everything else, such as the network and the client node, is still working).
- Verify that IOR finishes successfully.
- Verify that the container Health is HEALTHY.
- To further verify that the pool isn’t affected, create a new container on the pool and run IOR.
- To make avocado happy, restart daos_servers on the node where the engines were killed.
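A minimal sketch of the fault-injection and cleanup steps above, assuming a systemd-managed daos_server and password-less ssh; the hostname is hypothetical and this is not the framework code used by the test.

import subprocess

VICTIM = "node-without-pool"  # hypothetical host; none of the pool's ranks live here


def ssh(host, command):
    """Run a command on a remote host."""
    subprocess.check_call(["ssh", host, command])


# Simulate the node failure: kill both engines on the victim node while IOR runs.
ssh(VICTIM, "sudo pkill --signal KILL daos_engine")

# ... IOR should still finish successfully, since the pool has no targets here ...

# Bring the node back so the end-of-test health checks pass.
ssh(VICTIM, "sudo systemctl restart daos_server")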
Skip-unit-tests: true
Test-tag: rank_failure
Signed-off-by: Makito Kano [email protected]
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/3/execution/node/362/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/5/execution/node/1050/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/6/execution/node/362/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/10/execution/node/103/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/11/execution/node/104/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/14/execution/node/104/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/17/execution/node/103/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/21/execution/node/380/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/22/execution/node/887/log
Bug-tracker data: Ticket title is 'On-Site Fault Management Test - Server Rank Failure' Status is 'Blocked' https://daosio.atlassian.net/browse/DAOS-10002
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8874/23/display/redirect
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/29/testReport/(root)/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/30/testReport/(root)/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/31/testReport/(root)/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/33/testReport/(root)/
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/36/execution/node/674/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/37/execution/node/673/log
After reintegrating the first rank on line 212, rebuild stays stuck in the busy state even after waiting for 480 sec.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/38/execution/node/670/log
server_rank_failure.yaml needs to be fixed to follow the new format for the storage section:
server_config:
  name: daos_server
  crt_timeout: 60
  engines_per_host: 2
  engines:
    0:
      pinned_numa_node: 0
      nr_xs_helpers: 1
      fabric_iface: ib0
      fabric_iface_port: 31317
      log_file: daos_server0.log
      log_mask: DEBUG
      storage:
        0:
          class: dcpm
          scm_list: ["/dev/pmem0"]
          scm_mount: /mnt/daos0
        1:
          class: nvme
          bdev_list: ["0000:00:00.0"]
      targets: 16
      env_vars:
        - SWIM_PROTOCOL_PERIOD_LEN=2000
        - SWIM_SUSPECT_TIMEOUT=19000
        - SWIM_PING_TIMEOUT=1900
        - DD_MASK=io,rebuild
    1:
      pinned_numa_node: 1
      nr_xs_helpers: 1
      fabric_iface: ib1
      fabric_iface_port: 31417
      log_file: daos_server1.log
      log_mask: DEBUG
      storage:
        0:
          class: dcpm
          scm_list: ["/dev/pmem1"]
          scm_mount: /mnt/daos1
        1:
          class: nvme
          bdev_list: ["0000:00:00.1"]
      targets: 16
      env_vars:
        - SWIM_PROTOCOL_PERIOD_LEN=2000
        - SWIM_SUSPECT_TIMEOUT=19000
        - SWIM_PING_TIMEOUT=1900
        - DD_MASK=io,rebuild
The issue with rebuild getting stuck with busy might be fixed when the rebuild detection method is updated with https://github.com/daos-stack/daos/pull/10829
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/39/execution/node/670/log
Test still fails while waiting for rebuild: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-8874/39/tests
I'm not sure if #10829 will help, since dmg pool query returns busy, and #10829 is more about interpreting version+state correctly. So it seems rebuild is actually taking a long time. I'll try with an increased timeout.
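For reference, an increased timeout would look roughly like the polling loop below. This is only a sketch: the string match on the Rebuild line is an assumption about dmg's text output and is not the detection logic from #10829, and 960 s is just an example value.

import subprocess
import time


def wait_for_rebuild(pool, timeout=960, interval=10):
    """Poll `dmg pool query` until the rebuild line no longer reports busy."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(["dmg", "pool", "query", pool],
                                      universal_newlines=True)
        rebuild = [line for line in out.splitlines() if "Rebuild" in line]
        if rebuild and "busy" not in rebuild[0]:
            return True
        time.sleep(interval)
    return False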
Interestingly, test_server_rank_failure_with_rp passed 6 times, but test_server_rank_failure_with_ec failed.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/40/execution/node/670/log
The latest runs timed out with issues that I think #10829 will resolve: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8874/40/testReport/
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/41/execution/node/678/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/42/execution/node/708/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/43/execution/node/708/log
The test hangs for 30 min. at daos container get-prop. Both repeated runs show the same issue.
21:36:05 INFO | Running '/usr/bin/daos -j container get-prop F5E59CE1-0297-4A2C-9698-5A22B99F7713 FEEED6BD-D772-42C0-B67E-BDC3BB62BDE4 --properties=status'
21:36:05 DEBUG| [stdout] {
21:36:05 DEBUG| [stdout] "response": [
21:36:05 DEBUG| [stdout] {
21:36:05 DEBUG| [stdout] "value": "HEALTHY",
21:36:05 DEBUG| [stdout] "name": "status",
21:36:05 DEBUG| [stdout] "description": "Health"
21:36:05 DEBUG| [stdout] }
21:36:05 DEBUG| [stdout] ],
21:36:05 DEBUG| [stdout] "error": null,
21:36:05 DEBUG| [stdout] "status": 0
21:36:05 DEBUG| [stdout] }
22:04:44 ERROR|
22:04:44 ERROR| Reproduced traceback from: /usr/lib/python3.6/site-packages/avocado/core/test.py:767
22:04:44 ERROR| Traceback (most recent call last):
Rerun the test and see if it's consistent.
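One defensive option while rerunning: bound the get-prop call with a hard timeout so a hang like the one above fails fast instead of stalling for 30 minutes. This is only a sketch; the 120 s value is arbitrary.

import subprocess


def get_container_status(pool, cont, timeout=120):
    """Query the container status property, failing fast if the CLI hangs."""
    cmd = ["daos", "-j", "container", "get-prop", pool, cont, "--properties=status"]
    try:
        return subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                              universal_newlines=True, timeout=timeout).stdout
    except subprocess.TimeoutExpired:
        raise RuntimeError("daos container get-prop hung for {}s".format(timeout))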