
DAOS-10002 test: Add rank failure test

Open shimizukko opened this issue 3 years ago • 18 comments

Rank Failure Test

Run the following test steps with data protection (rf:1), once with the RP_3GX object class and once with EC_4P2GX.

  1. Create a pool, then create a container with a redundancy factor.
  2. Run IOR with the given object class and let it run through step 7.
  3. While IOR is running, kill all daos_engine processes on a non-access-point node.
  4. Wait for IOR to complete.
  5. Verify that IOR failed.
  6. Wait for rebuild to finish.
  7. Restart daos_servers.
  8. Verify the system status by calling dmg system query.
  9. Call dmg pool query -b to find the disabled ranks.
  10. Call dmg pool reintegrate one rank at a time to enable all ranks.
  11. Verify that the container Health is HEALTHY.
  12. Run IOR and verify that it works.
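
Steps 9 and 10 above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `reintegrate_ranks`, `run`, and `wait_for_rebuild` are hypothetical names, and the `--rank=N` argument form is an assumption that should be checked against `dmg pool reintegrate --help` for the DAOS version under test. Waiting for rebuild after each rank is also an assumption about how "one rank at a time" is enforced.

```python
from typing import Callable, Iterable, List


def reintegrate_ranks(
    pool: str,
    disabled_ranks: Iterable[int],
    run: Callable[[List[str]], None],
    wait_for_rebuild: Callable[[], None],
) -> List[int]:
    """Reintegrate each disabled rank in turn, waiting for rebuild in between.

    `run` executes one command (given as an argv list) and `wait_for_rebuild`
    blocks until rebuild has finished. Both are injected so the sequencing
    can be exercised without a live DAOS system.
    """
    reintegrated = []
    for rank in sorted(set(disabled_ranks)):
        # Assumed CLI form; the exact flag name may differ between releases.
        run(["dmg", "pool", "reintegrate", pool, f"--rank={rank}"])
        # Do not start the next reintegration until this rebuild is done.
        wait_for_rebuild()
        reintegrated.append(rank)
    return reintegrated
```

In a real test, `run` would wrap the framework's dmg helper and `disabled_ranks` would come from the output of dmg pool query -b.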

Rank Failure Isolation Test

Stop daos_engine where pool is not created.

  1. Determine the two ranks to create the pool and a node to kill the engines.
  2. Create a pool across two ranks on the same node.
  3. Create a container without redundancy factor.
  4. Run IOR with oclass SX.
  5. While IOR is running, kill the daos_engine processes of two ranks where the pool isn’t created. This simulates a node failure that doesn’t affect the user because their pool isn’t on the failed node (assuming that everything else, such as the network and the client node, is still working).
  6. Verify that IOR finishes successfully.
  7. Verify that the container Health is HEALTHY.
  8. To further verify that the pool isn’t affected, create a new container on the pool and run IOR.
  9. To make avocado happy, restart daos_servers on the node where the engines were killed.
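
Step 1 above (choosing the pool ranks and the node to kill) could look something like the sketch below. It assumes a rank-to-host mapping is already available, for example from dmg system query; the function name `pick_pool_and_victim` and the optional `exclude` set (for hosts such as access points that should not be killed) are hypothetical and not part of this PR.

```python
from typing import Dict, Iterable, List, Tuple


def pick_pool_and_victim(
    rank_to_host: Dict[int, str], exclude: Iterable[str] = ()
) -> Tuple[List[int], str]:
    """Return (two pool ranks on one host, a different host to kill).

    Hosts listed in `exclude` are never chosen as the host to kill.
    """
    # Group ranks by the host they run on.
    by_host: Dict[str, List[int]] = {}
    for rank, host in sorted(rank_to_host.items()):
        by_host.setdefault(host, []).append(rank)

    # Only hosts with at least two engines qualify for either role.
    candidates = [h for h in sorted(by_host) if len(by_host[h]) >= 2]
    if not candidates:
        raise ValueError("no host runs two engines")
    pool_host = candidates[0]

    excluded = set(exclude)
    victims = [h for h in candidates if h != pool_host and h not in excluded]
    if not victims:
        raise ValueError("no separate host available to kill")
    return by_host[pool_host][:2], victims[0]
```

Keeping the pool on one host and the killed engines on another is what makes the failure invisible to the IOR run in step 6.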

Skip-unit-tests: true
Test-tag: rank_failure

Signed-off-by: Makito Kano [email protected]

shimizukko avatar Apr 29 '22 17:04 shimizukko

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/3/execution/node/362/log

daosbuild1 avatar May 01 '22 10:05 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/5/execution/node/1050/log

daosbuild1 avatar May 01 '22 17:05 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/6/execution/node/362/log

daosbuild1 avatar May 01 '22 18:05 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/10/execution/node/103/log

daosbuild1 avatar May 06 '22 16:05 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/11/execution/node/104/log

daosbuild1 avatar May 06 '22 17:05 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/14/execution/node/104/log

daosbuild1 avatar May 09 '22 18:05 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/17/execution/node/103/log

daosbuild1 avatar May 12 '22 17:05 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/21/execution/node/380/log

daosbuild1 avatar Jun 11 '22 09:06 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/22/execution/node/887/log

daosbuild1 avatar Jun 20 '22 21:06 daosbuild1

Bug-tracker data:
Ticket title is 'On-Site Fault Management Test - Server Rank Failure'
Status is 'Blocked'
https://daosio.atlassian.net/browse/DAOS-10002

github-actions[bot] avatar Aug 10 '22 17:08 github-actions[bot]

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8874/23/display/redirect

daosbuild1 avatar Aug 10 '22 23:08 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/29/testReport/(root)/

daosbuild1 avatar Sep 07 '22 00:09 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/30/testReport/(root)/

daosbuild1 avatar Sep 07 '22 19:09 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/31/testReport/(root)/

daosbuild1 avatar Sep 09 '22 09:09 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/33/testReport/(root)/

daosbuild1 avatar Sep 11 '22 05:09 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/36/execution/node/674/log

daosbuild1 avatar Sep 17 '22 01:09 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/37/execution/node/673/log

daosbuild1 avatar Sep 18 '22 16:09 daosbuild1

After reintegrating the first rank on line 212, rebuild stays stuck in the busy state even after waiting for 480 sec.
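
For reference, a bounded wait of this kind can be sketched as below. This is a generic polling loop, not the test framework's actual rebuild check: `wait_for_state` and `get_state` are hypothetical names, and the "done" state string is an assumption about what the pool query reports once rebuild completes.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    done_state: str = "done",
    timeout: float = 480.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it returns done_state or the timeout expires.

    Returns True if done_state was observed, False on timeout. The clock and
    sleep functions are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == done_state:
            return True
        if clock() >= deadline:
            # Still not done (for example, still "busy") when time ran out.
            return False
        sleep(interval)
```

Returning False rather than raising lets the caller log the last pool query output before failing the test.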

shimizukko avatar Oct 13 '22 15:10 shimizukko

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/38/execution/node/670/log

daosbuild1 avatar Oct 26 '22 03:10 daosbuild1

server_rank_failure.yaml needs to be fixed to follow the new format for the storage section:

server_config:
  name: daos_server
  crt_timeout: 60
  engines_per_host: 2
  engines:
    0:
      pinned_numa_node: 0
      nr_xs_helpers: 1
      fabric_iface: ib0
      fabric_iface_port: 31317
      log_file: daos_server0.log
      log_mask: DEBUG
      storage:
        0:
          class: dcpm
          scm_list: ["/dev/pmem0"]
          scm_mount: /mnt/daos0
        1:
          class: nvme
          bdev_list: ["0000:00:00.0"]
      targets: 16
      env_vars:
        - SWIM_PROTOCOL_PERIOD_LEN=2000
        - SWIM_SUSPECT_TIMEOUT=19000
        - SWIM_PING_TIMEOUT=1900
        - DD_MASK=io,rebuild
    1:
      pinned_numa_node: 1
      nr_xs_helpers: 1
      fabric_iface: ib1
      fabric_iface_port: 31417
      log_file: daos_server1.log
      log_mask: DEBUG
      storage:
        0:
          class: dcpm
          scm_list: ["/dev/pmem1"]
          scm_mount: /mnt/daos1
        1:
          class: nvme
          bdev_list: ["0000:00:00.1"]
      targets: 16
      env_vars:
        - SWIM_PROTOCOL_PERIOD_LEN=2000
        - SWIM_SUSPECT_TIMEOUT=19000
        - SWIM_PING_TIMEOUT=1900
        - DD_MASK=io,rebuild
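
As an illustration of the change, the sketch below converts one engine entry from a flat layout into the tiered `storage` layout shown above. The flat key names (`scm_class`, `bdev_class`) are an assumption about what the old test yaml used, and `to_tiered_storage` is a hypothetical helper, not something in the DAOS test framework.

```python
from typing import Any, Dict


def to_tiered_storage(engine: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `engine` with flat storage keys moved into tiers.

    Assumed old keys: scm_class, scm_list, scm_mount, bdev_class, bdev_list.
    Keys unrelated to storage are copied through unchanged.
    """
    result = dict(engine)  # shallow copy; the input is left untouched
    tiers: Dict[int, Dict[str, Any]] = {}

    if "scm_class" in result:
        tier = {"class": result.pop("scm_class")}
        for key in ("scm_list", "scm_mount"):
            if key in result:
                tier[key] = result.pop(key)
        tiers[len(tiers)] = tier

    if "bdev_class" in result:
        tier = {"class": result.pop("bdev_class")}
        if "bdev_list" in result:
            tier["bdev_list"] = result.pop("bdev_list")
        tiers[len(tiers)] = tier

    if tiers:
        result["storage"] = tiers
    return result
```

Tier indices start at 0 and follow the order SCM first, then NVMe, matching the yaml above.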

shimizukko avatar Nov 22 '22 23:11 shimizukko

The issue with rebuild getting stuck in the busy state might be fixed once the rebuild detection method is updated by https://github.com/daos-stack/daos/pull/10829

shimizukko avatar Nov 22 '22 23:11 shimizukko

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/39/execution/node/670/log

daosbuild1 avatar Nov 23 '22 04:11 daosbuild1

Test still fails while waiting for rebuild: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-8874/39/tests

I'm not sure #10829 will help, since dmg pool query returns busy and #10829 is more about interpreting version+state correctly. So it seems rebuild is actually taking a long time. I'll try with an increased timeout.

daltonbohning avatar Nov 23 '22 22:11 daltonbohning

Interestingly, test_server_rank_failure_with_rp passed 6 times, but test_server_rank_failure_with_ec failed

daltonbohning avatar Nov 23 '22 22:11 daltonbohning

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/40/execution/node/670/log

daosbuild1 avatar Nov 24 '22 03:11 daosbuild1

The latest runs timed out with issues that I think #10829 will resolve: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8874/40/testReport/

daltonbohning avatar Nov 28 '22 18:11 daltonbohning

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/41/execution/node/678/log

daosbuild1 avatar Jan 06 '23 07:01 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/42/execution/node/708/log

daosbuild1 avatar Jan 23 '23 04:01 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/43/execution/node/708/log

daosbuild1 avatar Jan 24 '23 00:01 daosbuild1

The test hangs for about 30 min. after daos container get-prop. Both repeated runs show the same issue.

21:36:05 INFO | Running '/usr/bin/daos -j container get-prop F5E59CE1-0297-4A2C-9698-5A22B99F7713 FEEED6BD-D772-42C0-B67E-BDC3BB62BDE4 --properties=status'
21:36:05 DEBUG| [stdout] {
21:36:05 DEBUG| [stdout]   "response": [
21:36:05 DEBUG| [stdout]     {
21:36:05 DEBUG| [stdout]       "value": "HEALTHY",
21:36:05 DEBUG| [stdout]       "name": "status",
21:36:05 DEBUG| [stdout]       "description": "Health"
21:36:05 DEBUG| [stdout]     }
21:36:05 DEBUG| [stdout]   ],
21:36:05 DEBUG| [stdout]   "error": null,
21:36:05 DEBUG| [stdout]   "status": 0
21:36:05 DEBUG| [stdout] }
22:04:44 ERROR| 
22:04:44 ERROR| Reproduced traceback from: /usr/lib/python3.6/site-packages/avocado/core/test.py:767
22:04:44 ERROR| Traceback (most recent call last):
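
The health check on that JSON output can be sketched as below. It relies only on the fields visible in the log above (`response`, `name`, `value`, `error`, `status`); the function name `container_health` is hypothetical.

```python
import json


def container_health(stdout: str) -> str:
    """Extract the Health value from `daos -j container get-prop` output."""
    data = json.loads(stdout)
    # A non-zero status or a non-null error means the command itself failed.
    if data.get("status") != 0 or data.get("error"):
        raise RuntimeError(f"get-prop failed: {data.get('error')!r}")
    for prop in data.get("response") or []:
        if prop.get("name") == "status":
            return prop["value"]
    raise KeyError("no 'status' property in get-prop response")
```

With the output logged at 21:36:05 this returns "HEALTHY", so the command itself completed and the hang starts afterwards.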

Rerun the test and see if it's consistent.

shimizukko avatar Jan 24 '23 05:01 shimizukko