DAOS-10002 test: Add rank failure test
Rank Failure Test
Run the following test steps with data protection (rf:1) using the RP_3GX and EC_4P2GX object classes (a CLI-level sketch of the recovery steps follows the list).
- Create a pool, then create a container with the redundancy factor property set.
- Run IOR with the given object class and let it run through step 7.
- While IOR is running, kill all daos_engine processes on a non-access-point node.
- Wait for IOR to complete.
- Verify that IOR failed.
- Wait for rebuild to finish.
- Restart daos_servers.
- Verify the system status by calling dmg system query.
- Call dmg pool query -b to find the disabled ranks.
- Call dmg pool reintegrate one rank at a time to enable all ranks.
- Verify that the container Health is HEALTHY.
- Run IOR and verify that it works.
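For illustration only (not the test code in this PR), the recovery steps above map roughly onto the following CLI calls, sketched in Python. The pool/container labels, rank numbers, and exact flag spellings are assumptions and may differ by DAOS version; the JSON parsing follows the get-prop output shown in the log further down.

import json
import subprocess
import time

POOL = "pool_label"   # hypothetical pool label
CONT = "cont_label"   # hypothetical container label


def run(cmd):
    """Run a command, echo it, and return its stdout."""
    print("Running:", " ".join(cmd))
    return subprocess.check_output(cmd, universal_newlines=True)


# Verify the system state after restarting daos_server.
print(run(["dmg", "system", "query"]))

# List the disabled ranks (-b shows disabled ranks, as in the steps above).
print(run(["dmg", "pool", "query", "-b", POOL]))

# Reintegrate one rank at a time; a real test waits for rebuild between calls.
for rank in (2, 3):   # example rank numbers
    run(["dmg", "pool", "reintegrate", POOL, "--rank", str(rank)])
    time.sleep(30)    # placeholder for a proper rebuild wait

# Check that the container reports HEALTHY (same JSON shape as the log below).
out = run(["daos", "-j", "container", "get-prop", POOL, CONT, "--properties=status"])
status = json.loads(out)["response"][0]["value"]
assert status == "HEALTHY", status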
Rank Failure Isolation Test
Stop daos_engine processes on a node where the pool is not created.
- Determine the two ranks to create the pool and a node to kill the engines.
- Create a pool across two ranks on the same node.
- Create a container without redundancy factor.
- Run IOR with oclass SX.
- While IOR is running, kill the daos_engine processes on two of the ranks where the pool isn't created (see the sketch after these steps). This simulates a node failure that doesn't affect the user, because their pool isn't created on the failed node (assuming that everything else, such as the network and the client node, is still working).
- Verify that IOR finishes successfully.
- Verify that the container Health is HEALTHY.
- To further verify that the pool isn’t affected, create a new container on the pool and run IOR.
- To make avocado happy, restart daos_servers on the node where the engines were killed.
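A minimal sketch of the fault-injection and cleanup steps above, assuming a systemd-managed daos_server and password-less ssh; the hostname is hypothetical and this is not the framework code used by the test.

import subprocess

VICTIM = "node-without-pool"  # hypothetical host; none of the pool's ranks live here


def ssh(host, command):
    """Run a command on a remote host."""
    subprocess.check_call(["ssh", host, command])


# Simulate the node failure: kill both engines on the victim node while IOR runs.
ssh(VICTIM, "sudo pkill --signal KILL daos_engine")

# ... IOR should still finish successfully, since the pool has no targets here ...

# Bring the node back so the end-of-test health checks pass.
ssh(VICTIM, "sudo systemctl restart daos_server")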
Skip-unit-tests: true
Test-tag: rank_failure
Signed-off-by: Makito Kano [email protected]
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/3/execution/node/362/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/5/execution/node/1050/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/6/execution/node/362/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/10/execution/node/103/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/11/execution/node/104/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/14/execution/node/104/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/17/execution/node/103/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/21/execution/node/380/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/22/execution/node/887/log
Bug-tracker data: Ticket title is 'On-Site Fault Management Test - Server Rank Failure' Status is 'Blocked' https://daosio.atlassian.net/browse/DAOS-10002
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8874/23/display/redirect
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/29/testReport/(root)/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/30/testReport/(root)/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/31/testReport/(root)/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-8874/33/testReport/(root)/
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/36/execution/node/674/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/37/execution/node/673/log
After reintegrating the first rank on line 212, rebuild stays stuck in the busy state even after waiting for 480 sec.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/38/execution/node/670/log
server_rank_failure.yaml needs to be fixed to follow the new format for the storage section:
server_config:
  name: daos_server
  crt_timeout: 60
  engines_per_host: 2
  engines:
    0:
      pinned_numa_node: 0
      nr_xs_helpers: 1
      fabric_iface: ib0
      fabric_iface_port: 31317
      log_file: daos_server0.log
      log_mask: DEBUG
      storage:
        0:
          class: dcpm
          scm_list: ["/dev/pmem0"]
          scm_mount: /mnt/daos0
        1:
          class: nvme
          bdev_list: ["0000:00:00.0"]
      targets: 16
      env_vars:
        - SWIM_PROTOCOL_PERIOD_LEN=2000
        - SWIM_SUSPECT_TIMEOUT=19000
        - SWIM_PING_TIMEOUT=1900
        - DD_MASK=io,rebuild
    1:
      pinned_numa_node: 1
      nr_xs_helpers: 1
      fabric_iface: ib1
      fabric_iface_port: 31417
      log_file: daos_server1.log
      log_mask: DEBUG
      storage:
        0:
          class: dcpm
          scm_list: ["/dev/pmem1"]
          scm_mount: /mnt/daos1
        1:
          class: nvme
          bdev_list: ["0000:00:00.1"]
      targets: 16
      env_vars:
        - SWIM_PROTOCOL_PERIOD_LEN=2000
        - SWIM_SUSPECT_TIMEOUT=19000
        - SWIM_PING_TIMEOUT=1900
        - DD_MASK=io,rebuild
The issue with rebuild getting stuck with busy might be fixed when the rebuild detection method is updated with https://github.com/daos-stack/daos/pull/10829
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/39/execution/node/670/log
Test still fails while waiting for rebuild: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-8874/39/tests
I'm not sure if #10829 will help, since dmg pool query returns busy, and #10829 is more about interpreting version+state correctly. So it seems rebuild is actually taking a long time. I'll try with an increased timeout.
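For reference, an increased timeout would look roughly like the polling loop below. This is only a sketch: the string match on the Rebuild line is an assumption about dmg's text output and is not the detection logic from #10829, and 960 s is just an example value.

import subprocess
import time


def wait_for_rebuild(pool, timeout=960, interval=10):
    """Poll `dmg pool query` until the rebuild line no longer reports busy."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(["dmg", "pool", "query", pool],
                                      universal_newlines=True)
        rebuild = [line for line in out.splitlines() if "Rebuild" in line]
        if rebuild and "busy" not in rebuild[0]:
            return True
        time.sleep(interval)
    return False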
Interestingly, test_server_rank_failure_with_rp passed 6 times, but test_server_rank_failure_with_ec failed.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/40/execution/node/670/log
The latest runs timed out with issues that I think #10829 will resolve: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8874/40/testReport/
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/41/execution/node/678/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/42/execution/node/708/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8874/43/execution/node/708/log
The test hangs for 30 min. at daos container get-prop. Both repeated runs show the same issue.
21:36:05 INFO | Running '/usr/bin/daos -j container get-prop F5E59CE1-0297-4A2C-9698-5A22B99F7713 FEEED6BD-D772-42C0-B67E-BDC3BB62BDE4 --properties=status'
21:36:05 DEBUG| [stdout] {
21:36:05 DEBUG| [stdout] "response": [
21:36:05 DEBUG| [stdout] {
21:36:05 DEBUG| [stdout] "value": "HEALTHY",
21:36:05 DEBUG| [stdout] "name": "status",
21:36:05 DEBUG| [stdout] "description": "Health"
21:36:05 DEBUG| [stdout] }
21:36:05 DEBUG| [stdout] ],
21:36:05 DEBUG| [stdout] "error": null,
21:36:05 DEBUG| [stdout] "status": 0
21:36:05 DEBUG| [stdout] }
22:04:44 ERROR|
22:04:44 ERROR| Reproduced traceback from: /usr/lib/python3.6/site-packages/avocado/core/test.py:767
22:04:44 ERROR| Traceback (most recent call last):
Rerun the test and see if it's consistent.
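One defensive option while rerunning: bound the get-prop call with a hard timeout so a hang like the one above fails fast instead of stalling for 30 minutes. This is only a sketch; the 120 s value is arbitrary.

import subprocess


def get_container_status(pool, cont, timeout=120):
    """Query the container status property, failing fast if the CLI hangs."""
    cmd = ["daos", "-j", "container", "get-prop", pool, cont, "--properties=status"]
    try:
        return subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                              universal_newlines=True, timeout=timeout).stdout
    except subprocess.TimeoutExpired:
        raise RuntimeError("daos container get-prop hung for {}s".format(timeout))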