[BUG] Tests related to ha_salvage_test are failing and flaky
Describe the bug
test_ha_salvage and test_ha_salvage_with_backing_image may fail intermittently in the nightly test results.
To Reproduce
Run the tests
Expected behavior
The tests should pass
Log or Support bundle
client = <longhorn.Client object at 0x7fb7ea9ace20>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fb7ea1539a0>
volume_name = 'longhorn-testvol-orm48c', disable_auto_salvage = None
@pytest.mark.coretest # NOQA
def test_ha_salvage(client, core_api, volume_name, disable_auto_salvage): # NOQA
"""
[HA] Test salvage when volume faulted
TODO
The test cases should cover the following four cases:
1. Manual salvage with revision counter enabled.
2. Manual salvage with revision counter disabled.
3. Auto salvage with revision counter enabled.
4. Auto salvage with revision counter disabled.
Setting: Disable auto salvage
Case 1: Delete all replica processes using instance manager
1. Create volume and attach to the current node
2. Write `data` to the volume.
3. Crash all the replicas using Instance Manager API
1. Cannot do it using Longhorn API since a. it will delete data, b. the
last replica is not allowed to be deleted
4. Make sure the volume is detached automatically and changed into the `faulted` state
5. Make sure both replicas report a `failedAt` timestamp.
6. Salvage the volume
7. Verify that the volume is in the `detached` `unknown` state, no longer `faulted`
8. Verify that all the replicas' `failedAt` timestamps are cleaned.
9. Attach the volume and check `data`
Case 2: Crash all replica processes
Same steps as Case 1 except on step 3, use SIGTERM to crash the processes
Setting: Enable auto salvage.
Case 3: Revision counter disabled.
1. Set 'Automatic salvage' to true.
2. Set 'Disable Revision Counter' to true.
3. Create a volume with 3 replicas.
4. Attach the volume to a node and write some data to it and save the
checksum.
5. Delete all replica processes using instance manager or
crash all replica processes using SIGTERM.
6. Wait for the volume to become `faulted`, then `healthy`.
7. Verify all 3 replicas are reused successfully.
8. Check the data in the volume and make sure it's the same as the
checksum saved in step 4.
Case 4: Revision counter enabled.
1. Set 'Automatic salvage' to true.
2. Set 'Disable Revision Counter' to false.
3. Create a volume with 3 replicas.
4. Attach the volume to a node, write some data to it, and save the
checksum.
5. Delete all replica processes using instance manager or
crash all replica processes using SIGTERM.
6. Wait for the volume to become `faulted`, then `healthy`.
7. Verify there are 3 replicas and they are all from the previous replicas.
8. Check the data in the volume and make sure it's the same as the
checksum saved in step 4.
"""
> ha_salvage_test(client, core_api, volume_name)
test_ha.py:217:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test_ha.py:290: in ha_salvage_test
volume.salvage(names=[replica0_name, replica1_name])
longhorn.py:262: in cb
return self.action(_result, _link_name,
longhorn.py:457: in action
return self._post_and_retry(url, *args, **kw)
longhorn.py:415: in _post_and_retry
raise e
longhorn.py:409: in _post_and_retry
return self._post(url, data=self._to_dict(*args, **kw))
longhorn.py:74: in wrapped
return fn(*args, **kw)
longhorn.py:303: in _post
self._error(r.text, r.status_code)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <longhorn.Client object at 0x7fb7ea9ace20>
text = '{"actions":{},"code":"Server Error","detail":"","links":{"self":"http://10.42.2.5:9500/v1/volumes/longhorn-testvol-or... to salvage volume longhorn-testvol-orm48c: invalid volume state to salvage: detaching","status":500,"type":"error"}\n'
status_code = 500
def _error(self, text, status_code):
> raise ApiError(self._unmarshall(text), status_code)
E longhorn.ApiError: (ApiError(...), "500 : unable to salvage volume longhorn-testvol-orm48c: invalid volume state to salvage: detaching\n{'code': 500, 'detail': '', 'message': 'unable to salvage volume longhorn-testvol-orm48c: invalid volume state to salvage: detaching', 'status': 500}")
longhorn.py:283: ApiError
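For orientation, below is a condensed, hypothetical sketch of the Case 1 flow described in the docstring, built around the calls visible in the quoted test (`crash_replica_processes`, `volume.salvage`, `check_volume_data`); the `wait_for_volume_faulted`, `wait_for_volume_detached`, and `get_self_host_id` helper names are assumptions about common.py and may differ from the actual implementation.

```python
# Hedged sketch of the Case 1 manual-salvage path, not the actual test code.
# Helper names marked "assumed" below are assumptions about common.py.
def manual_salvage_sketch(client, core_api, volume_name, data):
    # Step 3: crash every replica process (instance manager API or SIGTERM).
    crash_replica_processes(client, core_api, volume_name)

    # Steps 4-5: the volume should detach on its own and turn faulted, with
    # every replica carrying a failedAt timestamp.
    volume = common.wait_for_volume_faulted(client, volume_name)  # assumed helper
    replica_names = [r.name for r in volume.replicas]

    # Step 6: only now is the salvage action accepted; issuing it while the
    # volume is still "detaching" produces the 500 shown in the traceback.
    volume.salvage(names=replica_names)

    # Steps 7-8: the volume goes back to detached/unknown and failedAt clears.
    volume = common.wait_for_volume_detached(client, volume_name)  # assumed helper
    assert all(r.failedAt == "" for r in volume.replicas)

    # Step 9: reattach and verify the data written before the crash.
    volume.attach(hostId=common.get_self_host_id())  # assumed helper
    volume = common.wait_for_volume_healthy(client, volume_name)
    check_volume_data(volume, data)
```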
Environment
- Longhorn version: v1.3.0-rc2
cc @longhorn/qa
Pre Ready-For-Testing Checklist
- [x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue (including `backport-needed/*`)? The automation test case PR is at https://github.com/longhorn/longhorn-tests/pull/1064
Not sure if they are related, but https://github.com/longhorn/longhorn/issues/4383 also failed at volume status = detaching (expected detached).
When all replicas are crashed, the volume will be detached & faulted first. Sending a salvage request while the volume is still detaching leads to this failure.
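As an illustration of the race (not necessarily how the fix was implemented), the test could defer the salvage request until the volume has actually settled into the faulted state. The sketch below is hypothetical; it only relies on `client.by_id_volume`, the `state`/`robustness` fields of the Longhorn volume API, and the `volume.salvage` action shown in the traceback.

```python
import time

# Hypothetical guard: poll until the volume has finished detaching and is
# reported faulted, then send the salvage request. Calling salvage while the
# state is still "detaching" returns the 500 error quoted above.
def salvage_when_faulted(client, volume_name, replica_names,
                         timeout=300, interval=2):
    deadline = time.time() + timeout
    while time.time() < deadline:
        volume = client.by_id_volume(volume_name)
        if volume.state == "detached" and volume.robustness == "faulted":
            return volume.salvage(names=replica_names)
        time.sleep(interval)
    raise AssertionError(
        "volume %s never reached the faulted state" % volume_name)
```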
I will check https://github.com/longhorn/longhorn/issues/4383 later.
The test passed in the v1.2.x and v1.3.x pipelines, but it failed on the master branch; need more time to wait for the daily build results.
It still seems flaky. Needs further investigation by QA. cc @longhorn/qa
Pulling back to implementation first.
I can reproduce the 1.3.x failure with master-head locally; it failed at `check_volume_data` as shown below:
crash_replica_processes(client, core_api, volume_name)

volume = common.wait_for_volume_healthy(client, volume_name)
assert len(volume.replicas) == 3
for replica in volume.replicas:
    assert replica.name in orig_replica_names

check_volume_data(volume, data)
From the UI behavior, after `crash_replica_processes` the volume status changes from healthy -> detaching -> faulted (does not appear every time) -> attaching -> attached -> healthy.
All the failures happened because the first healthy -> detaching transition was a little slow, so `volume = common.wait_for_volume_healthy(client, volume_name)` caught the initial healthy status right after crashing the replica processes, and `check_volume_data` then failed with FileNotFoundError.
So we need to ensure the volume becomes faulted first before the following operations.
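A minimal sketch of that ordering, reusing the snippet above; `wait_for_volume_faulted` is assumed to exist in common.py (the exact helper name may differ):

```python
crash_replica_processes(client, core_api, volume_name)

# First make sure the crash has actually been observed
# (healthy -> detaching -> faulted), so the healthy wait below cannot match
# the stale pre-crash status. Helper name is an assumption.
common.wait_for_volume_faulted(client, volume_name)

# Then wait for auto-salvage to bring the volume back and verify the original
# replicas were reused before reading the data back.
volume = common.wait_for_volume_healthy(client, volume_name)
assert len(volume.replicas) == 3
for replica in volume.replicas:
    assert replica.name in orig_replica_names

check_volume_data(volume, data)
```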
Thanks, @cchien816 for the investigation. cc @shuo-wu @longhorn/qa
Removing the state from Zenhub board, because this is tracked in the QA project board instead.