
[BUG] tests related to ha_salvage_test failed & flaky

Open shuo-wu opened this issue 2 years ago • 7 comments

Describe the bug

test_ha_salvage and test_ha_salvage_with_backing_image may fail intermittently in the nightly test results.

To Reproduce

Run the tests

Expected behavior

The tests should pass

Log or Support bundle

client = <longhorn.Client object at 0x7fb7ea9ace20>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fb7ea1539a0>
volume_name = 'longhorn-testvol-orm48c', disable_auto_salvage = None

    @pytest.mark.coretest   # NOQA
    def test_ha_salvage(client, core_api, volume_name, disable_auto_salvage):  # NOQA
        """
        [HA] Test salvage when volume faulted
        TODO
        The test cases should cover the following four cases:
        1. Manual salvage with revision counter enabled.
        2. Manual salvage with revision counter disabled.
        3. Auto salvage with revision counter enabled.
        4. Auto salvage with revision counter enabled.
    
        Setting: Disable auto salvage
    
        Case 1: Delete all replica processes using instance manager
    
        1. Create volume and attach to the current node
        2. Write `data` to the volume.
        3. Crash all the replicas using Instance Manager API
            1. Cannot do it using Longhorn API since a. it will delete data, b. the
        last replica is not allowed to be deleted
        4. Make sure volume detached automatically and changed into `faulted` state
        5. Make sure both replicas reports `failedAt` timestamp.
        6. Salvage the volume
        7. Verify that volume is in `detached` `unknown` state. No longer `faulted`
        8. Verify that all the replicas' `failedAt` timestamp cleaned.
        9. Attach the volume and check `data`
    
        Case 2: Crash all replica processes
    
        Same steps as Case 1 except on step 3, use SIGTERM to crash the processes
    
        Setting: Enabled auto salvage.
    
        Case 3: Revision counter disabled.
    
        1. Set 'Automatic salvage' to true.
        2. Set 'Disable Revision Counter' to true.
        3. Create a volume with 3 replicas.
        4. Attach the volume to a node and write some data to it and save the
        checksum.
        5. Delete all replica processes using instance manager or
        crash all replica processes using SIGTERM.
        6. Wait for volume to `faulted`, then `healthy`.
        7. Verify all 3 replicas are reused successfully.
        8. Check the data in the volume and make sure it's the same as the
        checksum saved on step 5.
    
        Case 4: Revision counter enabled.
    
        1. Set 'Automatic salvage' to true.
        2. Set 'Disable Revision Counter' to false.
        4. Create a volume with 3 replicas.
        5. Attach the volume to a node and write some data to it and save the
        checksum.
        6. Delete all replica processes using instance manager or
        crash all replica processes using SIGTERM.
        7. Wait for volume to `faulted`, then `healthy`.
        8. Verify there are 3 replicas, they are all from previous replicas.
        9. Check the data in the volume and make sure it's the same as the
        checksum saved on step 5.
    
        """
>       ha_salvage_test(client, core_api, volume_name)

test_ha.py:217: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_ha.py:290: in ha_salvage_test
    volume.salvage(names=[replica0_name, replica1_name])
longhorn.py:262: in cb
    return self.action(_result, _link_name,
longhorn.py:457: in action
    return self._post_and_retry(url, *args, **kw)
longhorn.py:415: in _post_and_retry
    raise e
longhorn.py:409: in _post_and_retry
    return self._post(url, data=self._to_dict(*args, **kw))
longhorn.py:74: in wrapped
    return fn(*args, **kw)
longhorn.py:303: in _post
    self._error(r.text, r.status_code)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <longhorn.Client object at 0x7fb7ea9ace20>
text = '{"actions":{},"code":"Server Error","detail":"","links":{"self":"http://10.42.2.5:9500/v1/volumes/longhorn-testvol-or... to salvage volume longhorn-testvol-orm48c: invalid volume state to salvage: detaching","status":500,"type":"error"}\n'
status_code = 500

    def _error(self, text, status_code):
>       raise ApiError(self._unmarshall(text), status_code)
E       longhorn.ApiError: (ApiError(...), "500 : unable to salvage volume longhorn-testvol-orm48c: invalid volume state to salvage: detaching\n{'code': 500, 'detail': '', 'message': 'unable to salvage volume longhorn-testvol-orm48c: invalid volume state to salvage: detaching', 'status': 500}")

longhorn.py:283: ApiError

Environment

  • Longhorn version: v1.3.0-rc2

shuo-wu avatar Aug 09 '22 03:08 shuo-wu

cc @longhorn/qa

innobead avatar Aug 09 '22 04:08 innobead

Pre Ready-For-Testing Checklist

  • [x] If labeled: require/automation-e2e — Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only the test case skeleton exists without an implementation, has an implementation issue (including backport-needed/*) been created?
    The automation test case PR is at https://github.com/longhorn/longhorn-tests/pull/1064

longhorn-io-github-bot avatar Aug 09 '22 04:08 longhorn-io-github-bot

Not sure if it's related, but https://github.com/longhorn/longhorn/issues/4383 also failed with volume status = detaching (expected detached).

chriscchien avatar Aug 09 '22 04:08 chriscchien

When all replicas are crashed, the volume is detached and marked faulted first. Sending the salvage request while it is still detaching leads to this failure.
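
For illustration only, a minimal sketch of the idea above rather than the actual longhorn-tests fix: poll the volume through the Python client until it has settled into the `detached`/`faulted` state before issuing the salvage request. The helper name and the retry constants are assumptions; `client.by_id_volume()` and `volume.salvage(names=...)` are the calls already visible in the traceback.

    import time

    # Sketch only: the helper name and retry constants are hypothetical and not
    # part of longhorn-tests.
    RETRY_COUNTS = 180
    RETRY_INTERVAL = 1

    def salvage_after_faulted(client, volume_name, replica_names):
        for _ in range(RETRY_COUNTS):
            volume = client.by_id_volume(volume_name)
            # Salvage is rejected while the volume is still "detaching", so wait
            # until it has settled into the detached + faulted state.
            if volume.state == "detached" and volume.robustness == "faulted":
                volume.salvage(names=replica_names)
                return client.by_id_volume(volume_name)
            time.sleep(RETRY_INTERVAL)
        raise AssertionError(
            "volume %s never reached the detached/faulted state" % volume_name)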

I will check https://github.com/longhorn/longhorn/issues/4383 later.

shuo-wu avatar Aug 09 '22 04:08 shuo-wu

The test passed in the v1.2.x and v1.3.x pipelines, but it failed on the master branch. We need more time to wait for the daily build results.

chriscchien avatar Aug 11 '22 06:08 chriscchien

> The test passed in the v1.2.x and v1.3.x pipelines, but it failed on the master branch. We need more time to wait for the daily build results.

It still seems flaky. It needs further investigation by QA. cc @longhorn/qa

innobead avatar Aug 11 '22 12:08 innobead

Pulling back to implementation first.

innobead avatar Aug 11 '22 13:08 innobead

I can reproduce the v1.3.x failure with master-head locally. It failed at check_volume_data, as shown below:

    crash_replica_processes(client, core_api, volume_name)

    volume = common.wait_for_volume_healthy(client, volume_name)
    assert len(volume.replicas) == 3

    for replica in volume.replicas:
        assert replica.name in orig_replica_names

    check_volume_data(volume, data)

From the UI behavior, after `crash_replica_processes` the volume status changes from healthy -> detaching -> faulted (does not appear every time) -> attaching -> attached -> healthy.

All failures occurred because the initial healthy -> detaching transition was a little slow, so `volume = common.wait_for_volume_healthy(client, volume_name)` caught the still-healthy status right after the replica processes were crashed, and `check_volume_data` then failed with FileNotFoundError.
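
A minimal sketch of one way to close that window (the actual change is in the longhorn-tests PR referenced below and may differ): wait for the volume to report `faulted` robustness before waiting for it to become healthy again. The helper name and retry constants are assumptions.

    import time

    RETRY_COUNTS = 180      # assumed retry budget, mirroring the longhorn-tests style
    RETRY_INTERVAL = 1

    def wait_for_volume_faulted(client, volume_name):
        # Hypothetical helper: block until the volume reports faulted robustness,
        # so the later wait_for_volume_healthy() cannot observe the stale,
        # still-healthy state from before the crash.
        for _ in range(RETRY_COUNTS):
            volume = client.by_id_volume(volume_name)
            if volume.robustness == "faulted":
                return volume
            time.sleep(RETRY_INTERVAL)
        raise AssertionError("volume %s never became faulted" % volume_name)

    # The test flow would then be ordered as:
    #   crash_replica_processes(client, core_api, volume_name)
    #   wait_for_volume_faulted(client, volume_name)
    #   volume = common.wait_for_volume_healthy(client, volume_name)
    #   check_volume_data(volume, data)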

chriscchien avatar Aug 12 '22 03:08 chriscchien

> crash_replica_processes(client, core_api, volume_name)

So we need to ensure the volume becomes `faulted` first before the following operations.

> volume = common.wait_for_volume_healthy(client, volume_name)
> assert len(volume.replicas) == 3
>
> for replica in volume.replicas:
>     assert replica.name in orig_replica_names
>
> check_volume_data(volume, data)
>
> From the UI behavior, after `crash_replica_processes` the volume status changes from healthy -> detaching -> faulted (does not appear every time) -> attaching -> attached -> healthy.
>
> All failures occurred because the initial healthy -> detaching transition was a little slow, so `volume = common.wait_for_volume_healthy(client, volume_name)` caught the still-healthy status right after the replica processes were crashed, and `check_volume_data` then failed with FileNotFoundError.

Thanks, @cchien816 for the investigation. cc @shuo-wu @longhorn/qa

innobead avatar Aug 12 '22 03:08 innobead

Removing the state from the ZenHub board, because this is tracked on the QA project board instead.

innobead avatar Aug 19 '22 08:08 innobead

Closing this ticket: after PR https://github.com/longhorn/longhorn-tests/pull/1075 was merged, the test case passed in recent builds 225, 226, and 227.

The test also passed in v1.3.x build 116 and v1.2.x build 200 after the PR was backported.

chriscchien avatar Aug 24 '22 05:08 chriscchien