tests/integration: data store mgr tests unstable
The data store mgr tests, or a subset of them, sometimes fail and sometimes pass.
An example traceback:
def test_update_data_structure(harness):
    """Test update_data_structure. This method will generate and
    apply deltas/updates given."""
    schd, data = harness
    w_id = schd.data_store_mgr.workflow_id
    schd.data_store_mgr.data[w_id] = data
    assert TASK_STATUS_FAILED not in set(collect_states(data, TASK_PROXIES))
    assert TASK_STATUS_FAILED not in set(collect_states(data, FAMILY_PROXIES))
    assert TASK_STATUS_FAILED not in data[WORKFLOW].state_totals
>   assert len({t.is_held for t in data[TASK_PROXIES].values()}) == 2
E   assert 0 == 2
E     +0
E     -2
The test file:
tests/integration/test_data_store_mgr.py
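The 0 on the left of that comparison is what you would get if the data store held no task proxies at all, since a set comprehension over an empty mapping is empty. A plain-Python illustration (no cylc-flow imports; task_proxies here is just a stand-in for data[TASK_PROXIES]):

from types import SimpleNamespace

# Stand-in for data[TASK_PROXIES]: proxy IDs mapped to objects with an
# is_held attribute. With one held and one unheld proxy, the set of
# distinct is_held values has length 2, which is what the test expects.
task_proxies = {
    'foo': SimpleNamespace(is_held=True),
    'bar': SimpleNamespace(is_held=False),
}
assert len({t.is_held for t in task_proxies.values()}) == 2

# If the harness yields a store with no task proxies at all, the same
# expression evaluates over an empty set, reproducing "assert 0 == 2".
task_proxies = {}
assert len({t.is_held for t in task_proxies.values()}) == 0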
I've run it twenty times locally with no failure... I will see if there's a more robust way to test the update.
I'd have to understand how the test harness works at a deeper level to know why it's inconsistent...
i.e. why does this:
@pytest.mark.asyncio
@pytest.fixture(scope='module')
async def harness(mod_flow, mod_scheduler, mod_run):
    flow_def = {
        'scheduler': {
            'allow implicit tasks': True
        },
        'scheduling': {
            'graph': {
                'R1': 'foo => bar'
            }
        }
    }
    reg: str = mod_flow(flow_def)
    schd: 'Scheduler' = mod_scheduler(reg)
    async with mod_run(schd):
        schd.pool.hold_tasks('*')
        schd.resume_workflow()
        # Think this is needed to save the data state at first start (?)
        # Fails without it... and a test needs to overwrite schd data with this.
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data
... sometimes yield a scheduler with no task proxies? Did the tasks not exist when the hold command was issued?
Hard to tell unless I can reproduce the issue somehow... (the test in question still worked when I commented out the hold).
I think the same harness yield gets shared amongst the tests. Whenever I change the order of the tests, it appears to affect what data the next tests are working with.
So perhaps create another harness object (if possible) and see if it occurs again... Again, easier if I could reproduce.
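For what it's worth, a per-test harness might look like the sketch below. It assumes the function-scoped counterparts of the mod_* fixtures (flow, scheduler, run) are available in the integration conftest, which is an assumption on my part; the obvious cost is that every test starts its own scheduler:

import pytest


@pytest.fixture
async def fresh_harness(flow, scheduler, run):
    """A fresh scheduler and data store per test, rather than per module."""
    reg = flow({
        'scheduler': {'allow implicit tasks': True},
        'scheduling': {'graph': {'R1': 'foo => bar'}},
    })
    schd = scheduler(reg)
    async with run(schd):
        schd.pool.hold_tasks('*')
        schd.resume_workflow()
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data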
On a side note, this also works:
(flow) sutherlander@cortex-vbox:cylc-flow$ git diff
diff --git a/tests/integration/test_data_store_mgr.py b/tests/integration/test_data_store_mgr.py
index 2d6d584cc..533c30a60 100644
--- a/tests/integration/test_data_store_mgr.py
+++ b/tests/integration/test_data_store_mgr.py
@@ -259,8 +259,8 @@ def test_update_data_structure(harness):
     schd.data_store_mgr.data[w_id] = data
     assert TASK_STATUS_FAILED not in set(collect_states(data, TASK_PROXIES))
     assert TASK_STATUS_FAILED not in set(collect_states(data, FAMILY_PROXIES))
-    assert TASK_STATUS_FAILED not in data[WORKFLOW].state_totals
-    assert len({t.is_held for t in data[TASK_PROXIES].values()}) == 2
+    assert data[WORKFLOW].state_totals[TASK_STATUS_FAILED] == 0
+    assert sum(data[WORKFLOW].state_totals.values()) == 2
     for itask in schd.pool.get_all_tasks():
         itask.state.reset(TASK_STATUS_FAILED)
         schd.data_store_mgr.delta_task_state(itask)
So I don't think the key lookup (TASK_STATUS_FAILED not in state_totals) is doing what the test intended...
I made this change because the data store manager now resets the state totals for families being updated:
# Use all states to clean up pruned counts
for state in TASK_STATUSES_ORDERED:
    fp_delta.state_totals[state] = state_counter.get(state, 0)
fp_updated.setdefault(fp_id, PbFamilyProxy()).MergeFrom(fp_delta)
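If the workflow's state_totals gets the same treatment, every known state ends up with an explicit entry even when its count is zero, so a key-membership check no longer tells you anything and you have to look at the count instead, which is what the diff above does. A plain-dict illustration (the real field is a protobuf map, but key membership behaves the same way once a key has been assigned):

# Before: only states with a non-zero count were present, so
# `state in totals` doubled as a "count > 0" check.
totals = {'waiting': 1, 'running': 1}
assert 'failed' not in totals                    # passes

# After: every known state is written explicitly, zero or not, so the
# membership check always succeeds and the old assertion breaks.
for state in ('waiting', 'running', 'failed'):   # stand-in for TASK_STATUSES_ORDERED
    totals[state] = totals.get(state, 0)
assert 'failed' in totals                        # the key now exists...
assert totals['failed'] == 0                     # ...but the count is still zero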
> Sometimes yield the scheduler with no task proxies?
Ah, perhaps we have to explicitly get the pool to release tasks, otherwise there is a race condition between the async with and the async def main_loop?
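That race is easy to picture in plain asyncio, with no cylc-flow involved: once the main loop runs as a background task, the fixture/test body interleaves with it at every await, so whether the body sees the store populated depends entirely on how far the loop has got by the time the body reads it. A minimal sketch of the pattern:

import asyncio
from contextlib import asynccontextmanager, suppress

store = {'task_proxies': {}}


async def main_loop():
    # Stand-in for the scheduler main loop populating the data store.
    await asyncio.sleep(0)
    store['task_proxies'] = {'foo': True, 'bar': True}
    while True:
        await asyncio.sleep(0.01)


@asynccontextmanager
async def run():
    task = asyncio.create_task(main_loop())
    try:
        await asyncio.sleep(0)  # start-up awaits give the loop a chance to run
        yield
    finally:
        task.cancel()
        with suppress(asyncio.CancelledError):
            await task


async def test_body():
    async with run():
        # Whether this sees {} or the populated proxies depends on how many
        # times control has returned to the event loop before the read;
        # add or remove awaits above and the answer changes.
        print(store['task_proxies'])


asyncio.run(test_body())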
I assume this is still an issue?
I still run into this one, it seems to happen more often with a higher level of parallelism in the tests. Still have no idea why it's happening.
I was playing with a small change to the integration tests which caused these tests to break badly, so I took a look in.
I think the problem is:
1. The main loop is running while the tests are running, which means the scheduler can do things in between tests, or not, depending on how asyncio schedules its tasks; so the data store may or may not be updated between tests. This would explain why I see these failures a lot, whereas we don't hit them much on GHA (because I parallelise the hell out of the tests!).
2. The tests all share the same workflow, but they also modify it and don't undo their changes, which means the tests can interact (see the sketch below).
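A tiny, cylc-free illustration of (2): with a module-scoped fixture, whatever one test does to the shared object is still there for the next test, so the outcome depends on test order, which matches the earlier observation that reordering the tests changes what data they see:

import pytest


@pytest.fixture(scope='module')
def shared_state():
    # Created once per module, like the shared harness/workflow.
    return {'held': {'foo', 'bar'}}


def test_mutates(shared_state):
    shared_state['held'].clear()    # the "undo" never happens


def test_sees_mutation(shared_state):
    # Passes or fails depending on whether test_mutates ran first,
    # i.e. on test ordering: the same coupling the data store tests have.
    assert shared_state['held'] == {'foo', 'bar'}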
I've addressed (1) on my branch by adding a new fixture that lets us start a scheduler without running the main loop, to give these tests a fighting chance.
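Roughly the shape of that fixture, as a sketch only: the real one lives in the integration conftest and will differ, and start_no_main_loop here is a hypothetical name for whatever starts the scheduler's services without ever scheduling the main loop:

import pytest


@pytest.fixture(scope='module')
async def harness(mod_flow, mod_scheduler, start_no_main_loop):
    reg = mod_flow({
        'scheduler': {'allow implicit tasks': True},
        'scheduling': {'graph': {'R1': 'foo => bar'}},
    })
    schd = mod_scheduler(reg)
    async with start_no_main_loop(schd):  # hypothetical fixture name
        # With no main loop running, nothing mutates the data store behind
        # the tests' backs; they only see the deltas they apply themselves.
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data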
I've fixed the resulting failures bar one, which I don't understand in test_update_data_structure:
# Shows pruning worked
assert len({t.is_held for t in data[TASK_PROXIES].values()}) == 1
The test resets a task state to failed, then expects the number of distinct is_held values to change as a result? Commented out for now.
Done (1) in #4620. It may have helped; however, with high parallelism I still get flaky failures ~50% of the time for:
- test_delta_task_held
- test_update_data_structure
Well, I'm not sure what to do about (2) other than have the tests operate on different workflows, as it's not something that would happen in the wild...
Also, I'm not sure the tests could "undo" their changes: a) wouldn't parallelism still cause issues while undoing? b) the cheapest way to undo would be to reload/reinitialise the data store, assuming other tests aren't running at the time.
Is this still a problem? It seems fairly stable now.