cylc-ui Running family displayed without any tasks

I've now seen a case where a workflow thinks it has a running task, but none are actually running (none are submitted, etc. either; it's not expected to run again for another 6 hours).

Originally posted by @ColemanTom in #1999

GraphQL

Query:

{
  workflows(ids: ["access_g4_pp_grp11"]) {
    taskProxies(ids: "//20250310T1200Z/*") {
      id
      state
    }
    familyProxies(ids: "//20250310T1200Z/*") {
      id
      state
    }
    jobs(ids: "//20250310T1200Z/*") {
      id
      state
    }
  }
}

Response:

{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_002",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/archive_log",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_remote",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_local",
            "state": "succeeded"
          }
        ],
        "familyProxies": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/root",
            "state": "running"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_002",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_002",
            "state": "running"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HOUSEKEEP",
            "state": "succeeded"
          }
        ],
        "jobs": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_005/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_002/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_001/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_007/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_004/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_009/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_006/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_003/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_008/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/archive_log/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_remote/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_local/01",
            "state": "succeeded"
          }
        ]
      }
    ]
  }
}
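The inconsistency in the response above (families reported as running while every task proxy and job in the cycle has succeeded) can be checked mechanically. This is a minimal sketch, not part of Cylc: the function name `stale_active_families` and the `ACTIVE_STATES` set are hypothetical, and the sample is a trimmed copy of the response. Because the query did not request family membership, the check works at whole-cycle level only.

```python
# Hypothetical helper, not part of Cylc: flag families that report an active
# state when no task proxy or job in the same response is active.

# Assumption: these are the states that count as "active".
ACTIVE_STATES = {"preparing", "submitted", "running"}


def stale_active_families(response: dict) -> list[str]:
    """Return IDs of active families in workflows with no active task or job."""
    stale = []
    for workflow in response["data"]["workflows"]:
        leaves = workflow.get("taskProxies", []) + workflow.get("jobs", [])
        if any(node["state"] in ACTIVE_STATES for node in leaves):
            # Something really is active, so an active family is plausible.
            continue
        stale.extend(
            family["id"]
            for family in workflow.get("familyProxies", [])
            if family["state"] in ACTIVE_STATES
        )
    return stale


# Trimmed copy of the response above.
PREFIX = "~user/access_g4_pp_grp11//20250310T1200Z/"
sample = {
    "data": {
        "workflows": [
            {
                "taskProxies": [
                    {"id": PREFIX + "archive_log", "state": "succeeded"},
                ],
                "familyProxies": [
                    {"id": PREFIX + "root", "state": "running"},
                    {"id": PREFIX + "RUN_ARCHIVE_002", "state": "running"},
                    {"id": PREFIX + "HOUSEKEEP", "state": "succeeded"},
                ],
                "jobs": [
                    {"id": PREFIX + "archive_log/01", "state": "succeeded"},
                ],
            }
        ]
    }
}

for family_id in stale_active_families(sample):
    print(family_id)
```

On the trimmed sample this flags `root` and `RUN_ARCHIVE_002`, the same two families that report `running` in the full response.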

Mar 11 '25 10:03 MetRonnie

@ColemanTom what tasks are under the RUN_ARCHIVE_002 family? And does this go away if you refresh the browser?

Mar 11 '25 10:03 MetRonnie

I should mention: ui 2.5.0, I think, and cylc-flow 8.3.6. This is the first time I've seen it, after months of running many workflows. I'm not able to update versions as we are essentially in a freeze in preparation for a release to operations.

Mar 11 '25 10:03 ColemanTom

@ColemanTom what tasks are under the RUN_ARCHIVE_002 family? And does this go away if you refresh the browser?

I don't have a complete list but, at a guess, 100+. No, a refresh does not change anything.

Mar 11 '25 10:03 ColemanTom

It's possible this is already fixed in 8.4.x, or it might be fixed by https://github.com/cylc/cylc-flow/pull/6589

Mar 11 '25 10:03 MetRonnie

How does GraphQL get its information? When I look at the DB, I can't see any mention of running tasks in the task_states table. I'm guessing it is something stored internally in the running workflow process on the VM?

Mar 11 '25 21:03 ColemanTom
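The database check described above can be sketched as follows. This is a hedged illustration only: the `task_states` table name comes from the comment, but the column names `name`, `cycle` and `status` are assumptions about its schema, the function `running_tasks` is hypothetical, and an in-memory database stands in for the real workflow database so the example is self-contained.

```python
import sqlite3


def running_tasks(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return (cycle, name) pairs recorded as running in task_states."""
    rows = conn.execute(
        "SELECT cycle, name FROM task_states "
        "WHERE status = ? ORDER BY cycle, name",
        ("running",),
    )
    return [(cycle, name) for cycle, name in rows]


# In-memory stand-in for the workflow database.
# Assumption: the real table has at least these three columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_states (name TEXT, cycle TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO task_states VALUES (?, ?, ?)",
    [
        ("archive_log", "20250310T1200Z", "succeeded"),
        ("housekeep_local", "20250310T1200Z", "succeeded"),
        ("housekeep_remote", "20250310T1200Z", "succeeded"),
    ],
)

print(running_tasks(conn))
```

An empty result here alongside a `running` family in the GraphQL response would suggest the discrepancy lives in the scheduler's in-memory state rather than in the persisted task records. To try it on a real workflow, pass a connection to that workflow's database file instead of the in-memory one.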

It's probably due to a bug in constructing the datastore that feeds the UI, and probably already fixed as noted above. At least, I think a suspiciously similar problem was fixed.

Have you tried stopping and restarting the workflow, to force reconstruction of the datastore?

Mar 12 '25 04:03 hjoliver

Have you tried stopping and restarting the workflow, to force reconstruction of the datastore?

Not intentionally, but a colleague did happen to stop and update it today, and there is no longer a permanently running task.

Mar 12 '25 05:03 ColemanTom

Sorry, does that mean the problem remains, after the restart?

Mar 12 '25 05:03 hjoliver

Sorry, does that mean the problem remains, after the restart?

After stop/play, the problem is not present.

I am always hesitant to stop/play things in case there is debug information you want extracted. I do have the DB, scheduler logs and job logs saved for reference purposes, but it sounds like they wouldn't be that useful if it is a transient data store in memory.

Mar 12 '25 05:03 ColemanTom

Yes, I don't think any of the routine stored info would help much to debug this (unless perhaps it happened as a result of a series of logged interventions). Good to know that the restart fixed it - that pretty much confirms it's the datastore (perhaps with a small chance of it being a bug in how the UI applies the data feed).

There's probably not much we can do unless we have a reproducible case to examine, and then the first thing would be to run it with the latest Cylc code to check if it's fixed already.

Mar 12 '25 05:03 hjoliver

I think it's probably the same thing we've been seeing, which Ronnie fixed (well, half of it). We should get this one in too: https://github.com/cylc/cylc-flow/pull/6589

(I've also seen this bug regularly in NIWA operations, which I will upgrade to the next version that includes this fix)

Mar 14 '25 01:03 dwsutherland