Running family displayed without any tasks
I've now seen a case where a workflow thinks it has a running task, but none are actually running (nothing is submitted either, and the workflow is not expected to run anything again for another 6 hours).
Originally posted by @ColemanTom in #1999
GraphQL query:
{
  workflows(ids: ["access_g4_pp_grp11"]) {
    taskProxies(ids: "//20250310T1200Z/*") {
      id
      state
    }
    familyProxies(ids: "//20250310T1200Z/*") {
      id
      state
    }
    jobs(ids: "//20250310T1200Z/*") {
      id
      state
    }
  }
}
Response:
{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_002",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/archive_log",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_remote",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_local",
            "state": "succeeded"
          }
        ],
        "familyProxies": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/root",
            "state": "running"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_002",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_002",
            "state": "running"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HOUSEKEEP",
            "state": "succeeded"
          }
        ],
        "jobs": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_005/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_002/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_001/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_007/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_004/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_009/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_006/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_003/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_008/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/archive_log/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_remote/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_local/01",
            "state": "succeeded"
          }
        ]
      }
    ]
  }
}
@ColemanTom what tasks are under the RUN_ARCHIVE_002 family? And does this go away if you refresh the browser?
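For reference, one way to pull the scheduler's live view of that family's members is sketched below. This is rough and uses cylc-flow internals, not a supported API: the WorkflowRuntimeClient class and its "graphql" endpoint are assumed to behave as in cylc-flow 8.3, and the childTasks field on familyProxies is assumed to be available in this schema version.

# Rough sketch only: WorkflowRuntimeClient and the scheduler's "graphql"
# endpoint are cylc-flow internals and may differ between versions; the
# childTasks field on familyProxies is assumed to exist in this schema.
import asyncio
from cylc.flow.network.client import WorkflowRuntimeClient

QUERY = '''
{
  workflows(ids: ["access_g4_pp_grp11"]) {
    familyProxies(ids: "//20250310T1200Z/RUN_ARCHIVE_002") {
      id
      state
      childTasks {
        id
        state
      }
    }
  }
}
'''

async def main():
    # Connects to the running scheduler via its contact file.
    client = WorkflowRuntimeClient('access_g4_pp_grp11')
    response = await client.async_request(
        'graphql', {'request_string': QUERY}
    )
    print(response)

asyncio.run(main())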
I should mention: UI 2.5.0, I think, and cylc-flow 8.3.6. This is the first time I've seen it, and it's been months of running many workflows. I'm not able to update versions as we are essentially in a freeze in preparation for a release to operations.
@ColemanTom what tasks are under the RUN_ARCHIVE_002 family? And does this go away if you refresh the browser?
I don't have a complete list, but at a guess, 100+. No, refreshing does not change anything.
It's possible this is already fixed on 8.4.x, or it might be fixed by https://github.com/cylc/cylc-flow/pull/6589
How does GraphQL get its information? When I look at the DB, I can't see any mention of running tasks in the task_states table. I'm guessing it is something stored internally in the running workflow process on the VM?
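For cross-checking, the public workflow database can be queried directly, e.g. with something along these lines (a minimal sketch, assuming the cylc-flow 8 layout: public DB at ~/cylc-run/<workflow-id>/log/db and a task_states table with cycle, name and status columns):

# Minimal sketch: list any tasks the workflow DB still records as active.
# Assumes the cylc-flow 8 public DB location and task_states schema.
import sqlite3
from pathlib import Path

WORKFLOW_ID = "access_g4_pp_grp11"
DB_PATH = Path.home() / "cylc-run" / WORKFLOW_ID / "log" / "db"

conn = sqlite3.connect(DB_PATH)
rows = conn.execute(
    "SELECT cycle, name, status FROM task_states "
    "WHERE status IN ('preparing', 'submitted', 'running') "
    "ORDER BY cycle, name"
).fetchall()
conn.close()

if not rows:
    print("No active tasks recorded in the database")
for cycle, name, status in rows:
    print(f"{cycle}/{name}: {status}")

In the case reported here, a check like this shows nothing active in the DB, while the GraphQL response above still reports running families.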
It's probably due to a bug in constructing the datastore that feeds the UI, and probably already fixed as noted above. At least, I think a suspiciously similar problem was fixed.
Have you tried stopping and restarting the workflow, to force reconstruction of the datastore?
Not intentionally, but a colleague did happen to stop and update it today, and there is no permanent running task.
Sorry, does that mean the problem remains, after the restart?
After stop/play, the problem is not present.
I am always hesitant to stop/play things in case there is debug information you want extracted. I do have the DB, scheduler logs and job logs saved for reference purposes, but it sounds like they wouldn't be that useful if it is a transient data store in memory.
Yes, I don't think any of the routine stored info would help much to debug this (unless perhaps it happened as a result of a series of logged interventions). Good to know that the restart fixed it - that pretty much confirms it's the datastore (perhaps with a small chance of it being a bug in how the UI applies the data feed).
There's probably not much we can do unless we have a reproducible case to examine, and then the first thing would be to run it with the latest Cylc code to check if it's fixed already.
I think it's probably the same thing we've been seeing, which Ronnie fixed (well, half of it). We should get this one in too: https://github.com/cylc/cylc-flow/pull/6589
(I've also seen this bug regularly in NIWA operations; I will upgrade to the next version that includes this fix.)