Running family displayed without any tasks

Open MetRonnie opened this issue 9 months ago • 11 comments

I've now seen a case where a workflow thinks it has a running task, but none are actually running (none are submitted, etc., either; it's not expected to run again for another 6 hours).

Image

Originally posted by @ColemanTom in #1999

GraphQL

Query:

{
  workflows(ids: ["access_g4_pp_grp11"]) {
    taskProxies(ids: "//20250310T1200Z/*") {
      id
      state
    }
    familyProxies(ids: "//20250310T1200Z/*") {
      id
      state
    }
    jobs(ids: "//20250310T1200Z/*") {
      id
      state
    }
  }
}

Response:

{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_002",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/archive_log",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_remote",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_local",
            "state": "succeeded"
          }
        ],
        "familyProxies": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_005",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/root",
            "state": "running"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_004",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_001",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_007",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_009",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_006",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_002",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_002",
            "state": "running"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_008",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HHH_000_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/RUN_ARCHIVE_003",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/HOUSEKEEP",
            "state": "succeeded"
          }
        ],
        "jobs": [
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_005/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_002/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_001/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_007/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_004/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_009/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_006/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_003/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/wait_000_008/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/archive_log/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_remote/01",
            "state": "succeeded"
          },
          {
            "id": "~user/access_g4_pp_grp11//20250310T1200Z/housekeep_local/01",
            "state": "succeeded"
          }
        ]
      }
    ]
  }
}
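The mismatch in the response above (two families reported as `running` while every task and job in the cycle is `succeeded`) can be checked mechanically. The following is a minimal sketch, not part of Cylc: the function name `find_stale_families` and the set `ACTIVE_STATES` are hypothetical, and it assumes the response has the same shape as the one shown here. It only flags families that are active when no task in the same result is active, so it cannot say which tasks a family contains.

```python
# States treated as "active" in this sketch (an assumption made for illustration).
ACTIVE_STATES = {"preparing", "submitted", "running"}


def find_stale_families(response: dict) -> list[str]:
    """Return IDs of families reported active while no returned task is active."""
    stale = []
    for workflow in response["data"]["workflows"]:
        tasks = workflow.get("taskProxies", [])
        if any(task["state"] in ACTIVE_STATES for task in tasks):
            # At least one task is active, so an active family may be legitimate.
            continue
        stale.extend(
            family["id"]
            for family in workflow.get("familyProxies", [])
            if family["state"] in ACTIVE_STATES
        )
    return stale


# Trimmed copy of the response above, keeping one task and three families.
PREFIX = "~user/access_g4_pp_grp11//20250310T1200Z/"
sample = {
    "data": {
        "workflows": [
            {
                "taskProxies": [
                    {"id": PREFIX + "archive_log", "state": "succeeded"},
                ],
                "familyProxies": [
                    {"id": PREFIX + "HHH_000_002", "state": "succeeded"},
                    {"id": PREFIX + "root", "state": "running"},
                    {"id": PREFIX + "RUN_ARCHIVE_002", "state": "running"},
                ],
            }
        ]
    }
}

for family_id in find_stale_families(sample):
    print(family_id)
```

Run against the full response, this would be expected to list only `root` and `RUN_ARCHIVE_002`, the two families shown as `running` above.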

MetRonnie avatar Mar 11 '25 10:03 MetRonnie

@ColemanTom what tasks are under the RUN_ARCHIVE_002 family? And does this go away if you refresh the browser?

MetRonnie avatar Mar 11 '25 10:03 MetRonnie

I should mention, ui 2.5.0 I think, and cylc-flow 8.3.6. This is the first time I've seen it, and it's been months of running many workflows. I'm not able to update versions as we are essentially in a freeze in prep for a release to operations.

ColemanTom avatar Mar 11 '25 10:03 ColemanTom

> @ColemanTom what tasks are under the RUN_ARCHIVE_002 family? And does this go away if you refresh the browser?

I don't have a complete list, but at a guess, 100+. No, a refresh does not change anything.

ColemanTom avatar Mar 11 '25 10:03 ColemanTom

It's possible this is already fixed on 8.4.x, or even possibly might be fixed by https://github.com/cylc/cylc-flow/pull/6589

MetRonnie avatar Mar 11 '25 10:03 MetRonnie

How does GraphQL get its information? When I look at the DB, I can't see any mention of running tasks in the task_states table. I'm guessing it is something stored internally in the running workflow process on the VM?
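The DB check described above can be sketched as follows. This is an illustration under stated assumptions, not a Cylc tool: it assumes the workflow database is an SQLite file whose `task_states` table has `name`, `cycle` and `status` columns, and the function name `active_tasks_in_db` is hypothetical. The demo builds a small in-memory table so the block runs on its own; for a real check, a connection to a copy of the workflow's database file would be passed in instead.

```python
import sqlite3

# States treated as "active" in this sketch (an assumption made for illustration).
ACTIVE_STATES = ("preparing", "submitted", "running")


def active_tasks_in_db(conn: sqlite3.Connection, cycle: str) -> list[tuple]:
    """Return (name, cycle, status) rows for tasks in an active state."""
    placeholders = ", ".join("?" for _ in ACTIVE_STATES)
    query = (
        "SELECT name, cycle, status FROM task_states "
        f"WHERE cycle = ? AND status IN ({placeholders}) "
        "ORDER BY name"
    )
    return conn.execute(query, (cycle, *ACTIVE_STATES)).fetchall()


# Demo data only: a stand-in table with the assumed columns.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE task_states (name TEXT, cycle TEXT, status TEXT)")
demo_conn.executemany(
    "INSERT INTO task_states VALUES (?, ?, ?)",
    [
        ("archive_log", "20250310T1200Z", "succeeded"),
        ("housekeep_local", "20250310T1200Z", "succeeded"),
        ("wait_000_001", "20250310T1800Z", "waiting"),
    ],
)

rows = active_tasks_in_db(demo_conn, "20250310T1200Z")
print(rows if rows else "no active tasks recorded for this cycle")
```

An empty result here, alongside a family shown as `running` in GraphQL, would suggest the two are reading from different sources.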

ColemanTom avatar Mar 11 '25 21:03 ColemanTom

It's probably due to a bug in constructing the datastore that feeds the UI, and probably already fixed as noted above. At least, I think a suspiciously similar problem was fixed.

Have you tried stopping and restarting the workflow, to force reconstruction of the datastore?

hjoliver avatar Mar 12 '25 04:03 hjoliver

> Have you tried stopping and restarting the workflow, to force reconstruction of the datastore?

Not intentionally, but a colleague did happen to stop and update it today, and there is no permanent running task.

ColemanTom avatar Mar 12 '25 05:03 ColemanTom

Sorry, does that mean the problem remains, after the restart?

hjoliver avatar Mar 12 '25 05:03 hjoliver

> Sorry, does that mean the problem remains, after the restart?

After stop/play, the problem is not present.

I am always hesitant to stop/play things in case there is debug information you want extracted. I do have the DB, scheduler logs and job logs saved for reference purposes, but it sounds like they wouldn't be that useful if it is a transient data store in memory.

ColemanTom avatar Mar 12 '25 05:03 ColemanTom

Yes, I don't think any of the routine stored info would help much to debug this (unless perhaps it happened as a result of a series of logged interventions). Good to know that the restart fixed it - that pretty much confirms it's the datastore (perhaps with a small chance of it being a bug in how the UI applies the data feed).

There's probably not much we can do unless we have a reproducible case to examine, and then the first thing would be to run it with the latest Cylc code to check if it's fixed already.

hjoliver avatar Mar 12 '25 05:03 hjoliver

I think it's probably the same thing we've been seeing, which Ronnie fixed (well, half of it). We should get this one in too: https://github.com/cylc/cylc-flow/pull/6589

(I've also seen this bug regularly in NIWA operations, which I will upgrade to the next version that includes this fix)

dwsutherland avatar Mar 14 '25 01:03 dwsutherland