cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

add runtime fields/deltas to def, proxy, job data elements

Open dwsutherland opened this issue 2 years ago • 3 comments

closes #5054

Includes runtime field for def, proxy, and jobs .. with:

  • def the task/family original workflow definition
  • proxy the task/family point version, that changes (deltas available) as soon as you broadcast
  • jobs, a live record of what they ran with.

example: image broadcasting:

cylc broadcast -s '[environment]GREETING = Broadcasting hello' '~sutherlander/linear/run1'

and then running foo again results in:

          {
            "id": "~sutherlander/linear/run1//20220901T00/foo",
            "state": "succeeded",
            "runtime": {
              "platform": "",
              "script": "sleep 2; echo \"$GREETING\"",
              "initScript": "echo 'Me first'",
              "envScript": "echo \"Hi first, I'm second\"",
              "errScript": "echo 'Boo!'",
              "exitScript": "echo 'Yay!'",
              "preScript": "sleep 1",
              "postScript": "sleep 1",
              "workSubDir": "",
              "executionTimeLimit": 0,
              "directives": {},
              "environment": {
                "GREETING": "Broadcasting hello"
              },
              "outputs": {}
            },
            "jobs": [
              {
                "id": "~sutherlander/linear/run1//20220901T00/foo/02",
                "state": "succeeded",
                "jobLogDir": "/home/sutherlander/cylc-run/linear/run1/log/job/20220901T00/foo/02",
                "runtime": {
                  "platform": "localhost",
                  "script": "sleep 2; echo \"$GREETING\"",
                  "initScript": "echo 'Me first'",
                  "envScript": "echo \"Hi first, I'm second\"",
                  "errScript": "echo 'Boo!'",
                  "exitScript": "echo 'Yay!'",
                  "preScript": "sleep 1",
                  "postScript": "sleep 1",
                  "workSubDir": "",
                  "executionTimeLimit": 0,
                  "directives": {},
                  "environment": {
                    "GREETING": "Broadcasting hello"
                  },
                  "outputs": {}
                }
              },
              {
                "id": "~sutherlander/linear/run1//20220901T00/foo/01",
                "state": "succeeded",
                "jobLogDir": "/home/sutherlander/cylc-run/linear/run1/log/job/20220901T00/foo/01",
                "runtime": {
                  "platform": "localhost",
                  "script": "sleep 2; echo \"$GREETING\"",
                  "initScript": "echo 'Me first'",
                  "envScript": "echo \"Hi first, I'm second\"",
                  "errScript": "echo 'Boo!'",
                  "exitScript": "echo 'Yay!'",
                  "preScript": "sleep 1",
                  "postScript": "sleep 1",
                  "workSubDir": "",
                  "executionTimeLimit": 0,
                  "directives": {},
                  "environment": {
                    "GREETING": "Hello from foo!"
                  },
                  "outputs": {}
                }
              }
            ]
          }

Cancel broadcast will create deltas on effected proxy nodes also.

Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • [x] Tests are included (or explain why tests are not needed).
  • [x] CHANGES.md entry included if this is a change that can affect users
  • [x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • [x] If this is a bug fix, PRs raised to both master and the relevant maintenance branch.

dwsutherland avatar Sep 16 '22 00:09 dwsutherland

Did you mean to remove cylc/flow/data_messages_pb2.py ?

hjoliver avatar Sep 16 '22 02:09 hjoliver

Did you mean to remove cylc/flow/data_messages_pb2.py ?

It's not removed (nothing would work otherwise), I used the corresponding protoc:

$ protoc --version
libprotoc 3.19.4

It's just massively reduced.. like in: https://github.com/cylc/cylc-flow/pull/4901

(but not using that new version, so must have happened in protobuf==3.19.*)..

dwsutherland avatar Sep 16 '22 02:09 dwsutherland

This will need to be rebased once #4901 goes in (or visa versa)

(and protobuf module regenerated)

dwsutherland avatar Sep 29 '22 01:09 dwsutherland

I'm getting some problems testing this.

Here's my workflow:

# ~/cylc-run/runtime-test/flow.cylc
[scheduling]
    [[graph]]
        R1 = foo

[runtime]
    [[root]]
        [[[environment]]]
            root = 0
    [[FOO]]
        [[[environment]]]
            FOO = 1
    [[foo]]
        inherit = FOO
        [[[environment]]]
            foo = 2

Here's my query:

query {
  workflows(ids: ["runtime-test"]) {
    taskProxies {
			name
      runtime {
        environment 
      }
    }
    familyProxies {
      name
      runtime {
        environment
      }
    }
    tasks {
      name
      runtime {
        environment
      }
    }
    families {
      name
      runtime {
        environment
      }
    }
  }
}

I started the workflow in paused mode:

$ cylc play --pause runtime-test

And ran the query:

And ran the query, everything came out as expected:
{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "name": "foo",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1",
                "foo": "2"
              }
            }
          }
        ],
        "familyProxies": [
          {
            "name": "FOO",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1"
              }
            }
          },
          {
            "name": "root",
            "runtime": {
              "environment": {
                "root": "0"
              }
            }
          }
        ],
        "tasks": [
          {
            "name": "foo",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1",
                "foo": "2"
              }
            }
          }
        ],
        "families": [
          {
            "name": "root",
            "runtime": {
              "environment": {
                "root": "0"
              }
            }
          },
          {
            "name": "FOO",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1"
              }
            }
          }
        ]
      }
    ]
  }
}

So I made some broadcasts:

$ cylc broadcast runtime-test -n root -s '[environment]root=b0'
Broadcast set:
+ [*/root] [environment]root=b0
$ cylc broadcast runtime-test -n FOO -s '[environment]FOO=b1'
Broadcast set:
+ [*/FOO] [environment]FOO=b1
$ cylc broadcast runtime-test -n foo -s '[environment]foo=b2'
Broadcast set:
+ [*/foo] [environment]foo=b2
However, when I re-ran the query I found the environment variables on the TaskProxy and FamilyProxy nodes had been wiped:
{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "name": "foo",
            "runtime": {
              "environment": {}
            }
          }
        ],
        "familyProxies": [
          {
            "name": "FOO",
            "runtime": {
              "environment": {}
            }
          },
          {
            "name": "root",
            "runtime": {
              "environment": {}
            }
          }
        ],
        "tasks": [
          {
            "name": "foo",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1",
                "foo": "2"
              }
            }
          }
        ],
        "families": [
          {
            "name": "root",
            "runtime": {
              "environment": {
                "root": "0"
              }
            }
          },
          {
            "name": "FOO",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1"
              }
            }
          }
        ]
      }
    ]
  }
}

The issue persisted across restarts:

$ cylc stop runtime-test
Done
$ cylc play --pause runtime-test

oliver-sanders avatar Oct 03 '22 11:10 oliver-sanders

Another problem Oliver has just realised: the order of the [environment] variables needs to be preserved, which we don't think it currently is by the GraphQL mechanism if using a dictionary as you've done at present

MetRonnie avatar Oct 03 '22 13:10 MetRonnie

I think we'll need a list of environment variables to get around this e.g:

{
  "environment": [
    {"key": "answer", "value": 42}
  ]
}

oliver-sanders avatar Oct 03 '22 14:10 oliver-sanders

Investigating.. it appears the job still gets it:

          {
            "name": "foo",
            "runtime": {
              "environment": {}
            },
            "jobs": [
              {
                "runtime": {
                  "platform": "localhost",
                  "script": "sleep 60",
                  "initScript": "",
                  "envScript": "",
                  "errScript": "",
                  "exitScript": "",
                  "preScript": "",
                  "postScript": "",
                  "workSubDir": "",
                  "executionTimeLimit": 0,
                  "directives": {},
                  "environment": {
                    "root": "b0",
                    "FOO": "1",
                    "foo": "2"
                  },
                  "outputs": {}
                }
              },

(just did one broadcast) So something to do with data-store.. (happens with both integer and non-integer cycling as expected)

dwsutherland avatar Oct 11 '22 08:10 dwsutherland

Ok I've fixed the problem.. I was using get() with ordered default dict, I changed to that as a workaround for some tests (so they might fail, and I'll just need to adjust the test(s))..

WRT the environment order:

Another problem Oliver has just realised: the order of the [environment] variables needs to be preserved, which we don't think it currently is by the GraphQL mechanism if using a dictionary as you've done at present

Dictionaries are ordered, and I dump them as string fields:

PbRuntime(
        platform=platform,
        script=rtconfig['script'],
        init_script=rtconfig['init-script'],
        env_script=rtconfig['env-script'],
        err_script=rtconfig['err-script'],
        exit_script=rtconfig['exit-script'],
        pre_script=rtconfig['pre-script'],
        post_script=rtconfig['post-script'],
        work_sub_dir=rtconfig['work sub-directory'],
        execution_time_limit=rtconfig['execution time limit'],
        directives=json.dumps(directives),
        environment=json.dumps(environment),
    )

The resolver does:

def resolve_json_dump(root, info, **args):
    field = getattr(root, to_snake_case(info.field_name), '{}') or '{}'
    return json.loads(field)

So unless something odd happens between this and firing it off to the client, the fields should be fine..

dwsutherland avatar Oct 12 '22 07:10 dwsutherland

I've had a look at [environment] order and confirmed the response order is identical to that which you get in log/config/NN-start-MM.cylc, including when inheritance is involved.

MetRonnie avatar Oct 12 '22 13:10 MetRonnie

Getting a KeyError on tests/functional/retries/01-submission-retry.t

ERROR - 'directives'
Traceback (most recent call last):
  File "~/cylc-flow-8.1/cylc/flow/scheduler.py", line 621, in run_scheduler
    await self.main_loop()
  File "~/cylc-flow-8.1/cylc/flow/scheduler.py", line 1580, in main_loop
    self.release_queued_tasks()
  File "~/cylc-flow-8.1/cylc/flow/scheduler.py", line 1303, in release_queued_tasks
    for itask in self.task_job_mgr.submit_task_jobs(
  File "~/cylc-flow-8.1/cylc/flow/task_job_mgr.py", line 481, in submit_task_jobs
    self._prep_submit_task_job_error(
  File "~/cylc-flow-8.1/cylc/flow/task_job_mgr.py", line 1235, in _prep_submit_task_job_error
    self.task_events_mgr.process_message(
  File "~/cylc-flow-8.1/cylc/flow/task_events_mgr.py", line 652, in process_message
    if self._process_message_submit_failed(
  File "~/cylc-flow-8.1/cylc/flow/task_events_mgr.py", line 1186, in _process_message_submit_failed
    self._insert_task_job(itask, event_time, self.JOB_SUBMIT_FAIL_FLAG)
  File "~/cylc-flow-8.1/cylc/flow/task_events_mgr.py", line 1264, in _insert_task_job
    self.data_store_mgr.insert_job(
  File "~/cylc-flow-8.1/cylc/flow/data_store_mgr.py", line 1250, in insert_job
    j_buf.runtime.CopyFrom(runtime_from_config(j_cfg))
  File "~/cylc-flow-8.1/cylc/flow/data_store_mgr.py", line 201, in runtime_from_config
    directives = rtconfig['directives']
KeyError: 'directives'

MetRonnie avatar Oct 12 '22 13:10 MetRonnie

Getting a KeyError on tests/functional/retries/01-submission-retry.t

Sorted now.. It's because some submission failures happen before full job construction. (this is why I had changed it before, causing the field deletion due to parsec dictionary behavior with get())

dwsutherland avatar Oct 17 '22 06:10 dwsutherland

WRT the environment order:

Another problem Oliver has just realised: the order of the [environment] variables needs to be preserved, which we don't think it currently is by the GraphQL mechanism if using a dictionary as you've done at present

Dictionaries are ordered, and I dump them as string fields:

Dictionaries are ordered in Python land, however, they are not in either JSON or JS. Because JSON is the transport format GraphQL uses and the UI code is JS I would not expect this order to be preserved.

So I think we need a JSON structure like this to preserve order:

[
  {
    "key": "FOO",
    "value": "42"
  },
  {
    "key": "BAR",
    "value": "answer"
  }
]

oliver-sanders avatar Oct 18 '22 10:10 oliver-sanders

Here's the problem:

An object is an unordered set of name/value pairs.

-- https://www.json.org/json-en.html

Some implementations of JSON writers/parsers might decide to preserve order, but it's completely chance if they do.

oliver-sanders avatar Oct 18 '22 11:10 oliver-sanders

Some implementations of JSON writers/parsers might decide to preserve order, but it's completely chance if they do.

So even if the Python json library writes them in order, the parser(s) at the client end we can't guarantee?

We could just check this .. Unless there are many parsers and/or we want to keep the door open to parsers that are inconsistent.

dwsutherland avatar Oct 18 '22 23:10 dwsutherland

I suspect the python side is ordered according to the dict (json has a sort_keys option but we don't and shouldn't use it in this case), because the graphql tests would be very flaky otherwise..

Yeah they are: https://docs.python.org/3/library/json.html

Note: This module’s encoders and decoders preserve input and output order by default. Order is only lost if the underlying containers are unordered.

dwsutherland avatar Oct 19 '22 00:10 dwsutherland

The Python end is fine, it's the JS side which is problematic.

Since Python 3.7 (or CPython 3.6) dict keys have been maintained in insertion order in Python. However, this is not true for Object keys in JavaScript where there is no guarantee of order in the language spec (so it's like Python <=3.5). Just because we push them out in order doesn't mean they'll be read back in order, or preserve their order when operated on by JS.

oliver-sanders avatar Oct 25 '22 09:10 oliver-sanders

I have just spotted - I don't seem to be getting the list-type fields.

[runtime]
    [[foo]]
        execution retry delays = 5*PT2M, PT5M
{
  "data": {
    "taskProxy": {
      "runtime": {
        "executionRetryDelays": [], // empty
        ...

MetRonnie avatar Oct 25 '22 12:10 MetRonnie

So I think we need a JSON structure like this to preserve order:

[
  {
    "key": "FOO",
    "value": "42"
  },
  {
    "key": "BAR",
    "value": "answer"
  }
]

I think it would also make sense to make directives and outputs use that structure, even if they don't need to preserve order. (It makes it easier to handle in cylc-ui if they're consistent with environment)

MetRonnie avatar Oct 27 '22 15:10 MetRonnie

Just because we push them out in order doesn't mean they'll be read back in order, or preserve their order when operated on by JS.

Yeah, but, before I go an change it.. Does the JS json library read it in order? (regardless of the 1999 spec) Or is there no order guarantee post read (in memory) also?

dwsutherland avatar Oct 28 '22 03:10 dwsutherland

Or is there no order guarantee post read (in memory) also?

tldr; JSON doesn't guarantee order, JS doesn't guarantee order, ECMA2015/ES6 does guarantee order in some cases, for some, but not all interfaces. So order is largely down to the implementation and likely to be unstable.

Searching around I found this on ECMA2015/ES6:

Conclusion: Even in ES2015 you shouldn't rely on the property order of normal objects in JavaScript. It is prone to errors. If you need ordered named pairs, use Map instead, which purely uses insertion order. If you just need order, use an array or Set (which also uses purely insertion order).

-- https://stackoverflow.com/questions/5525795/does-javascript-guarantee-object-property-order/38218582#38218582

Generally JS things seem to use Arrays of Objects where order matters.

oliver-sanders avatar Oct 28 '22 10:10 oliver-sanders

So I think we need a JSON structure like this to preserve order:

[
  {
    "key": "FOO",
    "value": "42"
  },
  {
    "key": "BAR",
    "value": "answer"
  }
]

I think it would also make sense to make directives and outputs use that structure, even if they don't need to preserve order. (It makes it easier to handle in cylc-ui if they're consistent with environment)

Why not just:

[
  {
    "FOO": "42"
  },
  {
    "BAR": "answer"
  }
]

Any reason for the "key" "value"?

dwsutherland avatar Oct 31 '22 06:10 dwsutherland

Have changed it to:

[
  {
    "FOO": "42"
  },
  {
    "BAR": "answer"
  }
]

and added outputs.

dwsutherland avatar Oct 31 '22 06:10 dwsutherland

Any reason for the "key" "value"?

It would make it easier to use in the Vue component.

MetRonnie avatar Oct 31 '22 10:10 MetRonnie

Still have this problem where e.g. execution retry delays doesn't come through

[runtime]
    [[foo]]
        execution retry delays = 5*PT2M, PT5M
{
  "data": {
    "taskProxy": {
      "runtime": {
        "executionRetryDelays": [], // empty
        ...

MetRonnie avatar Oct 31 '22 13:10 MetRonnie

Any reason for the "key" "value"?

It would make it easier to use in the Vue component.

So you don't have to iterate the dict? Ok will change..

dwsutherland avatar Nov 01 '22 01:11 dwsutherland

Fixed .. however, like the execution timeout, the string values are converted on config parse:

        execution retry delays = PT15S, PT10M, PT1H, PT3H
        submission retry delays = PT2M
              "submissionRetryDelays": [
                "120.0"
              ],
              "executionRetryDelays": [
                "15.0",
                "600.0",
                "3600.0",
                "10800.0"
              ]

so that's a wider issue of not having the original string available.

dwsutherland avatar Nov 01 '22 05:11 dwsutherland

so that's a wider issue of not having the original string available.

Parsec can convert these values back to the original form which is what happens when we run cylc config <id> so we should be able to handle it.

If we were to cylc broadcast these number values back to the workflow it would rejected as invalid. For the "broadcast edit" work we'll have to check whether this applies to any other fields.

oliver-sanders avatar Nov 01 '22 10:11 oliver-sanders

I've "stringif"d the float lists, and the numerical types already have a str method to convert into appropriate list:

graphql output
          {
            "name": "foo",
            "runtime": {
              "environment": [
                {
                  "key": "GREETING",
                  "value": "Hello from foo!"
                }
              ],
              "outputs": [
                {
                  "key": "OUTPUT1",
                  "value": "yes"
                },
                {
                  "key": "OUTPUT2",
                  "value": "no"
                }
              ],
              "submissionRetryDelays": "PT2M",
              "submissionPollingIntervals": "PT1M, PT2M",
              "executionTimeLimit": "PT1H",
              "executionRetryDelays": "PT15S, PT10M, PT1H, PT3H",
              "executionPollingIntervals": "PT1M, PT10M, PT10M, PT10M, PT10M, PT10M, PT1M"
            },
            "outputs": [
              {
                "label": "expired",
                "message": "expired",
                "satisfied": false,
                "time": 1667363549.6685379
              },
              {
                "label": "submitted",
                "message": "submitted",
                "satisfied": true,
                "time": 1667363558.671344
              },
              {
                "label": "submit-failed",
                "message": "submit-failed",
                "satisfied": false,
                "time": 1667363549.6685379
              },
              {
                "label": "started",
                "message": "started",
                "satisfied": true,
                "time": 1667363559.6712284
              },
              {
                "label": "succeeded",
                "message": "succeeded",
                "satisfied": false,
                "time": 1667363549.6685379
              },
              {
                "label": "failed",
                "message": "failed",
                "satisfied": false,
                "time": 1667363549.6685379
              },
              {
                "label": "OUTPUT1",
                "message": "yes",
                "satisfied": false,
                "time": 1667363549.6685379
              },
              {
                "label": "OUTPUT2",
                "message": "no",
                "satisfied": false,
                "time": 1667363549.6685379
              }
            ],
            "jobs": [
              {
                "executionTimeLimit": 3600,
                "runtime": {
                  "platform": "localhost",
                  "script": "sleep 2; echo \"$GREETING\"",
                  "initScript": "echo 'Me first'",
                  "envScript": "echo \"Hi first, I'm second\"",
                  "errScript": "echo 'Boo!'",
                  "exitScript": "echo 'Yay!'",
                  "preScript": "sleep 1",
                  "postScript": "sleep 1",
                  "workSubDir": "",
                  "executionTimeLimit": "PT1H",
                  "directives": [],
                  "environment": [
                    {
                      "key": "GREETING",
                      "value": "Hello from foo!"
                    }
                  ],
                  "outputs": [
                    {
                      "key": "OUTPUT1",
                      "value": "yes"
                    },
                    {
                      "key": "OUTPUT2",
                      "value": "no"
                    }
                  ]
                }
              }
            ]
          }

One thing, they obviously don't get grouped and/or replicate the original exactly:

    [[FAM]]
        execution time limit = PT1H
        execution polling intervals = PT1M, 5*PT10M, PT1M
        execution retry delays = PT15S, PT10M, PT1H, PT3H
        submission polling intervals = PT1M, PT2M
        submission retry delays = PT2M

(the function I'm calling does attempt to group for other things)

Also when doing a broadcast; cylc broadcast linear -n FAM -s 'execution retry delays=PT25S' it doesn't get converted into that shorthand time format:

              "submissionRetryDelays": "PT2M",
              "submissionPollingIntervals": "PT1M, PT2M",
              "executionTimeLimit": "PT1H",
              "executionRetryDelays": "25.0",
              "executionPollingIntervals": "PT1M, PT10M, PT10M, PT10M, PT10M, PT10M, PT1M"

This is because the broadcasts use regular objects and not the same as our config.. This is will be my next commit, to change how broadcasts are assimilated and produce overrides.

dwsutherland avatar Nov 02 '22 06:11 dwsutherland

Amongst other things, the latest commit fixes the incorrect displayed format: before

$ cylc broadcast linear -n FAM -s 'execution retry delays=PT35S, PT60S, PT130M'
Broadcast set:
+ [*/FAM] execution retry delays=[35.0, 60.0, 7800.0]

after

$ cylc broadcast linear -n FAM -s 'execution retry delays=PT35S, PT60S, PT130M'
Broadcast set:
+ [*/FAM] execution retry delays=PT35S, PT60S, PT130M

It does this by keeping the external (to the scheduler) settings value in raw .cylc format, external includes:

  • Arguments from broadcast command/mutation
  • Saved DB format
  • GraphQL/API runtime fields

And coercing this external format to internal objects (i.e. duration fields (execution retry delays...) to DurationFloat values) when assimilating DB loads or CLI/API broadcast put mutations in the scheduler.

Previously the internal format was generated externally then read in like this.. Which meant we would get float instead of DurationFloat internally (as DurationFloat isn't passed over json)... This is a bug/mistake (resulting in the CLI-response and API-field being in seconds).

Have added to an existing test, to cover these fields.. (there may be some test failures from DB loads that have broadcast settings saved in the Float format.. will address)

dwsutherland avatar Nov 06 '22 04:11 dwsutherland

hmm...

Error: Codecov: Failed to properly upload: The process '/home/runner/work/_actions/codecov/codecov-action/v3/dist/codecov' failed with exit code 255

for both "failed" tests (2nd attempt) .. the actual tests passed (and coverage)

ah... that token thing:

[2022-11-07T07:06:27.545Z] ['error'] There was an error running the uploader: Error uploading to [https://codecov.io:](https://codecov.io/) Error: There was an error fetching the storage URL during POST: 404 - {'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}

dwsutherland avatar Nov 07 '22 07:11 dwsutherland