cylc-flow
add runtime fields/deltas to def, proxy, job data elements
closes #5054
Includes a runtime field for def, proxy, and job data elements, where:
- def holds the task/family's original workflow definition
- proxy holds the task/family's cycle-point version, which changes (deltas available) as soon as you broadcast
- jobs hold a live record of what each job ran with.
Example: broadcasting with
cylc broadcast -s '[environment]GREETING = Broadcasting hello' '~sutherlander/linear/run1'
and then running foo again results in:
{
  "id": "~sutherlander/linear/run1//20220901T00/foo",
  "state": "succeeded",
  "runtime": {
    "platform": "",
    "script": "sleep 2; echo \"$GREETING\"",
    "initScript": "echo 'Me first'",
    "envScript": "echo \"Hi first, I'm second\"",
    "errScript": "echo 'Boo!'",
    "exitScript": "echo 'Yay!'",
    "preScript": "sleep 1",
    "postScript": "sleep 1",
    "workSubDir": "",
    "executionTimeLimit": 0,
    "directives": {},
    "environment": {
      "GREETING": "Broadcasting hello"
    },
    "outputs": {}
  },
  "jobs": [
    {
      "id": "~sutherlander/linear/run1//20220901T00/foo/02",
      "state": "succeeded",
      "jobLogDir": "/home/sutherlander/cylc-run/linear/run1/log/job/20220901T00/foo/02",
      "runtime": {
        "platform": "localhost",
        "script": "sleep 2; echo \"$GREETING\"",
        "initScript": "echo 'Me first'",
        "envScript": "echo \"Hi first, I'm second\"",
        "errScript": "echo 'Boo!'",
        "exitScript": "echo 'Yay!'",
        "preScript": "sleep 1",
        "postScript": "sleep 1",
        "workSubDir": "",
        "executionTimeLimit": 0,
        "directives": {},
        "environment": {
          "GREETING": "Broadcasting hello"
        },
        "outputs": {}
      }
    },
    {
      "id": "~sutherlander/linear/run1//20220901T00/foo/01",
      "state": "succeeded",
      "jobLogDir": "/home/sutherlander/cylc-run/linear/run1/log/job/20220901T00/foo/01",
      "runtime": {
        "platform": "localhost",
        "script": "sleep 2; echo \"$GREETING\"",
        "initScript": "echo 'Me first'",
        "envScript": "echo \"Hi first, I'm second\"",
        "errScript": "echo 'Boo!'",
        "exitScript": "echo 'Yay!'",
        "preScript": "sleep 1",
        "postScript": "sleep 1",
        "workSubDir": "",
        "executionTimeLimit": 0,
        "directives": {},
        "environment": {
          "GREETING": "Hello from foo!"
        },
        "outputs": {}
      }
    }
  ]
}
Cancelling a broadcast will also create deltas on the affected proxy nodes.
Check List
- [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
- [x] Contains logically grouped changes (else tidy your branch by rebase).
- [x] Does not contain off-topic changes (use other PRs for other changes).
- [x] Applied any dependency changes to both setup.cfg and conda-environment.yml.
- [x] Tests are included (or explain why tests are not needed).
- [x] CHANGES.md entry included if this is a change that can affect users.
- [x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
- [x] If this is a bug fix, PRs raised to both master and the relevant maintenance branch.
Did you mean to remove cylc/flow/data_messages_pb2.py?
Did you mean to remove cylc/flow/data_messages_pb2.py?
It's not removed (nothing would work otherwise); I used the corresponding protoc:
$ protoc --version
libprotoc 3.19.4
It's just massively reduced.. like in: https://github.com/cylc/cylc-flow/pull/4901 (but not using that new version, so it must have happened in protobuf==3.19.*)..
This will need to be rebased once #4901 goes in (or vice versa)
(and protobuf module regenerated)
I'm getting some problems testing this.
Here's my workflow:
# ~/cylc-run/runtime-test/flow.cylc
[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[root]]
        [[[environment]]]
            root = 0
    [[FOO]]
        [[[environment]]]
            FOO = 1
    [[foo]]
        inherit = FOO
        [[[environment]]]
            foo = 2
Here's my query:
query {
  workflows(ids: ["runtime-test"]) {
    taskProxies {
      name
      runtime {
        environment
      }
    }
    familyProxies {
      name
      runtime {
        environment
      }
    }
    tasks {
      name
      runtime {
        environment
      }
    }
    families {
      name
      runtime {
        environment
      }
    }
  }
}
I started the workflow in paused mode:
$ cylc play --pause runtime-test
And ran the query, everything came out as expected:
{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "name": "foo",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1",
                "foo": "2"
              }
            }
          }
        ],
        "familyProxies": [
          {
            "name": "FOO",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1"
              }
            }
          },
          {
            "name": "root",
            "runtime": {
              "environment": {
                "root": "0"
              }
            }
          }
        ],
        "tasks": [
          {
            "name": "foo",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1",
                "foo": "2"
              }
            }
          }
        ],
        "families": [
          {
            "name": "root",
            "runtime": {
              "environment": {
                "root": "0"
              }
            }
          },
          {
            "name": "FOO",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1"
              }
            }
          }
        ]
      }
    ]
  }
}
So I made some broadcasts:
$ cylc broadcast runtime-test -n root -s '[environment]root=b0'
Broadcast set:
+ [*/root] [environment]root=b0
$ cylc broadcast runtime-test -n FOO -s '[environment]FOO=b1'
Broadcast set:
+ [*/FOO] [environment]FOO=b1
$ cylc broadcast runtime-test -n foo -s '[environment]foo=b2'
Broadcast set:
+ [*/foo] [environment]foo=b2
However, when I re-ran the query I found the environment variables on the TaskProxy and FamilyProxy nodes had been wiped:
{
  "data": {
    "workflows": [
      {
        "taskProxies": [
          {
            "name": "foo",
            "runtime": {
              "environment": {}
            }
          }
        ],
        "familyProxies": [
          {
            "name": "FOO",
            "runtime": {
              "environment": {}
            }
          },
          {
            "name": "root",
            "runtime": {
              "environment": {}
            }
          }
        ],
        "tasks": [
          {
            "name": "foo",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1",
                "foo": "2"
              }
            }
          }
        ],
        "families": [
          {
            "name": "root",
            "runtime": {
              "environment": {
                "root": "0"
              }
            }
          },
          {
            "name": "FOO",
            "runtime": {
              "environment": {
                "root": "0",
                "FOO": "1"
              }
            }
          }
        ]
      }
    ]
  }
}
The issue persisted across restarts:
$ cylc stop runtime-test
Done
$ cylc play --pause runtime-test
Another problem Oliver has just realised: the order of the [environment] variables needs to be preserved, which we don't think is currently guaranteed by the GraphQL mechanism if using a dictionary as you've done at present.
I think we'll need a list of environment variables to get around this e.g:
{
  "environment": [
    {"key": "answer", "value": 42}
  ]
}
Investigating.. it appears the job still gets it:
{
  "name": "foo",
  "runtime": {
    "environment": {}
  },
  "jobs": [
    {
      "runtime": {
        "platform": "localhost",
        "script": "sleep 60",
        "initScript": "",
        "envScript": "",
        "errScript": "",
        "exitScript": "",
        "preScript": "",
        "postScript": "",
        "workSubDir": "",
        "executionTimeLimit": 0,
        "directives": {},
        "environment": {
          "root": "b0",
          "FOO": "1",
          "foo": "2"
        },
        "outputs": {}
      }
    },
(just did one broadcast) So something to do with data-store.. (happens with both integer and non-integer cycling as expected)
Ok I've fixed the problem.. I was using get() with an ordered dict as the default; I'd changed to that as a workaround for some tests (so they might fail, and I'll just need to adjust the test(s))..
WRT the environment order:
Another problem Oliver has just realised: the order of the [environment] variables needs to be preserved, which we don't think it currently is by the GraphQL mechanism if using a dictionary as you've done at present
Dictionaries are ordered, and I dump them as string fields:
PbRuntime(
    platform=platform,
    script=rtconfig['script'],
    init_script=rtconfig['init-script'],
    env_script=rtconfig['env-script'],
    err_script=rtconfig['err-script'],
    exit_script=rtconfig['exit-script'],
    pre_script=rtconfig['pre-script'],
    post_script=rtconfig['post-script'],
    work_sub_dir=rtconfig['work sub-directory'],
    execution_time_limit=rtconfig['execution time limit'],
    directives=json.dumps(directives),
    environment=json.dumps(environment),
)
The resolver does:
def resolve_json_dump(root, info, **args):
    field = getattr(root, to_snake_case(info.field_name), '{}') or '{}'
    return json.loads(field)
So unless something odd happens between this and firing it off to the client, the fields should be fine..
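As a quick sanity check on the Python side, a round-trip through the json module does preserve dict insertion order (a minimal standalone sketch, not actual cylc-flow code):

```python
import json

# Scheduler side: an insertion-ordered dict dumped to a string field,
# as done for the "environment" runtime field.
environment = {"root": "0", "FOO": "1", "foo": "2"}
field = json.dumps(environment)

# Resolver side: load the string field back into a dict.
loaded = json.loads(field)

# Python's json encoders/decoders preserve input order by default
# (no sort_keys), so the keys come back in insertion order.
print(list(loaded.keys()))  # → ['root', 'FOO', 'foo']
```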
I've had a look at [environment] order and confirmed the response order is identical to that which you get in log/config/NN-start-MM.cylc, including when inheritance is involved.
Getting a KeyError on tests/functional/retries/01-submission-retry.t
ERROR - 'directives'
Traceback (most recent call last):
  File "~/cylc-flow-8.1/cylc/flow/scheduler.py", line 621, in run_scheduler
    await self.main_loop()
  File "~/cylc-flow-8.1/cylc/flow/scheduler.py", line 1580, in main_loop
    self.release_queued_tasks()
  File "~/cylc-flow-8.1/cylc/flow/scheduler.py", line 1303, in release_queued_tasks
    for itask in self.task_job_mgr.submit_task_jobs(
  File "~/cylc-flow-8.1/cylc/flow/task_job_mgr.py", line 481, in submit_task_jobs
    self._prep_submit_task_job_error(
  File "~/cylc-flow-8.1/cylc/flow/task_job_mgr.py", line 1235, in _prep_submit_task_job_error
    self.task_events_mgr.process_message(
  File "~/cylc-flow-8.1/cylc/flow/task_events_mgr.py", line 652, in process_message
    if self._process_message_submit_failed(
  File "~/cylc-flow-8.1/cylc/flow/task_events_mgr.py", line 1186, in _process_message_submit_failed
    self._insert_task_job(itask, event_time, self.JOB_SUBMIT_FAIL_FLAG)
  File "~/cylc-flow-8.1/cylc/flow/task_events_mgr.py", line 1264, in _insert_task_job
    self.data_store_mgr.insert_job(
  File "~/cylc-flow-8.1/cylc/flow/data_store_mgr.py", line 1250, in insert_job
    j_buf.runtime.CopyFrom(runtime_from_config(j_cfg))
  File "~/cylc-flow-8.1/cylc/flow/data_store_mgr.py", line 201, in runtime_from_config
    directives = rtconfig['directives']
KeyError: 'directives'
Getting a KeyError on tests/functional/retries/01-submission-retry.t
Sorted now.. It's because some submission failures happen before full job construction.
(this is why I had changed it before, causing the field deletion due to parsec dictionary behavior with get())
WRT the environment order:
Another problem Oliver has just realised: the order of the [environment] variables needs to be preserved, which we don't think it currently is by the GraphQL mechanism if using a dictionary as you've done at present
Dictionaries are ordered, and I dump them as string fields:
Dictionaries are ordered in Python land, however, they are not in either JSON or JS. Because JSON is the transport format GraphQL uses and the UI code is JS I would not expect this order to be preserved.
So I think we need a JSON structure like this to preserve order:
[
  {
    "key": "FOO",
    "value": "42"
  },
  {
    "key": "BAR",
    "value": "answer"
  }
]
Here's the problem:
An object is an unordered set of name/value pairs.
-- https://www.json.org/json-en.html
Some implementations of JSON writers/parsers might decide to preserve order, but it's completely chance if they do.
Some implementations of JSON writers/parsers might decide to preserve order, but it's completely chance if they do.
So even if the Python json library writes them in order, we can't guarantee the parser(s) at the client end?
We could just check this .. unless there are many parsers and/or we want to keep the door open to parsers that are inconsistent.
I suspect the python side is ordered according to the dict (json has a sort_keys option but we don't and shouldn't use it in this case), because the graphql tests would be very flaky otherwise..
Yeah they are: https://docs.python.org/3/library/json.html
Note: This module’s encoders and decoders preserve input and output order by default. Order is only lost if the underlying containers are unordered.
The Python end is fine, it's the JS side which is problematic.
Since Python 3.7 (or CPython 3.6) dict keys have been maintained in insertion order in Python. However, this is not true for Object keys in JavaScript, where there is no guarantee of order in the language spec (so it's like Python <=3.5). Just because we push them out in order doesn't mean they'll be read back in order, or preserve their order when operated on by JS.
I have just spotted - I don't seem to be getting the list-type fields.
[runtime]
    [[foo]]
        execution retry delays = 5*PT2M, PT5M
{
  "data": {
    "taskProxy": {
      "runtime": {
        "executionRetryDelays": [], // empty
        ...
So I think we need a JSON structure like this to preserve order:
[ { "key": "FOO", "value": "42" }, { "key": "BAR", "value": "answer" } ]
I think it would also make sense to make directives and outputs use that structure, even if they don't need to preserve order. (It makes it easier to handle in cylc-ui if they're consistent with environment)
Just because we push them out in order doesn't mean they'll be read back in order, or preserve their order when operated on by JS.
Yeah, but, before I go and change it.. does the JS JSON library read it in order? (regardless of the 1999 spec)
Or is there no order guarantee post read (in memory) also?
Or is there no order guarantee post read (in memory) also?
tl;dr: JSON doesn't guarantee order, JS doesn't guarantee order; ECMA2015/ES6 does guarantee order in some cases, for some but not all interfaces. So order is largely down to the implementation and likely to be unstable.
Searching around I found this on ECMA2015/ES6:
Conclusion: Even in ES2015 you shouldn't rely on the property order of normal objects in JavaScript. It is prone to errors. If you need ordered named pairs, use Map instead, which purely uses insertion order. If you just need order, use an array or Set (which also uses purely insertion order).
-- https://stackoverflow.com/questions/5525795/does-javascript-guarantee-object-property-order/38218582#38218582
Generally JS things seem to use Arrays of Objects where order matters.
So I think we need a JSON structure like this to preserve order:
[ { "key": "FOO", "value": "42" }, { "key": "BAR", "value": "answer" } ]
I think it would also make sense to make directives and outputs use that structure, even if they don't need to preserve order. (It makes it easier to handle in cylc-ui if they're consistent with environment)
Why not just:
[
  {
    "FOO": "42"
  },
  {
    "BAR": "answer"
  }
]
Any reason for the "key"/"value"?
Have changed it to:
[
  {
    "FOO": "42"
  },
  {
    "BAR": "answer"
  }
]
and added outputs.
Any reason for the "key"/"value"?
It would make it easier to use in the Vue component.
Still have this problem where e.g. execution retry delays doesn't come through:
[runtime]
    [[foo]]
        execution retry delays = 5*PT2M, PT5M
{
  "data": {
    "taskProxy": {
      "runtime": {
        "executionRetryDelays": [], // empty
        ...
Any reason for the "key"/"value"?
It would make it easier to use in the Vue component.
So you don't have to iterate the dict? Ok will change..
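The change could look something like this on the Python side (a minimal sketch; to_key_value_list is a hypothetical helper name, not the actual cylc-flow function):

```python
import json
from typing import Dict, List


def to_key_value_list(mapping: Dict[str, str]) -> List[dict]:
    """Convert an insertion-ordered mapping to a list of key/value objects.

    Unlike plain JSON object keys, the list form keeps its order through
    JSON/JS transport and in the Vue component.
    """
    return [{"key": key, "value": str(value)} for key, value in mapping.items()]


environment = {"FOO": "42", "BAR": "answer"}
print(json.dumps(to_key_value_list(environment)))
# → [{"key": "FOO", "value": "42"}, {"key": "BAR", "value": "answer"}]
```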
Fixed .. however, like the execution timeout, the string values are converted on config parse:
execution retry delays = PT15S, PT10M, PT1H, PT3H
submission retry delays = PT2M
"submissionRetryDelays": [
  "120.0"
],
"executionRetryDelays": [
  "15.0",
  "600.0",
  "3600.0",
  "10800.0"
]
so that's a wider issue of not having the original string available.
so that's a wider issue of not having the original string available.
Parsec can convert these values back to the original form, which is what happens when we run cylc config <id>, so we should be able to handle it.
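For illustration only, the seconds-to-ISO-8601 direction could be sketched like this (a simplified, hypothetical stand-in for what parsec actually does, handling only whole seconds up to hours):

```python
def seconds_to_iso_duration(seconds: float) -> str:
    """Render a duration in seconds as an ISO 8601 string, e.g. 7800.0 -> PT2H10M."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    # Emit only the non-zero components, falling back to PT0S for zero.
    parts = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (secs, "S"))
        if value
    )
    return f"PT{parts or '0S'}"


print(seconds_to_iso_duration(120.0))    # → PT2M
print(seconds_to_iso_duration(10800.0))  # → PT3H
```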
If we were to cylc broadcast these number values back to the workflow they would be rejected as invalid. For the "broadcast edit" work we'll have to check whether this applies to any other fields.
I've "stringified" the float lists, and the numerical types already have a str method to convert into the appropriate list:
graphql output
{
  "name": "foo",
  "runtime": {
    "environment": [
      {
        "key": "GREETING",
        "value": "Hello from foo!"
      }
    ],
    "outputs": [
      {
        "key": "OUTPUT1",
        "value": "yes"
      },
      {
        "key": "OUTPUT2",
        "value": "no"
      }
    ],
    "submissionRetryDelays": "PT2M",
    "submissionPollingIntervals": "PT1M, PT2M",
    "executionTimeLimit": "PT1H",
    "executionRetryDelays": "PT15S, PT10M, PT1H, PT3H",
    "executionPollingIntervals": "PT1M, PT10M, PT10M, PT10M, PT10M, PT10M, PT1M"
  },
  "outputs": [
    {
      "label": "expired",
      "message": "expired",
      "satisfied": false,
      "time": 1667363549.6685379
    },
    {
      "label": "submitted",
      "message": "submitted",
      "satisfied": true,
      "time": 1667363558.671344
    },
    {
      "label": "submit-failed",
      "message": "submit-failed",
      "satisfied": false,
      "time": 1667363549.6685379
    },
    {
      "label": "started",
      "message": "started",
      "satisfied": true,
      "time": 1667363559.6712284
    },
    {
      "label": "succeeded",
      "message": "succeeded",
      "satisfied": false,
      "time": 1667363549.6685379
    },
    {
      "label": "failed",
      "message": "failed",
      "satisfied": false,
      "time": 1667363549.6685379
    },
    {
      "label": "OUTPUT1",
      "message": "yes",
      "satisfied": false,
      "time": 1667363549.6685379
    },
    {
      "label": "OUTPUT2",
      "message": "no",
      "satisfied": false,
      "time": 1667363549.6685379
    }
  ],
  "jobs": [
    {
      "executionTimeLimit": 3600,
      "runtime": {
        "platform": "localhost",
        "script": "sleep 2; echo \"$GREETING\"",
        "initScript": "echo 'Me first'",
        "envScript": "echo \"Hi first, I'm second\"",
        "errScript": "echo 'Boo!'",
        "exitScript": "echo 'Yay!'",
        "preScript": "sleep 1",
        "postScript": "sleep 1",
        "workSubDir": "",
        "executionTimeLimit": "PT1H",
        "directives": [],
        "environment": [
          {
            "key": "GREETING",
            "value": "Hello from foo!"
          }
        ],
        "outputs": [
          {
            "key": "OUTPUT1",
            "value": "yes"
          },
          {
            "key": "OUTPUT2",
            "value": "no"
          }
        ]
      }
    }
  ]
}
One thing, they obviously don't get grouped and/or replicate the original exactly:
[[FAM]]
    execution time limit = PT1H
    execution polling intervals = PT1M, 5*PT10M, PT1M
    execution retry delays = PT15S, PT10M, PT1H, PT3H
    submission polling intervals = PT1M, PT2M
    submission retry delays = PT2M
(the function I'm calling does attempt to group for other things)
Also when doing a broadcast:
cylc broadcast linear -n FAM -s 'execution retry delays=PT25S'
it doesn't get converted into that shorthand time format:
"submissionRetryDelays": "PT2M",
"submissionPollingIntervals": "PT1M, PT2M",
"executionTimeLimit": "PT1H",
"executionRetryDelays": "25.0",
"executionPollingIntervals": "PT1M, PT10M, PT10M, PT10M, PT10M, PT10M, PT1M"
This is because the broadcasts use regular objects and not the same as our config.. This will be my next commit: to change how broadcasts are assimilated and produce overrides.
Amongst other things, the latest commit fixes the incorrectly displayed format:
before
$ cylc broadcast linear -n FAM -s 'execution retry delays=PT35S, PT60S, PT130M'
Broadcast set:
+ [*/FAM] execution retry delays=[35.0, 60.0, 7800.0]
after
$ cylc broadcast linear -n FAM -s 'execution retry delays=PT35S, PT60S, PT130M'
Broadcast set:
+ [*/FAM] execution retry delays=PT35S, PT60S, PT130M
It does this by keeping the external (to the scheduler) settings value in raw .cylc format, where external includes:
- Arguments from the broadcast command/mutation
- Saved DB format
- GraphQL/API runtime fields
and coercing this external format to internal objects (i.e. duration fields such as execution retry delays to DurationFloat values) when assimilating DB loads or CLI/API broadcast put mutations in the scheduler.
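The external-to-internal coercion could be sketched roughly like so (a simplified illustration; coerce_duration is hypothetical and handles only plain PT...H/M/S strings, unlike the real parsec/isodatetime coercion):

```python
import re


def coerce_duration(value: str) -> float:
    """Coerce an external ISO 8601 duration string ('PT...') to seconds.

    A real implementation would return a DurationFloat-style type that
    renders back to ISO 8601 form via str().
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"Invalid duration: {value!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return float(hours * 3600 + minutes * 60 + seconds)


print(coerce_duration("PT35S"))   # → 35.0
print(coerce_duration("PT130M"))  # → 7800.0
```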
Previously the internal format was generated externally and then read in like this, which meant we would get float instead of DurationFloat internally (as DurationFloat isn't passed over JSON)... This was a bug/mistake (resulting in the CLI response and API field being in seconds).
Have added to an existing test to cover these fields..
(there may be some test failures from DB loads that have broadcast settings saved in the float format.. will address)
hmm...
Error: Codecov: Failed to properly upload: The process '/home/runner/work/_actions/codecov/codecov-action/v3/dist/codecov' failed with exit code 255
for both "failed" tests (2nd attempt) .. the actual tests passed (and coverage)
ah... that token thing:
[2022-11-07T07:06:27.545Z] ['error'] There was an error running the uploader: Error uploading to [https://codecov.io:](https://codecov.io/) Error: There was an error fetching the storage URL during POST: 404 - {'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}