maestrowf
maestrowf copied to clipboard
added flux adaptor v0.31.0
- copied over the v0.26 file to a new one for v0.31.
- added a new "latest" version value.
simply followed the existing style and strategy to expand to a new version. are we going to need a new file for each version?
@bhatiaharsh -- Thanks for filing this PR; since you made a new file, I can't tell if anything is actually different in the adapter compared to 0.26.0
. I can check later, but if this is the same interface as 0.26.0
then there's no need to check in a new file for it. The file just needs to be checked in if 0.31.0
has different API calls and changes that the prior adapter can't handle.
okay. i just copied the file, since i thought you're grabbing the appropriate module out. i think the new file is not needed. when you get a chance, would you please add the minimal needed here?
@bhatiaharsh -- does the adapter work when specifying 0.26.0
in the most recent version of Flux? If there's nothing abnormal happening I don't think any modifications need to be made.
One thing that I had been intending to do is to make the Flux adapters more version range focused, or to outright try and detect what version was in the environment. But that would have to be a future todo item.
@bhatiaharsh -- does the adapter work when specifying
0.26.0
in the most recent version of Flux? If there's nothing abnormal happening I don't think any modifications need to be made.One thing that I had been intending to do is to make the Flux adapters more version range focused, or to outright try and detect what version was in the environment. But that would have to be a future todo item.
So, looking at the flux factory this range selection looks easy to implement on the dispatch side at least (though sounds like detecting flux version could be trouble?). Question there is are the existing versions amenable to that such that we could change to ranges bracketed by the existing adapters plus a >= 0.26, or do we need something more like exact versions for the old ones ( < 0.26 ) and use the ranges only going forward from 0.26?
the v0.26
adapter did work just fine, so i dont think any changes are needed.
the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.
the
v0.26
adapter did work just fine, so i dont think any changes are needed.the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.
The number at the top of the script is reflective of the backend adapter key that was used to generate and submit the script. That's so that there's a record of what was used for debug if something goes wrong. There would be something like that with the ranged versioning most likely since we'd want to be able to root cause issues on the right adapter.
This conversation is giving me ideas on how to best handle this issue, but as Jeremy said the ideal would be to detect the version which means that Flux has to be in the environment with the right Python bindings. If I'm recalling correctly, @jwhite242 was running into some issues with that on LC machines (and that was the reason for going with a purely Spack installed version of Flux + getting latest functionality).
the
v0.26
adapter did work just fine, so i dont think any changes are needed. the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.The number at the top of the script is reflective of the backend adapter key that was used to generate and submit the script. That's so that there's a record of what was used for debug if something goes wrong. There would be something like that with the ranged versioning most likely since we'd want to be able to root cause issues on the right adapter.
This conversation is giving me ideas on how to best handle this issue, but as Jeremy said the ideal would be to detect the version which means that Flux has to be in the environment with the right Python bindings. If I'm recalling correctly, @jwhite242 was running into some issues with that on LC machines (and that was the reason for going with a purely Spack installed version of Flux + getting latest functionality).
Yeah, until this gets resolved (https://github.com/flux-framework/flux-core/issues/2145) I think we do have to just continue building it fresh every time to ensure it's talking to the same python as maestro, or make a cli based version like with the other adapters if that ui is stable enough yet. Even with that though, that would be useful for provenance/debugging if we can get the actual version dumped into the generated files.
it does look like something is broken. just documenting the log here until i figure out more detail. @FrankD412 will recognize this mummi logging.
2022-05-18 14:48:56,952 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:504 - DEBUG - [148.312 MB]: [createsim] Fetching status for 45 jobs
2022-05-18 14:48:56,952 - [wf_manager:164728] maestrowf.interfaces.script.fluxscriptadapter:check_jobs:226 - DEBUG - [148.312 MB]: Joblist type -- <class 'list'>
2022-05-18 14:48:56,953 - [wf_manager:164728] maestrowf.interfaces.script.fluxscriptadapter:check_jobs:227 - DEBUG - [148.312 MB]: Joblist contents -- [989855790080, 995375494144, 1000928752640, 1006414902272, 1011917829120, 1017403978752, 1022890128384, 1028376278016, 1033895982080, 1039398908928, 1044885058560, 1050371208192, 1055924466688, 1061427393536, 1066913543168, 1072433247232, 1077919396864, 1083405546496, 1088891696128, 1094377845760, 1099863995392, 1105350145024, 1110819517440, 1116305667072, 1121791816704, 1127277966336, 1132764115968, 1138250265600, 1143736415232, 1149239342080, 1154742268928, 1160245195776, 1165731345408, 1171217495040, 1176703644672, 1182189794304, 1187726275584, 1193245979648, 1198732129280, 1204218278912, 1209721205760, 1215391904768, 1220945163264, 1226431312896, 1231934239744]
2022-05-18 14:48:56,953 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:get_statuses:129 - DEBUG - [148.312 MB]: Handle address -- 0x20001ffcb550
2022-05-18 14:48:56,991 - [wf_manager:164728] mummi_core.workflow.jobTracker:check_sim_status:263 - DEBUG - [148.312 MB]: [createsim] found (/gpfs/alpine/proj-shared/lrn005/Campaign4/roots/mummi_c4b_051422/sims-cg/mu18-0ras0raf2ras4A1raf4A-i6_000000001657)/(createsims_success)
2022-05-18 14:48:56,995 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:561 - DEBUG - [148.312 MB]: [JOBID 1132764115968, JOB Job[createsim]: id = 1132764115968, sims = ['mu18-0ras0raf2ras4A1raf4A-i6_000000001657']] : Status: [<SimulationStatus.Failed: 2>] | Running? True | Timedout? False | Continue? False | Cancel? True |
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:566 - DEBUG - [148.312 MB]: [createsim] sims: success = []
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:567 - DEBUG - [148.312 MB]: [createsim] sims: failed = ['mu18-0ras0raf2ras4A1raf4A-i6_000000001657']
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:568 - DEBUG - [148.312 MB]: [createsim] sims: continue = []
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:582 - INFO - [148.312 MB]: [createsim] processed all jobs. (#jobs: continue = 44, reclaim = 1, cancel = 44), (#sims: success = 0, failed = 1, continue = 0)
2022-05-18 14:48:56,998 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:600 - DEBUG - [148.312 MB]: [createsim] Cancelling simulations (njobs_cancel) = 1
2022-05-18 14:48:56,998 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:164 - DEBUG - [148.312 MB]: Handle address -- 0x20001ffcb550
2022-05-18 14:48:56,999 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:166 - DEBUG - [148.312 MB]: Attempting to cancel jobs.
Joblist:
1132764115968
2022-05-18 14:48:56,999 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:175 - DEBUG - [148.312 MB]: Cancelling Job 1132764115968...
2022-05-18 14:48:57,001 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:178 - ERROR - [148.312 MB]: [Errno 2] unknown job id
2022-05-18 14:48:57,001 - [wf_manager:164728] mummi_core.workflow.jobTracker:cancel_jobs:800 - ERROR - [148.312 MB]: [createsim] Unknown error: <maestrowf.interfaces.script.CancellationRecord object at 0x20001fffe340>
@bhatiaharsh -- do you happen to know if the job you're requesting to cancel is the failed job? I seem to recall that we try and cancel it anyway because the workflow checks for a dropped file. If that's the case, then it's likely cancelling a job that's been marked as failed.
@bhatiaharsh Just want to follow up on this -- was this resolved?
The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).
The following change is needed to the state method in FluxInterface_0260 to make it work:
maestrowf/interfaces/script/_flux/flux0_26_0.py
@staticmethod
def state(state):
if state == "D":
return State.PENDING
elif state == "S":
return State.QUEUED
elif state == "R":
return State.RUNNING
elif state == "C":
return State.FINISHING
elif state == "CD":
return State.FINISHED
elif state == "F":
return State.FAILED
elif state == "CA":
return State.CANCELLED
elif state == "TO":
return State.TIMEDOUT
else:
LOGGER.error(f"Unhandled state: {state}")
return State.UNKNOWN
The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).
The following change is needed to the state method in FluxInterface_0260 to make it work:
maestrowf/interfaces/script/_flux/flux0_26_0.py
@staticmethod def state(state): if state == "D": return State.PENDING elif state == "S": return State.QUEUED elif state == "R": return State.RUNNING elif state == "C": return State.FINISHING elif state == "CD": return State.FINISHED elif state == "F": return State.FAILED elif state == "CA": return State.CANCELLED elif state == "TO": return State.TIMEDOUT else: LOGGER.error(f"Unhandled state: {state}") return State.UNKNOWN
Thanks for sorting that out. I'll start rolling that up into an updated version of the adapter and get a patch out in the next couple days. @Jmast: Do you need a release pushed up to pypi for this asap or do you just work on the develop branch and don't care much about that?
We are working from released versions of maestro, so It would be very helpful If you could roll this out ASAP to pypi. Thanks, -- Jeff
On Thu, Oct 13, 2022 at 11:00 AM Jeremy White @.***> wrote:
The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).
The following change is needed to the state method in FluxInterface_0260 to make it work:
maestrowf/interfaces/script/_flux/flux0_26_0.py
@staticmethod def state(state): if state == "D": return State.PENDING elif state == "S": return State.QUEUED elif state == "R": return State.RUNNING elif state == "C": return State.FINISHING elif state == "CD": return State.FINISHED elif state == "F": return State.FAILED elif state == "CA": return State.CANCELLED elif state == "TO": return State.TIMEDOUT else: LOGGER.error(f"Unhandled state: {state}") return State.UNKNOWN
Thanks for sorting that out. I'll start rolling that up into an updated version of the adapter and get a patch out in the next couple days. @Jmast https://github.com/Jmast: Do you need a release pushed up to pypi for this asap or do you just work on the develop branch and don't care much about that?
— Reply to this email directly, view it on GitHub https://github.com/LLNL/maestrowf/pull/395#issuecomment-1277916796, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATXDIAIMGTRYRWR2JXCH53WDA53HANCNFSM5WDBWJ7A . You are receiving this because you were mentioned.Message ID: @.***>