flux-sched

Can't find vertex on LC cluster

Open jameshcorbett opened this issue 1 year ago • 2 comments

The following messages are repeated fairly regularly on an LC cluster, each time with a different job ID.

[ +14.021503] sched-fluxion-resource[0]: run_remove: dfu_traverser_t::remove (id=345577048527356928): add_or_update: couldn't find vertex in graph for cluster1029.
[ +14.022637] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1029.
[ +14.022648] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1030.
[ +14.022653] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1030.
[ +14.022658] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1036.
[ +14.022663] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1036.
[ +14.022668] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1004.
[ +14.022672] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1004.
[ +14.022678] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1015.
[ +14.022683] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1015.
[ +14.022688] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1010.
[ +14.022692] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1010.
[ +14.022697] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1003.
[ +14.022703] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1003.
[ +14.022707] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1002.
[ +14.022723] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1002.
[ +14.022731] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1031.
[ +14.022735] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1031.
[ +14.022739] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1023.
[ +14.022743] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1023.
[ +14.022747] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1039.
[ +14.022751] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1039.
[ +14.022755] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1013.
[ +14.022758] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1013.
[ +14.022762] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1040.
[ +14.022768] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1040.
[ +14.022772] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1032.
[ +14.022775] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1032.
[ +14.022779] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1014.
[ +14.022783] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1014.
[ +14.022787] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1041.
[ +14.022790] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1041.
[ +14.022794] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1037.
[ +14.022798] sched-fluxion-resource[0]: unpack_rank: failed unpacking
[ +14.022808] sched-fluxion-resource[0]: partial_cancel_request_cb: remove fails due to match error (id=345577048527356928): Invalid argument
[ +14.023561] sched-fluxion-qmanager[0]: remove: .free RPC partial cancel failed for jobid 345577048527356928: Invalid argument
[ +14.023577] sched-fluxion-qmanager[0]: jobmanager_free_cb: remove (queue=pdev id=345577048527356928): Invalid argument

According to the JGF stored in the KVS, the vertices it reports as missing should be in the graph.
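
For anyone who wants to repeat that check, here is a minimal sketch. It assumes the JGF follows the usual JSON Graph Format layout, with vertices under `graph.nodes` and each vertex carrying `metadata.type` and `metadata.name`. The helper `missing_vertices` and the `sample_jgf` fragment are made up for illustration, and fetching the real JGF from the KVS is left out.

```python
import json


def missing_vertices(jgf, hostnames):
    """Return the hostnames that have no vertex of type 'node' in the JGF dict."""
    vertices = jgf.get("graph", {}).get("nodes", [])
    present = {
        v.get("metadata", {}).get("name")
        for v in vertices
        if v.get("metadata", {}).get("type") == "node"
    }
    return [h for h in hostnames if h not in present]


# Hypothetical JGF fragment, NOT the real graph from the cluster in this issue.
sample_jgf = json.loads("""
{
  "graph": {
    "nodes": [
      {"id": "0", "metadata": {"type": "cluster", "name": "cluster0"}},
      {"id": "1", "metadata": {"type": "node", "name": "cluster1029"}},
      {"id": "2", "metadata": {"type": "node", "name": "cluster1030"}}
    ],
    "edges": []
  }
}
""")

# Hostnames taken from the log above, plus one that is absent from the fragment.
missing = missing_vertices(sample_jgf, ["cluster1029", "cluster1030", "cluster9999"])
print(missing)
```

An empty result for the hostnames in the log would back up the observation that the vertices are present in the stored JGF, which would point at the scheduler's in-memory graph rather than the KVS copy.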

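Interestingly, the job wasn't even using most of those nodes (of the hosts named in the log, only cluster1032, cluster1036, cluster1037, and cluster1039 are in its allocation):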

flux jobs 345577048527356928
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 foXVU52MaqD pdev     user1 run.sh      F      8      8   1.563m cluster[1027,1032-1033,1035-1039]
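
The overlap between the hosts named in the log and the job's nodelist can be checked mechanically. The sketch below is only an illustration: `expand_hostlist` and `hosts_in_log` are hypothetical helpers written for this comment, the expansion handles a single `prefix[a,b-c]` bracket group without zero padding, and `sample_log` is a three-line excerpt of the output above.

```python
import re


def expand_hostlist(hostlist):
    """Expand a simple 'prefix[a,b-c]' hostlist into a set of hostnames."""
    m = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", hostlist)
    if not m:
        return {hostlist}  # plain hostname, nothing to expand
    prefix, body = m.groups()
    hosts = set()
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a single index such as "1027" has no upper bound
        for i in range(int(lo), int(hi) + 1):
            hosts.add(f"{prefix}{i}")
    return hosts


def hosts_in_log(log_text):
    """Collect the hostnames named in 'couldn't find vertex' messages."""
    return set(re.findall(r"couldn't find vertex in graph for ([\w-]+)\.", log_text))


# Three lines excerpted from the log in this issue.
sample_log = """\
[ +14.021503] sched-fluxion-resource[0]: run_remove: dfu_traverser_t::remove (id=345577048527356928): add_or_update: couldn't find vertex in graph for cluster1029.
[ +14.022658] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1036.
[ +14.022668] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1004.
"""

allocated = expand_hostlist("cluster[1027,1032-1033,1035-1039]")
reported = hosts_in_log(sample_log)

print("reported but not allocated:", sorted(reported - allocated))
print("reported and allocated:", sorted(reported & allocated))
```

Running the same comparison over the full log would show exactly which reported hosts fall outside the job's allocation.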

@zekemorton @milroy any thoughts?

jameshcorbett Oct 02 '24 18:10