pacemaker icon indicating copy to clipboard operation
pacemaker copied to clipboard

WIP: Make remote reads asynchronous

Open clumens opened this issue 5 months ago • 4 comments

I've been working through the comments in T855 and everything's been pretty easy up to the lrmd_tls_dispatch patch. When running the RemoteMigrate test in cts-lab, I'm seeing this:

Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:490)      trace: TLS dispatch triggered
Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:498)      info: BEFORE REMOTE READY
Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:500)      info: AFTER REMOTE READY
Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:503)      info: READY SAID OK
Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:519)      info: BEFORE READ AVAIL
Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:521)      info: AFTER READ AVAIL
Sep 03 12:01:41.043 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_tls_dispatch_async@lrmd_client.c:529)      info: READ SAID AGAIN
Sep 03 12:01:41.045 rhel9-ctslab-1 pacemaker-schedulerd[60447] ([email protected]:1747)  info: rhel9-ctslab-1 is active
Sep 03 12:01:41.045 rhel9-ctslab-1 pacemaker-schedulerd[60447] ([email protected]:1911)  info: rhel9-ctslab-1 is online
Sep 03 12:01:41.045 rhel9-ctslab-1 pacemaker-schedulerd[60447] ([email protected]:1747)  info: rhel9-ctslab-2 is active
Sep 03 12:01:41.045 rhel9-ctslab-1 pacemaker-schedulerd[60447] ([email protected]:1911)  info: rhel9-ctslab-2 is online
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning remote-rhel9-ctslab-3
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning Fencing
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning FencingFail
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning rsc_rhel9-ctslab-1
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning rsc_rhel9-ctslab-2
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning rsc_rhel9-ctslab-3
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning migrator
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning ping-1:0
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning ping-1:1
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning ping-1:2
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning ping-1:3
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__primitive_assign@pcmk_sched_primitive.c:542)      info: Resource ping-1:3 cannot run anywhere
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning stateful-1:0
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning stateful-1:1
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning stateful-1:2
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__primitive_assign@pcmk_sched_primitive.c:542)      info: Resource stateful-1:2 cannot run anywhere
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (set_instance_role@pcmk_sched_promotable.c:1065)         info: Choosing stateful-1:0 (Promoted) on rhel9-ctslab-1 for promotion
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__set_instance_roles@pcmk_sched_promotable.c:1097)  info: promotable-1: Promoted 1 instances of a possible 1
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning r192.168.122.204
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning petulant
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning r192.168.122.205
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__unassign_resource@pcmk_sched_resource.c:548)      info: Unassigning lsb-dummy
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (recurring_op_for_active@pcmk_sched_recurring.c:339)     info: Start 1m-interval monitor for ping-1:2 on remote-rhel9-ctslab-3
Sep 03 12:01:41.048 rhel9-ctslab-1 pacemaker-schedulerd[60447] (recurring_op_for_active@pcmk_sched_recurring.c:339)     info: Start 20s-interval monitor for remote-rhel9-ctslab-3 on rhel9-cts
lab-1
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   Fencing   (Started rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   FencingFail       (Started rhel9-ctslab-2)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   rsc_rhel9-ctslab-1        (Started rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   rsc_rhel9-ctslab-2        (Started rhel9-ctslab-2)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   rsc_rhel9-ctslab-3        (Started rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   migrator  (Started rhel9-ctslab-2)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   ping-1:0  (Started rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   ping-1:1  (Started rhel9-ctslab-2)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (log_list_item@output_log.c:232)         notice: Actions: Start      ping-1:2                  (                 remote-rhel9-ct
slab-3 )
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1123)  info: Leave   ping-1:3  (Stopped)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   stateful-1:0      (Promoted rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   stateful-1:1      (Unpromoted rhel9-ctslab-2)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1123)  info: Leave   stateful-1:2      (Stopped)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   r192.168.122.204  (Started rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   r192.168.122.205  (Started rhel9-ctslab-1)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   lsb-dummy (Started rhel9-ctslab-2)
Sep 03 12:01:41.049 rhel9-ctslab-1 pacemaker-schedulerd[60447] (rsc_action_default@pcmk_output.c:1206)  info: Leave   remote-rhel9-ctslab-3     (Started rhel9-ctslab-1)
Sep 03 12:01:41.050 rhel9-ctslab-1 pacemaker-schedulerd[60447] (pcmk__log_transition_summary@pcmk_graph_producer.c:965)         warning: Calculated transition 10 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1264.bz2
Sep 03 12:01:41.062 rhel9-ctslab-1 pacemaker-controld  [60448] (do_state_transition@controld_fsa.c:612)         info: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_reply
Sep 03 12:01:41.063 rhel9-ctslab-1 pacemaker-controld  [60448] (do_te_invoke@controld_transition.c:171)         info: Processing graph 10 (ref=pe_calc-60448_pacemaker-controld-1725379301-117) derived from /var/lib/pacemaker/pengine/pe-warn-1264.bz2
Sep 03 12:01:41.063 rhel9-ctslab-1 pacemaker-controld  [60448] (execute_rsc_action@controld_te_actions.c:404)   notice: Initiating monitor operation rsc_rhel9-ctslab-1_monitor_0 locally on remote-rhel9-ctslab-3 | action 14
Sep 03 12:01:41.063 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_send_command@lrmd_client.c:980)    trace: Sending lrmd_rsc_info op to executor
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (lrmd_create_op@lrmd_client.c:640)       trace: Created executor lrmd_rsc_info command with call options 00000000 (0)
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] ([email protected]:270)   error: pcmk__remote_message_xml: Triggered fatal assertion at remote.c:564 : remote->buffer[sizeof(struct remote_header_v0) + header->payload_uncompressed - 1] == 0
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (pcmk__log_xmllib_err@xml_display.c:39)  error: XML Error: Entity: line 1: parser error : Document is empty
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (pcmk__log_xmllib_err@xml_display.c:39)  error: XML Error: 
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (pcmk__log_xmllib_err@xml_display.c:39)  error: XML Error: ^
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (pcmk__log_xmllib_err@xml_display.c:39)  error: XML Error: Entity: line 1: parser error : Document is empty
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (pcmk__log_xmllib_err@xml_display.c:39)  error: XML Error: 
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] (pcmk__log_xmllib_err@xml_display.c:39)  error: XML Error: ^
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] ([email protected]:572)  error: Couldn't parse: ''
Sep 03 12:01:41.064 rhel9-ctslab-1 pacemaker-controld  [60448] ([email protected]:270)   error: localized_remote_header: Triggered fatal assertion at remote.c:105 : endian == ENDIAN_LOCAL

And then the errors go on from there. It looks like we notice there's going to be more to the message, but then never get back around to reading the next chunk and then somewhere, an empty message is being read. I would have expected to see "BEFORE REMOTE MSG" in the log if it were coming from lrmd_tls_dispatch_async. I could use a pointer on how I might further debug this.

clumens avatar Sep 03 '24 16:09 clumens