FairMQ icon indicating copy to clipboard operation
FairMQ copied to clipboard

Remove catchall-rethrow indirections in the Device state wrappers and as a consequence also the ERROR transition

Open dennisklein opened this issue 4 years ago • 4 comments

In an offline discussion we (@rbx, @davidrohr, and myself) agreed (correct me if I am wrong) that having a cooperative signaling about an imminent terminate inside the about-to-terminate process is redundant, because a controller has to monitor and handle terminate events anyways from outside. Removing the internal signaling logic (catchall-transition-to-ERROR-rethrow) will improve debuggability.

This likely opens the possibility to remove the FairMQ ERROR state completely AFAICS. If we do not find any surprises, I suggest to remove it together with the change above.

Once we have a PR ready, let us collect approval by all Alice O2 subsystems, that might potentially be affected, since I really lost overview who could be relying on the ERROR state): EPN, DD, DPL, QC, FLP. Anyone else?

dennisklein avatar Jun 21 '21 09:06 dennisklein

I think to remove it completely we should at least have something equivalent in the SDK that would notify of the termination. Something on the controller side or in between should switch the device state from whatever it was when error happened to an Error State or similar.

I think leaving the ErrorState here is the simplest, even if the device doesn't automatically go into it (which we anyway cannot guarantee for crashes/terminations). Even if only for a reason that some controller may already have it in use. But perhaps it needs to be extended with further infos, like exit code.

rbx avatar Jun 21 '21 09:06 rbx

I think to remove it completely we should at least have something equivalent in the SDK that would notify of the termination. Something on the controller side or in between should switch the device state from whatever it was when error happened to an Error State or similar.

I think leaving the ErrorState here is the simplest, even if the device doesn't automatically go into it (which we anyway cannot guarantee for crashes/terminations). Even if only for a reason that some controller may already have it in use. But perhaps it needs to be extended with further infos, like exit code.

True, then let's revisit a possible removal of the ERROR state once we have proper DDS APIs to monitor the task termination from the SDK. (cc: @AnarManafov, @AndreyLebedev just FYI)

dennisklein avatar Jun 21 '21 09:06 dennisklein

@dennisklein , DDS ToolAPI now supports subscription on onTaskDone events. I have finished the implementation and pushed the code to the master last week. Example:

// create a session
SOnTaskDoneRequest::request_t request;
SOnTaskDoneRequest::ptr_t requestPtr = SOnTaskDoneRequest::makeRequest(request);
int nTaskDoneCount{ 0 };
requestPtr->setResponseCallback(
               [&nTaskDoneCount](const SOnTaskDoneResponseData& _info)
               {
                 ++nTaskDoneCount;
                 cout << "Recieved onTaskDone event. TaskID: " << _info.m_taskID
                      << " ; ExitCode: " << _info.m_exitCode
                      << " ; Signal: " << _info.m_signal;
               });
 session.sendRequest<SOnTaskDoneRequest>(requestPtr);
  ... 

AnarManafov avatar Jun 21 '21 10:06 AnarManafov

for reference: AliceO2Group/AliceO2#6499

dennisklein avatar Jun 29 '21 14:06 dennisklein