cmssw Improve exception handling in endStream

Currently in endStream() for a given stream, if one Worker::endStream() throws an exception, the endStream() of remaining workers gets ignored in this loop https://github.com/cms-sw/cmssw/blob/695ed7856207edf9efdcd61781836d2d1ba8140d/FWCore/Framework/src/WorkerManager.cc#L129-L133

We should change this loop to similar as WorkerManager::endJob() https://github.com/cms-sw/cmssw/blob/695ed7856207edf9efdcd61781836d2d1ba8140d/FWCore/Framework/src/WorkerManager.cc#L83-L91 that each worker's endStream() is in its own try-catch block.

In addition, we should add logic (if it isn't there yet) that in case worker's beginStream() throws an exception, the endStream() should be called only for those workers whose beginStream() succeeded.

(this issue spurred from https://github.com/cms-sw/cmssw/pull/43814 and in particular https://github.com/cms-sw/cmssw/issues/38260)

Jan 31 '24 20:01 makortel

assign core

Jan 31 '24 20:01 makortel

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

Jan 31 '24 20:01 cmsbuild

cms-bot internal usage

Jan 31 '24 20:01 cmsbuild

A new Issue was created by @makortel Matti Kortelainen.

@smuzaffar, @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Jan 31 '24 20:01 cmsbuild

I'm working to implement the first part of the request in the initial comment above. I should have a PR soon.

Regarding the second part, I think this is not needed. beginJob and beginStream are not like beginRun or beginLumi. When there is an exception in beginJob or beginStream the process just shuts down immediately. We currently do not try to execute the endJob or endStream transitions at all. endJob and endStream are not run for any module nor any stream. This makes sense because there is not any data we can even imagine we might recover. The situation is different for runs and lumis, because we might imagine we've already processed many runs and lumis successfully and we want to exit cleanly to save those results either permanently or at least for debugging (even in that case I've always wondered if anyone actually uses this ability...).

There is only one beginJob transition and beginStream transition. If they fail, then nothing useful has happened yet that would be worth saving. There is no output file. The best thing is just to exit as soon as possible and try to get the exception message printed to help debug the problem. Any action after risks a crash (seg fault) instead of a clean exit with a non-zero exit value. There's no benefit. I think we are doing the right thing in this case already.

Maybe I am missing something. Is there a purpose to run endStream and endJob after an exception in beginJob or beginStream?

Mar 01 '24 17:03 wddgit

Maybe I am missing something. Is there a purpose to run endStream and endJob after an exception in beginJob or beginStream?

The only thing I could think of if in the logic of the destructor, it assumes that the 'end' was called if the 'begin' was called and if it isn't, it would lead to a crash. I don't really think such behavior is reasonable nor very likely.

Mar 01 '24 19:03 Dr15Jones

(should have recorded this earlier, IIRC the context for this issue was https://github.com/cms-sw/cmssw/pull/43814 and in particular https://github.com/cms-sw/cmssw/issues/38260)

We have https://github.com/cms-sw/cmssw/blob/85c8117751a37223e789fa5849903e548e10b859/HeterogeneousCore/SonicTriton/interface/TritonEDProducer.h#L29-L34 (the this->client_ is of type TritonClient, and is constructed in beginStream()).

Looking at TritonService, when it starts the fallback server, it does it in preBeginJob, and shuts it down in postEndJob.

If an exception is thrown during beginJob/beginStream, do we call the ActivityRegistry pre/postEndJob signals?

I have a feeling this "communicating to external thing" use case might need a bit more thought (and possibly changes in the Sonic side).

Mar 01 '24 21:03 makortel

If an exception is thrown during beginJob/beginStream, do we call the ActivityRegistry pre/postEndJob signals?

No. Not in the existing version of the code. If there is a reason, I suppose we could change that.

Mar 01 '24 22:03 wddgit

In cmsRun.cpp, I think proc.on is telling the sentry to call endJob if we are exiting with an exception. It is only turned on after beginJob.

Mar 01 '24 22:03 wddgit

Not that it matters, but I looked in the history. It's been that way since April 25 2006. Bill added that.

Mar 01 '24 22:03 wddgit

Notes from discussion

When looping over the callbacks of a signal, if one callback throws an exception, do we process the rest of the callbacks?
- Order of callbacks is unspecified, would be good to process all callbacks of a signal
If preX signal is run, the postX signal should be run too (this should already be the current behavior)
For data processing transitions, we want the endTransition to run for a module only if a beginTransition was successfully run for that module (really about https://github.com/cms-sw/cmssw/issues/42501)
- If a module's beginTransition throws an exception, the corresponding endTransition should not be called
Regarding summary producing, like lumi summary. The ideal behavior would be that if a streamBeginLumi throws an exception, follow the semantics of a stream skipping an empty lumi and process the lumiSummary by skipping the stream that threw the exception; if a streamEndLumi throws an exception, skip the summary production.
- We don't necessarily have to implement it this way though

Mar 04 '24 20:03 makortel

Adding to notes from our discussion:

We agreed that we would add more WorkerManagers so that we could support the Worker containing a single bool that records whether the begin transition succeeded. This will require having 3 WorkerManagers in StreamSchedule instead of 1 WorkerManager, one for runs, one for lumi, and one for beginStream/endStream. There is 1 StreamSchedule per stream. The number of WorkerManagers devoted to streams triples.

For global transitions, there is already 1 WorkerManager per concurrent run plus 1 WorkerManager per concurrent lumi. We currently reuse the 0th WorkerManager in the vector that holds the run and lumi global WorkerManagers for beginJob/endJob and beginProcessBlock/endProcessBlock. So we would need to add 2 more WorkerManagers to handle those global cases.

The alternative would be to add 4 bool's to each Worker.

bool beginJobOrStreamSucceeded_; bool beginProcessBlockSucceeded_; bool beginRunSucceeded_; bool beginLumiSucceeded_;

I think it will work either way. The second way would save a little memory and there would be less WorkerManagers... Maybe this is not significant. The downside of the second approach is that some Workers would actually use only 1 of the 4 bools and some would use 3 of the 4 bools and there is a little additional complication dealing with that.

I mention it now because there will be some nontrivial rework to implement it one way and then change to other way because we change our minds. I wanted to document that decision. I'm starting down the path of implementation with additional WorkerManagers now.

Mar 07 '24 16:03 wddgit