FairRoot Selective skipping of further event processing

Selective skipping of further event processing

Open klenze opened this issue 10 months ago • 4 comments

A common theme for any particular analysis of beam data is that one is not interested in every event, but only in some of them, e.g.:

events which occurred during spills
events in which there was no pileup
events with certain trigger conditions
events in which a certain detector saw a certain thing

At the moment, the way R3BRoot would implement the last cut (e.g. (p,2p) in califa) is to create a R3BCalifaClusterToP2PStatus which registers a TClonesArray of type R3BCalifaP2PStatusData which contains a single bool member to store the status. Then the R3BAwesomeHistogramTask would fetch the TCA and query its first element (if available) for the condition.

This would have to be repeated for every other cut (and every histogramming task). Even for events which are clearly not of interest, all the analysis steps would have to be run. And the output event tree always contains all the non-interesting events and would have to be filtered by an external "macro".

What I propose is to give any FairTask the ability to Nope further processing of the event by subsequent tasks (e.g. by calling their FairTask::CancelEvent()). The challenge here is to make that transparent to the user, if some unimportant detector decides to silently skip half of the events this will be painful for the users.

Therefore, I propose that

The first time any task cancels an event, a LOG line to that effect is written
After event processing is done, for every processor which cancelled at least one event, the number of events they cancelled is LOGged again.

Yes, "just skip processing that event" is not an elegant solution. I would argue that it is an instance of worse is better.

Unlike other event processing frameworks like e.g. Marlin, FairRoot has no way to specify conditional execution of certain tasks, and the syntax of FairRoot steering files (ROOT "C" macros) does not allow to gracefully add them. Nor would C++ lambda expressions be a good way to specify conditions for running a task, because the condition is likely to depend on specific FairRootManager managed objects which are not yet registered when the task is configured (and whose lookup-by-string in the event loop would be too expensive). (I guess one could use static variables to initialize these objects when the condition is evaluated for the first time, but I don't think this would be very readable.)

The ability to skip events is of course not equivalent to conditional task execution, but it should be easy to implement and would (imho) be helpful.

Aug 11 '23 23:08 klenze

If I understand you correctly, you could try TTask::SetBreakout() function (FairTask is derived from TTask). This basically stops all Exec() of the following tasks of the event. If you want to stops only a certain tasks, rather than all tasks, then you could chain those tasks into a single Task. I'm not sure it could work as I haven't tried this out. In any case, please check the source code of TTask for details.

Here begins the crazy idea: We should just write our own FairTask, independent of TTask. In the ROOT website, it's clearly stated that:

TTask is a legacy interface: there will be no bug fixes nor new developments. Therefore it is not recommended to use it in new long-term production code. But, depending on the context, using TTask might still be a valid solution.

If there is a bug in TTask, ROOT will never fix that.

Sep 01 '23 19:09 YanzhaoW

I tried reading the TTask source code and completely concur with ROOT's decision to make it obsolete. (I don't know why anyone would want a hierarchy of tasks instead of a list. IMHO, if a list is insufficient then you probably want a dependency graph. Also, why should every task have a list of subtasks? The file system equivalent would be to make every file also a directory.)

I agree that inheriting from TTask is not a good idea. It would be better to implement the basic functionality (no subtasks) in FairTask.

If FairRoot relies on TTask::ExecuteTasks(), SetBreakout could indeed be used for that. However, I would very much prefer not to use an undocumented feature in decades old legacy code for that. Besides, the logging is an important part of my proposal (because skipping events without logging will probably cause more trouble than it saves).

Sep 03 '23 01:09 klenze

The file system equivalent would be to make every file also a directory.

Well, for unix system, everything is file. And a directory is just a file containing the names of other files.

To be honest, I can't find any better structure than the tree structure in TTask. A list or a dependency graph is certainly not enough because we can't have task groups. Sometimes it's important to put different tasks into different groups such that they are not influenced by each other in any other way. Sometimes a group also needs some subgroups and a subgroup may need subsubgroups. And a tree structure is a pretty good choice to achieve this.

It would be better to implement the basic functionality (no subtasks) in FairTask.

I think subtask is very necessary as it can make tasks more modular and less coupling. The user, on the top level, only need to add one task, without concerning about the details inside. The developer, on the bottom level, is free to implement more or less subtasks when necessary without changing anything on the user's side.

Besides, the logging is an important part of my proposal (because skipping events without logging will probably cause more trouble than it saves).

This can be implemented in the task or by a global logger. I don't think it's necessary for FairTask to take care of logging. However, what FairTask or TTask could do better is to have a FinishSkippedEvent() function and call this function when the event is skipped. In other cases, call FinishEvent() when the event is not skipped. And then you could implement the logging yourself in the FinishSkippedEvent().

Sep 04 '23 11:09 YanzhaoW

Sometimes it's important to put different tasks into different groups such that they are not influenced by each other in any other way.

Without event skipping, the tasks will in fact be executed in some linear fashion (because we need intra-event multithreading like a gasoline spill needs a match). All of them depend on global state, so they can always influence each other.

With regard to the implementation of subtrees, let us agree to disagree. Some situations call for trees where there is paydata in every node (e.g. search trees), some call for trees where the data is stored only in the leafs (file systems). If every Task can have subtasks, one would have to decide if the main task gets executed before or after the subtasks. I would have a TaskList: public Task which only runs its subtasks. You could still nest them.

I think dependency tracking would be superior to explicit task grouping.

We could have

template <class T>
class Dependency: DependencyBase
{
Dependency(const std::string name);
T* operator->(); // or something
};

class BetterTask
{
...
protected:
  template<class T>
  Dependency<T> AddDependency(const std::string name);
private:
  std::vector<DependencyBase> fDependencies;
};
namespace califa
{
class Mapped2Cal: public BetterTask
{
   auto fMapped=AddDependency<r3b::califa::mapped>("CalifaMappedData");
   ...
};
};

Same for outputs. (For subsequent tasks which modify the same data structure, one could have an int inputVersion, int versionIncrement as optional arguments or something).

If someone instantiates califa::Mapped2Cal, FR could eventually figure out how to instantiate a Task which has a califa::mapped named "CalifaMappedData". Or at least which goes first. This would get us closer to the world where we do not have to use a Turing-complete language (ROOT macros) to configure jobs.

As a stopgap, BetterTask could be implemented on the back of the existing FairTask infrastructure (without dependency tracking or implicit task instantiation, then). It should inherit privately from FairTask and handle all the SetParameter and Init stuff once and for all, and just expose a virtual int Run()=0; method (called from legacy void Exec(Option_t*)).

Sep 04 '23 12:09 klenze

FairRoot FairRoot copied to clipboard

Selective skipping of further event processing

FairRoot
FairRoot copied to clipboard