ldmx-sw Break-Up Ecal Veto

Edit: I've given this more thought as I chatted with @tvami on slack.

Currently, the EcalVetoProcessor is very hefty. Moreover, a lot of the variables calculated by the veto processor can be used by other analyses. With this in mind, my proposal is to break up the current veto processor into different processors that create different event bus objects.

EcalHitStatistics - calculate various statistics of the hit collection that don't require other inputs. Things like energy-weighted average. These are relatively fast to calculate and don't require other inputs so could be just attached onto the end of the ECal reconstruction chain.
SingleElectronShowerFeatures - the shower-shape features for the single-electron BDT that require additional inputs. Mainly things that depend on knowledge of the incident electron direction like containment. These are not slow themselves, but would require track reconstruction within the Recoil and thus should be separated so that we can pre-select the data on simpler statistics from (1) before taking the time to do track reconstruction.
EcalMIPTracking - reconstruct straight and linear-regression tracks. These are helpful for other analyses (e.g. EaT), but could also be helpful for multi-electron anaylses. In addition, keeping it separate would allow us to avoid running it on already-veto'ed tracks (saving compute time) and enable the potential to develop an acts solution that can be swapped in-and-out as simple as a python config change.
SingleElectronTargetBDT - BDT for separting target-based dark brem from ecal PN. This is to separate it from other ML that could be focused on ecal-based dark brem (EaT) or another ML method (DNN) or other anlaysis channels (multi-electron). Instead of taking a single, partially-formed class as input, it would take in several event objects (pretty much all of the objects produced by the above processors) and then create a new "result" object (or set of simpler objects) that only contain the decision and other metrics (like the inference value).

Hopefully, breaking up the ecal veto in this way will make it more maintain-able. This will also be cause for some additions to the event model, but I can hold off from removing the EcalVetoResult object until we are comfortable with breaking on-disk backwards compatibility.

Apr 21 '21 12:04 tomeichlersmith

+1 on this and in particular, MIP tracking should be pulled out and not run as part of every event (it's a relatively time consuming algorithm and is intended to be used on a subset consisting of tricky events). If it's a separate processor people are free to use it on any set of events they want.

Mar 29 '23 13:03 bryngemark

@tomeichlersmith @bryngemark @vdutta The new version of BDT will include the number of straight tracks from MIP tracking in the feature list. It might be tricky to separate the MIP tracking out. The only way we could imagine is to train two versions of BDT, one with number of straight tracks and one without. But then the usual way of referring to the number of events after the BDT cut will be ambiguous. The time to evaluate two BDTs might be the same scale as running MIP tracking in the first place.

May 29 '24 20:05 danyi211

I'm not saying folks who want to run the BDT should not be required to run MIP tracking, what I'm saying is there are a lot of folks who just use the shower features (like me for instance).

Perhaps the new BDT would require MIP tracking. In this way, the BDT processor would require the MIP Tracking processor to have been run before it. Perhaps another BDT does not require the MIP tracking and as such that other BDT processor would not require the MIP Tracking processor to run before it. Factorizing the code in this way allows for people to opt in for certain requirements if they want. Does that make sense?

May 29 '24 20:05 tomeichlersmith

as a side note, how is it that the BDT requires MIP tracking nowadays? when it was developed, MIP tracking rejected the last 10 events that the BDT didn't already reject. is it entirely unthinkable to have a separate MIP-tracking based veto step (perhaps in the context of a BDT if that's needed) that is only applied after a first BDT fails to reject an event?

Jun 05 '24 20:06 bryngemark

The number of straight and the number of linear-regression MIP tracks are included as feature inputs to the newer BDTs. I think I agree with you that a "fast" BDT would be nice to have since the MIP tracking is time consuming especially on events with a lot of hits that are going to be rejected anyways for other easier reasons.

Another reason (from my point of view) to factorize so that different BDTs and selections can be more explicity about their requirements.

Jun 05 '24 20:06 tomeichlersmith

The BDT will use the number of straight tracks from MIP tracking (not linear regression tracks).

We are considering breaking up EcalVeto into three processors:

Calculate BDT variables except for MIP tracking variables
MIP tracking (number of straight and linear regression tracks)
BDT maker (will require input from the two processors above)

I understand it could be helpful to have a simple BDT without MIP tracks. We can try to compare the performance with and without the straight tracks + additional MIP track selections.

Jun 05 '24 20:06 danyi211

how is it that the BDT requires MIP tracking nowadays?

that's the "mip" in "segmip" :P

I think our plan is to have:

package that makes the variables stored in "EcalShowerFeatures"
package that makes the MIP tracking variable
package that reads in those above and runs the BDT
- we could experiment what happens if we plug "-1"s in the BDT to the MIP tracking features first, and then have a second BDT that does the full fledged "segmip"

Jun 05 '24 20:06 tvami

(sorry Danyi, I think we typed at the same time! [but at least we are saying the same thing :D ])

Jun 05 '24 20:06 tvami

thanks all for your patience, I think I was a little sloppy -- I understand and am aware that it is included, I was mostly wondering why it was considered necessary.

but anyways! it sounds like we have a good path ahead, I think the suggested split sounds really great and will cover the different needs really well.

Jun 06 '24 06:06 bryngemark

I ran a version where MIP tracking is set to -1 for the BDT input in both signal and bkg: https://github.com/LDMX-Software/ldmx-sw/actions/runs/11320865028 The results are not terrible, but clearly I need a bigger stat study.

Signal: EcalVetoResults_EcalVetoResults_bdt_disc.pdf

Kaons: EcalVetoResults_EcalVetoResults_bdt_disc.pdf

EcalPN: EcalVetoResults_EcalVetoResults_bdt_disc.pdf

Oct 14 '24 16:10 tvami

Here is the breakup below of how long things take.

With this we'll really just decouple the MIP tracking i.e from mip_tracking_setup to set_variables and keep the rest as it is, they dont take very long.

So our plan with @JYoo001 is to have an output of "hits along the photon projection" (i.e. trackingHitList) in a collection from EcalVetoProcessor and then have a EcalMipTrackingProcessor that reads that in and makes all the MIP tracking related variables. We plan to keep the BDT rely on EcalVetoProcessor, and use EcalMipTrackingProcessor as a cut based approach. We will also have a flag if we want to run the lin-reg in EcalMipTrackingProcessor or not.

Inclusive:

[ ecalVeto ] 10000  info: AVG Time/Event: 3.75 ms
[ ecalVeto ] 10000  info: Breakdown::
[ ecalVeto ] 10000  info: setup                  Avg Time/Event = 0.010 ms
[ ecalVeto ] 10000  info: recoil_electron        Avg Time/Event = 0.059 ms
[ ecalVeto ] 10000  info: trajectories           Avg Time/Event = 0.006 ms
[ ecalVeto ] 10000  info: roc_var                Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000  info: fill_hitmaps           Avg Time/Event = 0.081 ms
[ ecalVeto ] 10000  info: containment_var        Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000  info: mip_tracking_setup     Avg Time/Event = 0.042 ms
[ ecalVeto ] 10000  info: straight_tracks        Avg Time/Event = 0.015 ms
[ ecalVeto ] 10000  info: linreg_tracks          Avg Time/Event = 3.379 ms
[ ecalVeto ] 10000  info: set_variables           Avg Time/Event = 0.026 ms
[ ecalVeto ] 10000  info: bdt_variables           Avg Time/Event = 0.128 ms

Ecal PN:

[ ecalVeto ] 10000  info: AVG Time/Event: 7.78 ms
[ ecalVeto ] 10000  info: Breakdown::
[ ecalVeto ] 10000  info: setup                  Avg Time/Event = 0.014 ms
[ ecalVeto ] 10000  info: recoil_electron        Avg Time/Event = 0.072 ms
[ ecalVeto ] 10000  info: trajectories           Avg Time/Event = 0.006 ms
[ ecalVeto ] 10000  info: roc_var                Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000  info: fill_hitmaps           Avg Time/Event = 0.111 ms
[ ecalVeto ] 10000  info: containment_var        Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000  info: mip_tracking_setup     Avg Time/Event = 0.057 ms
[ ecalVeto ] 10000  info: straight_tracks        Avg Time/Event = 0.038 ms
[ ecalVeto ] 10000  info: linreg_tracks          Avg Time/Event = 7.312 ms
[ ecalVeto ] 10000  info: set_variables           Avg Time/Event = 0.029 ms
[ ecalVeto ] 10000  info: bdt_variables           Avg Time/Event = 0.137 ms

Kaons:

[ ecalVeto ] 5000  info: AVG Time/Event: 6.23 ms
[ ecalVeto ] 5000  info: Breakdown::
[ ecalVeto ] 5000  info: setup                  Avg Time/Event = 0.017 ms
[ ecalVeto ] 5000  info: recoil_electron        Avg Time/Event = 0.071 ms
[ ecalVeto ] 5000  info: trajectories           Avg Time/Event = 0.007 ms
[ ecalVeto ] 5000  info: roc_var                Avg Time/Event = 0.001 ms
[ ecalVeto ] 5000  info: fill_hitmaps           Avg Time/Event = 0.107 ms
[ ecalVeto ] 5000  info: containment_var        Avg Time/Event = 0.003 ms
[ ecalVeto ] 5000  info: mip_tracking_setup     Avg Time/Event = 0.057 ms
[ ecalVeto ] 5000  info: straight_tracks        Avg Time/Event = 0.031 ms
[ ecalVeto ] 5000  info: linreg_tracks          Avg Time/Event = 5.742 ms
[ ecalVeto ] 5000  info: set_variables           Avg Time/Event = 0.029 ms
[ ecalVeto ] 5000  info: bdt_variables           Avg Time/Event = 0.163 ms
---- LDMXSW: Event processing complete  --------

2e PU:

[ ecalVetoBDT ] 5000  info: AVG Time/Event: 25.33 ms
[ ecalVetoBDT ] 5000  info: Breakdown::
[ ecalVetoBDT ] 5000  info: setup                  Avg Time/Event = 0.019 ms
[ ecalVetoBDT ] 5000  info: recoil_electron        Avg Time/Event = 0.087 ms
[ ecalVetoBDT ] 5000  info: trajectories           Avg Time/Event = 0.007 ms
[ ecalVetoBDT ] 5000  info: roc_var                Avg Time/Event = 0.001 ms
[ ecalVetoBDT ] 5000  info: fill_hitmaps           Avg Time/Event = 0.185 ms
[ ecalVetoBDT ] 5000  info: containment_var        Avg Time/Event = 0.004 ms
[ ecalVetoBDT ] 5000  info: mip_tracking_setup     Avg Time/Event = 0.107 ms
[ ecalVetoBDT ] 5000  info: straight_tracks        Avg Time/Event = 0.089 ms
[ ecalVetoBDT ] 5000  info: linreg_tracks          Avg Time/Event = 24.661 ms
[ ecalVetoBDT ] 5000  info: set_variables           Avg Time/Event = 0.033 ms
[ ecalVetoBDT ] 5000  info: bdt_variables           Avg Time/Event = 0.134 ms

Signal:

[ ecalVeto ] 10000  info: AVG Time/Event: 0.56 ms
[ ecalVeto ] 10000  info: Breakdown::
[ ecalVeto ] 10000  info: setup                  Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000  info: recoil_electron        Avg Time/Event = 0.052 ms
[ ecalVeto ] 10000  info: trajectories           Avg Time/Event = 0.005 ms
[ ecalVeto ] 10000  info: roc_var                Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000  info: fill_hitmaps           Avg Time/Event = 0.025 ms
[ ecalVeto ] 10000  info: containment_var        Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000  info: mip_tracking_setup     Avg Time/Event = 0.021 ms
[ ecalVeto ] 10000  info: straight_tracks        Avg Time/Event = 0.008 ms
[ ecalVeto ] 10000  info: linreg_tracks          Avg Time/Event = 0.275 ms
[ ecalVeto ] 10000  info: set_variables           Avg Time/Event = 0.025 ms
[ ecalVeto ] 10000  info: bdt_variables           Avg Time/Event = 0.136 ms

Reduced:

[ ecalVeto ] 10000  info: AVG Time/Event: 1.31 ms
[ ecalVeto ] 10000  info: Breakdown::
[ ecalVeto ] 10000  info: setup                  Avg Time/Event = 0.004 ms
[ ecalVeto ] 10000  info: recoil_electron        Avg Time/Event = 0.089 ms
[ ecalVeto ] 10000  info: trajectories           Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000  info: roc_var                Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000  info: fill_hitmaps           Avg Time/Event = 0.033 ms
[ ecalVeto ] 10000  info: containment_var        Avg Time/Event = 0.005 ms
[ ecalVeto ] 10000  info: mip_tracking_setup     Avg Time/Event = 0.020 ms
[ ecalVeto ] 10000  info: straight_tracks        Avg Time/Event = 0.036 ms
[ ecalVeto ] 10000  info: linreg_tracks          Avg Time/Event = 0.965 ms
[ ecalVeto ] 10000  info: set_variables           Avg Time/Event = 0.027 ms
[ ecalVeto ] 10000  info: bdt_variables           Avg Time/Event = 0.125 ms

Feb 19 '25 21:02 tvami

I want to push strongly against only pulling out the MIP tracking. The idea behind braking up the processor is not only for performance reasons but also to make the code more maintainable and understandable. If all that can be done at this time is to pull out the tracking, that's okay and definitely a step in the right direction. I just also want to either keep this issue open or open a new one that is still focused on separating the processor into its constituent parts.

Feb 20 '25 14:02 tomeichlersmith

We could try to have more functions it the EcalVeto, but w/o the tracking it's a 500 lines of code so it's not too bad. There is a lot of interdependence in the rest so I'm not fully convinced that having a 3rd producer will help the maintenance / understandability. Form the point of view of the cutflow I'm imagining for the analysis: everything that goes into the BDT should be one producer / veto, and then there is a MIP tracking veto. (With this I'm also saying not to use MIP tracking in the BDT).

Does having more functions in the BDT veto file sound like a compromise to you?

Feb 20 '25 19:02 tvami

Idk maybe it's fine to do what you propose, let's come back to it after we took up the MIP tracking

Feb 20 '25 19:02 tvami

I guess I just see the calculation of these features as naturally disconnected from running a BDT or some other form of ML inference which makes it seem "natural" to me to separate them into different processors. Its not really lines of code to me, its more the purpose of the code.

Feb 20 '25 19:02 tomeichlersmith