Break-Up Ecal Veto
Edit: I've given this more thought as I chatted with @tvami on slack.
Currently, the EcalVetoProcessor is very hefty. Moreover, a lot of the variables calculated by the veto processor can be used by other analyses. With this in mind, my proposal is to break up the current veto processor into different processors that create different event bus objects.
- EcalHitStatistics - calculate various statistics of the hit collection that don't require other inputs. Things like energy-weighted average. These are relatively fast to calculate and don't require other inputs so could be just attached onto the end of the ECal reconstruction chain.
- SingleElectronShowerFeatures - the shower-shape features for the single-electron BDT that require additional inputs. Mainly things that depend on knowledge of the incident electron direction like containment. These are not slow themselves, but would require track reconstruction within the Recoil and thus should be separated so that we can pre-select the data on simpler statistics from (1) before taking the time to do track reconstruction.
- EcalMIPTracking - reconstruct straight and linear-regression tracks. These are helpful for other analyses (e.g. EaT), but could also be helpful for multi-electron anaylses. In addition, keeping it separate would allow us to avoid running it on already-veto'ed tracks (saving compute time) and enable the potential to develop an acts solution that can be swapped in-and-out as simple as a python config change.
- SingleElectronTargetBDT - BDT for separting target-based dark brem from ecal PN. This is to separate it from other ML that could be focused on ecal-based dark brem (EaT) or another ML method (DNN) or other anlaysis channels (multi-electron). Instead of taking a single, partially-formed class as input, it would take in several event objects (pretty much all of the objects produced by the above processors) and then create a new "result" object (or set of simpler objects) that only contain the decision and other metrics (like the inference value).
Hopefully, breaking up the ecal veto in this way will make it more maintain-able. This will also be cause for some additions to the event model, but I can hold off from removing the EcalVetoResult object until we are comfortable with breaking on-disk backwards compatibility.
+1 on this and in particular, MIP tracking should be pulled out and not run as part of every event (it's a relatively time consuming algorithm and is intended to be used on a subset consisting of tricky events). If it's a separate processor people are free to use it on any set of events they want.
@tomeichlersmith @bryngemark @vdutta The new version of BDT will include the number of straight tracks from MIP tracking in the feature list. It might be tricky to separate the MIP tracking out. The only way we could imagine is to train two versions of BDT, one with number of straight tracks and one without. But then the usual way of referring to the number of events after the BDT cut will be ambiguous. The time to evaluate two BDTs might be the same scale as running MIP tracking in the first place.
I'm not saying folks who want to run the BDT should not be required to run MIP tracking, what I'm saying is there are a lot of folks who just use the shower features (like me for instance).
Perhaps the new BDT would require MIP tracking. In this way, the BDT processor would require the MIP Tracking processor to have been run before it. Perhaps another BDT does not require the MIP tracking and as such that other BDT processor would not require the MIP Tracking processor to run before it. Factorizing the code in this way allows for people to opt in for certain requirements if they want. Does that make sense?
as a side note, how is it that the BDT requires MIP tracking nowadays? when it was developed, MIP tracking rejected the last 10 events that the BDT didn't already reject. is it entirely unthinkable to have a separate MIP-tracking based veto step (perhaps in the context of a BDT if that's needed) that is only applied after a first BDT fails to reject an event?
The number of straight and the number of linear-regression MIP tracks are included as feature inputs to the newer BDTs. I think I agree with you that a "fast" BDT would be nice to have since the MIP tracking is time consuming especially on events with a lot of hits that are going to be rejected anyways for other easier reasons.
Another reason (from my point of view) to factorize so that different BDTs and selections can be more explicity about their requirements.
The BDT will use the number of straight tracks from MIP tracking (not linear regression tracks).
We are considering breaking up EcalVeto into three processors:
- Calculate BDT variables except for MIP tracking variables
- MIP tracking (number of straight and linear regression tracks)
- BDT maker (will require input from the two processors above)
I understand it could be helpful to have a simple BDT without MIP tracks. We can try to compare the performance with and without the straight tracks + additional MIP track selections.
how is it that the BDT requires MIP tracking nowadays?
that's the "mip" in "segmip" :P
I think our plan is to have:
- package that makes the variables stored in "EcalShowerFeatures"
- package that makes the MIP tracking variable
- package that reads in those above and runs the BDT
- we could experiment what happens if we plug "-1"s in the BDT to the MIP tracking features first, and then have a second BDT that does the full fledged "segmip"
(sorry Danyi, I think we typed at the same time! [but at least we are saying the same thing :D ])
thanks all for your patience, I think I was a little sloppy -- I understand and am aware that it is included, I was mostly wondering why it was considered necessary.
but anyways! it sounds like we have a good path ahead, I think the suggested split sounds really great and will cover the different needs really well.
I ran a version where MIP tracking is set to -1 for the BDT input in both signal and bkg: https://github.com/LDMX-Software/ldmx-sw/actions/runs/11320865028 The results are not terrible, but clearly I need a bigger stat study.
Signal: EcalVetoResults_EcalVetoResults_bdt_disc.pdf
Here is the breakup below of how long things take.
With this we'll really just decouple the MIP tracking i.e from mip_tracking_setup to set_variables and keep the rest as it is, they dont take very long.
So our plan with @JYoo001 is to have an output of "hits along the photon projection" (i.e. trackingHitList) in a collection from EcalVetoProcessor and then have a EcalMipTrackingProcessor that reads that in and makes all the MIP tracking related variables. We plan to keep the BDT rely on EcalVetoProcessor, and use EcalMipTrackingProcessor as a cut based approach. We will also have a flag if we want to run the lin-reg in EcalMipTrackingProcessor or not.
Inclusive:
[ ecalVeto ] 10000 info: AVG Time/Event: 3.75 ms
[ ecalVeto ] 10000 info: Breakdown::
[ ecalVeto ] 10000 info: setup Avg Time/Event = 0.010 ms
[ ecalVeto ] 10000 info: recoil_electron Avg Time/Event = 0.059 ms
[ ecalVeto ] 10000 info: trajectories Avg Time/Event = 0.006 ms
[ ecalVeto ] 10000 info: roc_var Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000 info: fill_hitmaps Avg Time/Event = 0.081 ms
[ ecalVeto ] 10000 info: containment_var Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000 info: mip_tracking_setup Avg Time/Event = 0.042 ms
[ ecalVeto ] 10000 info: straight_tracks Avg Time/Event = 0.015 ms
[ ecalVeto ] 10000 info: linreg_tracks Avg Time/Event = 3.379 ms
[ ecalVeto ] 10000 info: set_variables Avg Time/Event = 0.026 ms
[ ecalVeto ] 10000 info: bdt_variables Avg Time/Event = 0.128 ms
Ecal PN:
[ ecalVeto ] 10000 info: AVG Time/Event: 7.78 ms
[ ecalVeto ] 10000 info: Breakdown::
[ ecalVeto ] 10000 info: setup Avg Time/Event = 0.014 ms
[ ecalVeto ] 10000 info: recoil_electron Avg Time/Event = 0.072 ms
[ ecalVeto ] 10000 info: trajectories Avg Time/Event = 0.006 ms
[ ecalVeto ] 10000 info: roc_var Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000 info: fill_hitmaps Avg Time/Event = 0.111 ms
[ ecalVeto ] 10000 info: containment_var Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000 info: mip_tracking_setup Avg Time/Event = 0.057 ms
[ ecalVeto ] 10000 info: straight_tracks Avg Time/Event = 0.038 ms
[ ecalVeto ] 10000 info: linreg_tracks Avg Time/Event = 7.312 ms
[ ecalVeto ] 10000 info: set_variables Avg Time/Event = 0.029 ms
[ ecalVeto ] 10000 info: bdt_variables Avg Time/Event = 0.137 ms
Kaons:
[ ecalVeto ] 5000 info: AVG Time/Event: 6.23 ms
[ ecalVeto ] 5000 info: Breakdown::
[ ecalVeto ] 5000 info: setup Avg Time/Event = 0.017 ms
[ ecalVeto ] 5000 info: recoil_electron Avg Time/Event = 0.071 ms
[ ecalVeto ] 5000 info: trajectories Avg Time/Event = 0.007 ms
[ ecalVeto ] 5000 info: roc_var Avg Time/Event = 0.001 ms
[ ecalVeto ] 5000 info: fill_hitmaps Avg Time/Event = 0.107 ms
[ ecalVeto ] 5000 info: containment_var Avg Time/Event = 0.003 ms
[ ecalVeto ] 5000 info: mip_tracking_setup Avg Time/Event = 0.057 ms
[ ecalVeto ] 5000 info: straight_tracks Avg Time/Event = 0.031 ms
[ ecalVeto ] 5000 info: linreg_tracks Avg Time/Event = 5.742 ms
[ ecalVeto ] 5000 info: set_variables Avg Time/Event = 0.029 ms
[ ecalVeto ] 5000 info: bdt_variables Avg Time/Event = 0.163 ms
---- LDMXSW: Event processing complete --------
2e PU:
[ ecalVetoBDT ] 5000 info: AVG Time/Event: 25.33 ms
[ ecalVetoBDT ] 5000 info: Breakdown::
[ ecalVetoBDT ] 5000 info: setup Avg Time/Event = 0.019 ms
[ ecalVetoBDT ] 5000 info: recoil_electron Avg Time/Event = 0.087 ms
[ ecalVetoBDT ] 5000 info: trajectories Avg Time/Event = 0.007 ms
[ ecalVetoBDT ] 5000 info: roc_var Avg Time/Event = 0.001 ms
[ ecalVetoBDT ] 5000 info: fill_hitmaps Avg Time/Event = 0.185 ms
[ ecalVetoBDT ] 5000 info: containment_var Avg Time/Event = 0.004 ms
[ ecalVetoBDT ] 5000 info: mip_tracking_setup Avg Time/Event = 0.107 ms
[ ecalVetoBDT ] 5000 info: straight_tracks Avg Time/Event = 0.089 ms
[ ecalVetoBDT ] 5000 info: linreg_tracks Avg Time/Event = 24.661 ms
[ ecalVetoBDT ] 5000 info: set_variables Avg Time/Event = 0.033 ms
[ ecalVetoBDT ] 5000 info: bdt_variables Avg Time/Event = 0.134 ms
Signal:
[ ecalVeto ] 10000 info: AVG Time/Event: 0.56 ms
[ ecalVeto ] 10000 info: Breakdown::
[ ecalVeto ] 10000 info: setup Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000 info: recoil_electron Avg Time/Event = 0.052 ms
[ ecalVeto ] 10000 info: trajectories Avg Time/Event = 0.005 ms
[ ecalVeto ] 10000 info: roc_var Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000 info: fill_hitmaps Avg Time/Event = 0.025 ms
[ ecalVeto ] 10000 info: containment_var Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000 info: mip_tracking_setup Avg Time/Event = 0.021 ms
[ ecalVeto ] 10000 info: straight_tracks Avg Time/Event = 0.008 ms
[ ecalVeto ] 10000 info: linreg_tracks Avg Time/Event = 0.275 ms
[ ecalVeto ] 10000 info: set_variables Avg Time/Event = 0.025 ms
[ ecalVeto ] 10000 info: bdt_variables Avg Time/Event = 0.136 ms
Reduced:
[ ecalVeto ] 10000 info: AVG Time/Event: 1.31 ms
[ ecalVeto ] 10000 info: Breakdown::
[ ecalVeto ] 10000 info: setup Avg Time/Event = 0.004 ms
[ ecalVeto ] 10000 info: recoil_electron Avg Time/Event = 0.089 ms
[ ecalVeto ] 10000 info: trajectories Avg Time/Event = 0.003 ms
[ ecalVeto ] 10000 info: roc_var Avg Time/Event = 0.001 ms
[ ecalVeto ] 10000 info: fill_hitmaps Avg Time/Event = 0.033 ms
[ ecalVeto ] 10000 info: containment_var Avg Time/Event = 0.005 ms
[ ecalVeto ] 10000 info: mip_tracking_setup Avg Time/Event = 0.020 ms
[ ecalVeto ] 10000 info: straight_tracks Avg Time/Event = 0.036 ms
[ ecalVeto ] 10000 info: linreg_tracks Avg Time/Event = 0.965 ms
[ ecalVeto ] 10000 info: set_variables Avg Time/Event = 0.027 ms
[ ecalVeto ] 10000 info: bdt_variables Avg Time/Event = 0.125 ms
I want to push strongly against only pulling out the MIP tracking. The idea behind braking up the processor is not only for performance reasons but also to make the code more maintainable and understandable. If all that can be done at this time is to pull out the tracking, that's okay and definitely a step in the right direction. I just also want to either keep this issue open or open a new one that is still focused on separating the processor into its constituent parts.
We could try to have more functions it the EcalVeto, but w/o the tracking it's a 500 lines of code so it's not too bad. There is a lot of interdependence in the rest so I'm not fully convinced that having a 3rd producer will help the maintenance / understandability. Form the point of view of the cutflow I'm imagining for the analysis: everything that goes into the BDT should be one producer / veto, and then there is a MIP tracking veto. (With this I'm also saying not to use MIP tracking in the BDT).
Does having more functions in the BDT veto file sound like a compromise to you?
Idk maybe it's fine to do what you propose, let's come back to it after we took up the MIP tracking
I guess I just see the calculation of these features as naturally disconnected from running a BDT or some other form of ML inference which makes it seem "natural" to me to separate them into different processors. Its not really lines of code to me, its more the purpose of the code.