Bug in EcalClusterAnalyzer when running with en sample
Describe the bug
At around event 45 with seed = 1 when running the GENIE EN config I get
double free or corruption (out)
Aborted (core dumped)
To Reproduce Steps to reproduce the behavior: Just run the
.github/validation_samples/target_genie/config.py
and add
dqm.EcalClusterAnalyzer()
in the sequence and it will come up in the first 50 events.
Desired behavior
Smooth running.
Screenshots
Environment:
denv_name="ldmx"
denv_image="ldmx/dev:latest"
denv_mounts=""
denv_shell="/bin/bash -i"
denv_network="true"
apptainer version 1.2.5
LDMX_NUM_EVENTS=100
LDMX_RUN_NUMBER=1
DISPLAY=localhost:17.0
DENV_NAME=ldmx
DENV_RUNNER=apptainer
DENV_IMAGE=ldmx/dev:latest
FYI @bryngemark
[ EcalClusterAnalyzer ] 1 trace: Number of ECal Scoring Plane Hits: 52
[ EcalClusterAnalyzer ] 1 trace: SP Hit Track ID: 1
[ EcalClusterAnalyzer ] 1 trace: SP Hit Track ID: 8
[ EcalClusterAnalyzer ] 1 trace: SP Hit Track ID: 9
the issue is that we take the ID = 2 for the second recoil electron, but it's not there
OK so after fixing the above, there is something else:
free(): invalid next size (fast)
Aborted (core dumped)
it's kinda weird bc the analyzer does get to the end of it. So maybe something with a destructor?
Not sure if it's exactly the same issue, but I noticed in one of my other branches that the build-with-asan-ubsan was failing consistently on the same evt#. You can check the entire stack trace if you'd like, but the relevant part is:
[ ElectronCounter ] 5 info: Found 0 electrons (tracks) using input collection TriggerPadTracksY_ [ trigger ] 5 info: Got trigger energy cut 0 for 0 electrons counted in the event. [ trigger ] 5 info: Got trigger energy sum 8617.21; and decision is pass = false [ EcalClusterAnalyzer ] 5 info: Avg number of clusters per layer: 1 [ EcalClusterAnalyzer ] 5 info: Number of ECal CLUE clusters: 1, TS counted electrons: 0, SP electrons: 0 ================================================================= ==8==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5020005793f8 at pc 0x7fbbe397aac5 bp 0x7ffda1368980 sp 0x7ffda1368970 READ of size 8 at 0x5020005793f8 thread T0 #0 0x7fbbe397aac4 in dqm::EcalClusterAnalyzer::analyze(framework::Event const&) (/home/runner/work/ldmx-sw/ldmx-sw/install/lib/libDQM.so+0x177aac4) (BuildId: 4776f6bd15d0fb93cc7ac0659f56e9fff18d9d85)
"heap-buffer-overflow" seems to imply that some object is trying to access memory beyond its assigned block.
After some offline testing, I found out that this behavior is 1) replicable on the trunk and 2) inevitable with enough events (typically about 100 but highly variable), regardless of the starting seed/runnumber. However the current build-with-asan-ubsan only runs 15 events for a successful check, so most PRs pass this test without issue and I just got 'unlucky'.
I included some of the info logs in the segment of the error trace because notably, this event reports 0 electrons. Without checking anything, I'm surmising that the electron in the event was deflected before reaching the ECal, and maybe before even reaching the target.
This should be fine - except there's an array in EcalClusterAnalyzer::analyze (also seen in the stack trace) which goes out of range if the number of electrons in the ECal is 0.
The fastest fix is to just increase the size of this array by 1. I don't see any noticeable effect when I do so, except that the event continues to process and I can finish the validation test locally. If you'd like I can make a quick PR (literally just changing one number) and we can see how it looks with the full validation suite.
As a side note, I don't know how this has been running properly without the address sanitizer... I'm surprised that having an out-of-range array call every ~100 events when using the ECal clustering didn't break things more.
Thanks @cjbarton151 for digging into this! Please open a PR with the proposed fix. We could also return early if there are no clusters, right?
We could do that too - but I think that there could still be events where there are hits in the ECal with no electrons, and then these wouldn't be clustered. On the other hand, for events with no electrons and no hits in the ECal (HCal+pads only) the clustering algorithm still loops through the hits, probably costing a bit of runtime. I would have to play around with it a bit to see if I'm right, and if this has any noticeable effect, but both effects should be on the order of 1% due to the rarity of the electron not reaching the ECal.