Memory bug patching/ASAN
List of memory issues caught by ASAN/UBSAN and their fix-status
- [ ] PN DQM missing &
- [x] Ecal veto buffer overflow missing
.size() - [x] https://github.com/LDMX-Software/Hcal/issues/58
- [ ] Ecal veto mapsx/mapsy
- [ ] Undefined
noiseentry in calorimeter hits - [ ] Missing case in hcal scintillator orientation
- [ ] SimSpecialID UB? Probably not worth dealing with
Describe the bug
A missing & in one of the helper functions in the PN DQM code meant that the particle map was copied rather than referenced which made any pointers to simparticles therein invalid after the scope of the function. This bug hasn't had any impact on the DQM output, if it did we would have seen some difference. Regardless, we should of course patch it.
The culprit is here https://github.com/LDMX-Software/ldmx-sw/blob/0079ce00ff851fd427f2191c45c72d7857c37c17/DQM/include/DQM/PhotoNuclearDQM.h#L96-L110
Where the map is supposed to be passed in by const ref.
I caught this while trying to diagnose a completely different issue by running with address sanitizer enabled. I'm wondering if it would be useful to include these kinds of tools directly in our DQM jobs. At least ASAN has a minimal performance overhead. UBSAN could be interesting, but it doesn't by default kill a job if it catches something (it has some false positives) so you would only benefit from it if you read the logs.
Caught another one in the Ecal veto/geometry code. When we are trying to fill the isolated hit map, we assume the list of neighbouring cells that we get from the Ecal geometry object to always have 6 entries. However, I'm seeing some events where you get a smaller number of neighbours (e.g. 4). This makes
cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(), cellNbrIds[k].cell()); do a buffer overflow.
I'm not sure if the assumption that there are always 6 neighbours is correct, if so there's another bug in some other place (e.g. the ecal geometry code). If not, the solution is to just loop over the size of the vector. @tomeichlersmith maybe you know?
https://github.com/LDMX-Software/Ecal/blob/bb668db60852b86d85eab1bce910c64be02b16e2/src/Ecal/EcalVetoProcessor.cxx#L907-L922
// Skip hits that have a readout neighbor
// Get neighboring cell id's and try to look them up in the full cell map
// (constant speed algo.)
// these ideas are only cell/module (must ignore layer)
std::vector<ldmx::EcalID> cellNbrIds = geometry_->getNN(id);
for (int k = 0; k < 6; k++) {
// update neighbor ID to the current layer
cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(),
cellNbrIds[k].cell());
// look in cell hit map to see if it is there
if (cellMap_.find(cellNbrIds[k]) != cellMap_.end()) {
isolatedHit = std::make_pair(false, cellNbrIds[k]);
break;
}
}
I am also seeing an overflow in
dis[0] = faceXY[0] - mapsx[index + step];
Which I'm guessing has to be related to the mapsx access.
Index and step are defined the following way
int step = 0;
std::vector<float>::iterator it;
it = std::lower_bound(mapsx.begin(), mapsx.end(), faceXY[0]);
index = std::distance(mapsx.begin(), it);
if (index == mapsx.size()) {
index += -1;
}
So we can never have an issue because index is out of bounds. However, if index is max and step > 0 then this will also be an overflow.
Step is updated below
if (up == 0) {
step += 1;
} else {
step += -1;
}
So, I'm guessing that we are entering the first branch here when index is max which then overflows (and the same will happen for the y-map). This could actually be a bit more of an issue than the previous ones, if we have bad luck then writing to index+step for mapsx could be writing into mapsy's memory
The relevant code is here https://github.com/LDMX-Software/Ecal/blob/bb668db60852b86d85eab1bce910c64be02b16e2/src/Ecal/EcalVetoProcessor.cxx#L492-L536
https://github.com/LDMX-Software/Hcal/issues/58 was also caught by this
The assumption that there are always 6 neighbors is incorrect, there will be cells on the edge of the flowers that have less than 6 neighbors so ASAN has indeed found a bug for us. (hopefully only in memory and not in values).
Ok, the patch for that is straight-forward. I'm less sure about the mapsx/mapsy code. I'm not entirely sure what it is doing, so I can only patch it in a brute-force manner by checking the index + step values. Do you think this issue arises from the same kind of thing?
Following the discussion at the software developers meeting, I'll convert this to an issue about memory issues in general.
Yea, I'm struggling to understand the mapsx/mapsy code as well - I think flagging it as an issue in the Ecal repo will allow it to be on my To-Do list (probably done at the same time I break up the enormouse EcalVetoProcessor).
Updating this to be about all the issues that need to be dealt with if we want to be able to require ASAN clean builds for running validation. Keeping a list of all issues I've identified at the top
I'm currently working on some testing for the Hcal and happened to run into a couple of new issues
getScintillatorOrientationin the HcalGeometry doesn't handle the back properly for v12/v13 (missing switch case)- The
EcalRecProducerandHcalRecProducerdoesn't set theisNoisemember for theEcalHit/HcalHits. Since these aren't given a default value in theCalorimeterHit, that means theisNoisemember ends up containing junk which are virtually guaranteed to correspond totrue. Notably, this actually shows up in the output from the HCal test suite which has aCHECKonisNoisethat fails but this doesn't show up in thectestoutput :( - Extremely minor and probably not worth fixing, but the
SimSpecialID ssid2(SimSpecialID::SimSpecialType(9), 0xFEDCB);code in the DetectorID tests is undefined behavior.
Lastly, ASAN reports a bunch of use-after-free in the Framework tests. These seem to be related to conditions that have already been destructed. Here's a very condensed version of one of those reports (most of the catch internals removed)
==17==ERROR: AddressSanitizer: heap-use-after-free on address 0x60c000000dc0 at pc 0x55655e7eb1a2 bp 0x7fff47402580 sp 0x7fff47402570
READ of size 8 at 0x60c000000dc0 thread T0
#0 0x55655e7eb1a1 in void conditions::test::matchesMeta<conditions::DoubleTableCondition>(conditions::DoubleTableCondition const&, conditions::DoubleTableCondition const&) /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:27
#1 0x55655e7c237a in conditions::test::matchesAll(conditions::DoubleTableCondition const&, conditions::DoubleTableCondition const&) /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:32
0x60c000000dc0 is located 0 bytes inside of 128-byte region [0x60c000000dc0,0x60c000000e40)
freed by thread T0 here:
#0 0x7f60155c322f in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
#1 0x55655e7e646c in conditions::DoubleTableCondition::~DoubleTableCondition() /home/einarelen/ldmx/ldmx-sw/Conditions/include/Conditions/SimpleTableCondition.h:193
#2 0x7f601421c270 in framework::ConditionsObjectProvider::releaseConditionsObject(framework::ConditionsObject const*) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/ConditionsObjectProvider.h:90
#3 0x7f6010050ebe in framework::Conditions::getConditionPtr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/src/Framework/Conditions.cxx:91
#4 0x55655e7f6739 in conditions::DoubleTableCondition const& framework::Conditions::getCondition<conditions::DoubleTableCondition>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/Conditions.h:84
previously allocated by thread T0 here:
#0 0x7f60155c21c7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
#1 0x7f6010d6c967 in conditions::SimpleCSVTableProvider::getCondition(ldmx::EventHeader const&) /home/einarelen/ldmx/ldmx-sw/Conditions/src/Conditions/SimpleCSVTableProvider.cxx:190
#2 0x7f60100502c2 in framework::Conditions::getConditionPtr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/src/Framework/Conditions.cxx:69
#3 0x55655e7f6739 in conditions::DoubleTableCondition const& framework::Conditions::getCondition<conditions::DoubleTableCondition>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/Conditions.h:84
#4 0x55655e7cc8b1 in CATCH2_INTERNAL_TEST_0 /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:225
@tomeichlersmith I know you dealt with a bunch of issues with creating multiple instances of the Process object when working on the docker upgrade. Do you think this could be a similar issue but for conditions?
Hmm, might not be the same kind of issue. Tried running the tests one by one and you only get the problem for the Conditions test
One last thing, I keep seeing TypeError: 'float' object cannot be interpreted as an integer at the start of the process. I'm not sure where this comes from or what it is about.
In good news, after patching all of the above (hacking around the mapsx/mapsy-issue), we are ASAN-safe and mostly UBSAN-safe. Tried all of the validation configs
The TypeError: 'float' object cannot be interpreted as an integer originates in ConfigurePython. I am unsure on where it is coming from however since it is not crashing the program either.
For the use-after-free issues, I'm guessing that it has something to do with the tests where we load conditions via the ConfigurePython->Process pipeline and then test them.
https://github.com/LDMX-Software/Conditions/blob/4056cf8684bd6eea0e13e4ab3e24958ccf8124e3/test/TestConditions.cxx#L182-L195
Ecal veto buffer overflow missing .size()
I put the tick in for this as this was fixed already in https://github.com/LDMX-Software/ldmx-sw/commit/5c6bee2126998962e6af45bae2b4bd0498c4b32f
@EinarElen, when you have some time can you send instruction on how you ran ASAN?
You need to build ldmx-sw with some additional cmake commands, in particular
-DENABLE_SANITIZER_ADDRESS=ON
There is some additional documentation in the cmake module, I think you'll want the stuff about recovering on error
https://github.com/LDMX-Software/ldmx-sw/blob/trunk/cmake%2FSanitizers.cmake
Hey @EinarElen I'm back to this a bit again, I ran
ldmx cmake -B build -S . -DENABLE_SANITIZER_ADDRESS=ON
ldmx cmake --build build --target install -- -j$(nproc)
and it didnt show anything new. Are you doing something else too?
It is a runtime thing, running with the sanitizer settings when you build enables it. So just run your simulation and it should crash at some point :)
OK so good news it that I can ran this. Bad news is that the first thing I see, I have very little idea on what to do with. The only thing I understand is that there is potentially a bug in the EcalVeto. This is what I see
=7==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6290002761fc at pc 0x7f0df47a8a55 bp 0x7ffd2ab849b0 sp 0x7ffd2ab849a0
READ of size 4 at 0x6290002761fc thread T0
#0 0x7f0df47a8a54 in ecal::EcalVetoProcessor::produce(framework::Event&) (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libEcal.so+0x1a8a54)
#1 0x7f0e01b8444a in framework::Process::process(int, framework::Event&) const (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libFramework.so+0x18444a)
#2 0x7f0e01b8b6a6 in framework::Process::run() (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libFramework.so+0x18b6a6)
#3 0x5573a5fe23a6 in main (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/bin/fire+0x93a6)
#4 0x7f0dff229d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#5 0x7f0dff229e3f in __libc_start_main_impl ../csu/libc-start.c:392
#6 0x5573a5fe2b54 in _start (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/bin/fire+0x9b54)
0x6290002761fc is located 4 bytes to the left of 16384-byte region [0x629000276200,0x62900027a200)
allocated by thread T0 here:
#0 0x7f0e01eb61e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
#1 0x7f0e00e3a27d in void std::vector<float, std::allocator<float> >::_M_realloc_insert<float const&>(__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float const&) (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libSimCore_Event.so+0x427d)
SUMMARY: AddressSanitizer: heap-buffer-overflow (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libEcal.so+0x1a8a54) in ecal::EcalVetoProcessor::produce(framework::Event&)
Shadow bytes around the buggy address:
0x0c5280046be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5280046bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5280046c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5280046c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5280046c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c5280046c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
0x0c5280046c40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5280046c50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5280046c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5280046c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5280046c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
@EinarElen any ideas on how to extract more info on what's going on?
Alternatively I'll do some couts here and there
Usually it is worth to run a debug build to get a bit more info out of it, but regardless it still tells you that there is a buffer overflow in EcalVetoProcessor::produce.
AFAIK it is this
https://github.com/LDMX-Software/ldmx-sw/issues/1166#issuecomment-1570187979
Ok I did a few couts, indeed it's the one with maps indices.
A good event
index 1304 step 0
dis[0]: -3.40018
dis[1]: 74.6287
index 1304 step 1
dis[0]: -3.40018
dis[1]: 59.6287
index 1304 step 2
dis[0]: -3.40018
dis[1]: 44.6287
index 1304 step 3
dis[0]: -3.40018
dis[1]: 29.6287
index 1304 step 4
dis[0]: -3.40018
dis[1]: 14.6287
index 1304 step 5
dis[0]: -3.40018
dis[1]: -0.371328
the problematic event
index 0 step 0
dis[0]: -19.848
dis[1]: 96.0028
index 0 step -1
>> breaks here
Then I find that the faceXY[0]: -262.335 which is outside of the range and thus results to the 0 index.
Another example, is where faceXY = 290.302 and
recoilPos[0] -276.52 recoilP[0] -60.2409 recoilP[2] 16.6094
so already he recoilPos is outside the ECAL
OK resolved the
Ecal veto mapsx/mapsy
in https://github.com/LDMX-Software/ldmx-sw/pull/1410
I have student Ananda who'll look into the noise generation, so in that PR I think we can take the
Undefined noise entry in calorimeter hits
part.
I'll prob have the DQM one on my next todo list.
As for the other two points, I dont know the HCAL enough to fix that one, while for the test, do you have another suggestion to test instead @EinarElen ?
I'll prob have the DQM one on my next todo list.
Ah this was easy, it's already fixed: https://github.com/LDMX-Software/ldmx-sw/blob/trunk/DQM/include/DQM/PhotoNuclearDQM.h#L104 (It was fixed in https://github.com/LDMX-Software/ldmx-sw/pull/1219) I ticked the entry in this issue.
isNoise is fixed in https://github.com/LDMX-Software/ldmx-sw/pull/1434 for the other two I need @EinarElen
https://github.com/LDMX-Software/ldmx-sw/blob/30e0b80a43a527f30ae08219eaec5529ea8de105/DetDescr/src/DetDescr/HcalGeometry.cxx#L91C1-L146C2
The first one should already be done :)
The other one, I think you just need to create a SimSpecialID in code built with undefined behaviour santizier
Yaay, all that list is now gone. I guess we can close this issue and have the rest of the problems in https://github.com/LDMX-Software/ldmx-sw/issues/1176 ?