
HLT crash caused by `SiPixelDigisClustersFromSoA` (run 357271)


In run 357271, one HLT job crashed with the following error message:

[2] Calling method for module SiPixelDigisClustersFromSoA/'hltSiPixelClustersFromSoA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 344794116

(the monitoring tool does not provide the full error message from cmsRun, afaik)

The error is reproducible (see recipe below). Since it originates from the GPU branch of the reconstruction sequence, it can be reproduced only on a machine with a GPU. The input file is currently on lxplus.

FYI: @fwyzard @silviodonato

cmsrel CMSSW_12_4_6
cd CMSSW_12_4_6/src
cmsenv

hltGetConfiguration \
  run:357271 \
  --globaltag 124X_dataRun3_HLT_v4 \
  --process HLT \
  --data \
  --unprescale \
  --output all \
  --input file:/afs/cern.ch/work/m/missirol/public/fog/edm_run357271_ls1351.root \
  > hlt.py

cmsRun hlt.py &> hlt.log

missirol avatar Aug 12 '22 15:08 missirol

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Aug 12 '22 15:08 cmsbuild

assign reconstruction, heterogeneous

makortel avatar Aug 12 '22 16:08 makortel

FYI @cms-sw/trk-dpg-l2 @VinInn

makortel avatar Aug 12 '22 16:08 makortel

New categories assigned: heterogeneous,reconstruction

@jpata,@fwyzard,@clacaputo,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Aug 12 '22 16:08 cmsbuild

I could reproduce it running on the original error stream file on the online machines:

cmsrel CMSSW_12_4_6
mkdir CMSSW_12_4_6/run
cd CMSSW_12_4_6/run
cmsenv

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 357271 > hlt.py

cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = [ '/store/error_stream/run357271/run357271_ls1351_index000134_fu-c2b03-14-01_pid4069417.raw' ]
@EOF

cmsRun hlt.py

I did get the same error, and there is a message shortly before it that may be related:

%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  13-Aug-2022 19:13:33 CEST Run: 357271 Event: 2627443577
cluster below charge Threshold Layer/DetId/clusId 0/344794116/1 size/charge 1/3741
%MSG
...
----- Begin Fatal Exception 13-Aug-2022 19:13:34 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 357271 lumi: 1351 event: 2627443577 stream: 1
   [1] Running path 'HLT_Mu10_TrkIsoVVL_DiPFJet40_DEta3p5_MJJ750_HTT350_PFMETNoMu60_v16'
   [2] Calling method for module SiPixelDigisClustersFromSoA/'hltSiPixelClustersFromSoA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 344794116
----- End Fatal Exception -------------------------------------------------

fwyzard avatar Aug 13 '22 17:08 fwyzard

I did a bit of investigation and the code is crashing in https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_6/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L114, where edmNew::DetSetVector<SiPixelCluster>::FastFiller is constructed. The constructor eventually ends up calling edmNew::DetSetVector<T>::addItem, which in turn calls edmNew::dstvdetails::errorIdExists, where an exception is thrown. The reason for the exception is that edmNew::DetSetVector<SiPixelCluster>::FastFiller assumes that a single DetId appears only once in the loop. Instead, what I observe in run: 357271 lumi: 1351 event: 2627443577 are DetIds which are not contiguous in the loop over digis in SiPixelDigisClustersFromSoA. Here is the printout

>> DetId digi: 344659972 39690
>> DetId digi: 344659972 39691
>> DetId digi: 344659972 39692
>> DetId digi: 344659972 39693
>> DetId: 344659972
>>>> clusId: 0
>> DetId digi: 344794116 39694
>> DetId digi: 344794116 39695
>> DetId digi: 344794116 39697
>> DetId digi: 344794116 39698
>> DetId digi: 344794116 39699
>> DetId digi: 344794116 39700
>> DetId digi: 344794116 39701
>> DetId digi: 344794116 39702
>> DetId digi: 344794116 39703
>> DetId digi: 344794116 39704
>> DetId digi: 344794116 39705
>> DetId digi: 344794116 39706
>> DetId: 344794116
>>>> clusId: 0
>>>> clusId: 1
%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  14-Aug-2022 18:48:14 CEST Run: 357271 Event: 2627443577
cluster below charge Threshold Layer/DetId/clusId 0/344794116/1 size/charge 1/3741
%MSG
>>>> clusId: 2
>>>> clusId: 3
>>>> clusId: 4
>>>> clusId: 5
>>>> clusId: 6
>>>> clusId: 7
>> DetId digi: 344795140 39707
>> DetId digi: 344795140 39708
>> DetId: 344795140
>>>> clusId: 0
>>>> clusId: 1
>> DetId digi: 344794116 39710
>> DetId digi: 344794116 39711
>> DetId digi: 344794116 39712
>> DetId digi: 344794116 39713
>> DetId digi: 344794116 39714
%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  14-Aug-2022 18:48:14 CEST Run: 357271 Event: 2627443577
Problem det present twice in input! 344794116
%MSG
>> DetId digi: 344795140 39716
>> DetId digi: 344795140 39717
>> DetId digi: 344795140 39718
>> DetId digi: 344795140 39719
>> DetId digi: 344795140 39720
>> DetId digi: 344795140 39721
>> DetId digi: 344795140 39722
>> DetId digi: 344795140 39723
>> DetId digi: 344795140 39724
>> DetId digi: 344795140 39725
>> DetId digi: 344795140 39726
>> DetId digi: 344795140 39727
>> DetId digi: 344795140 39728
>> DetId digi: 344795140 39729
>> DetId digi: 344795140 39730
>> DetId digi: 344795140 39731
>> DetId digi: 344795140 39732
>> DetId digi: 344795140 39733
>> DetId digi: 344795140 39734
>> DetId digi: 344795140 39735
%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  14-Aug-2022 18:48:14 CEST Run: 357271 Event: 2627443577
Problem det present twice in input! 344795140
%MSG

obtained by adding the following changes to SiPixelDigisClustersFromSoA

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
index d36c345ecf0..c0328d665b0 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
@@ -43,6 +43,8 @@ private:
   const bool produceDigis_;
   const bool storeDigis_;
   const bool isPhase2_;
+  const std::string moduleType_;
+  const std::string moduleLabel_;
 };
 
 SiPixelDigisClustersFromSoA::SiPixelDigisClustersFromSoA(const edm::ParameterSet& iConfig)
@@ -53,7 +55,9 @@ SiPixelDigisClustersFromSoA::SiPixelDigisClustersFromSoA(const edm::ParameterSet
                          iConfig.getParameter<int>("clusterThreshold_otherLayers")},
       produceDigis_(iConfig.getParameter<bool>("produceDigis")),
       storeDigis_(iConfig.getParameter<bool>("produceDigis") & iConfig.getParameter<bool>("storeDigis")),
-      isPhase2_(iConfig.getParameter<bool>("isPhase2")) {
+      isPhase2_(iConfig.getParameter<bool>("isPhase2")),
+      moduleType_(iConfig.getParameter<std::string>("@module_type")),
+      moduleLabel_(iConfig.getParameter<std::string>("@module_label")) {
   if (produceDigis_)
     digiPutToken_ = produces<edm::DetSetVector<PixelDigi>>();
 }
@@ -111,10 +115,18 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
   auto fillClusters = [&](uint32_t detId) {
     if (nclus < 0)
       return;  // this in reality should never happen
+    if (outputClusters->exists(detId)) {
+      edm::LogWarning("SiPixelDigisClustersFromSoA")
+              << "Problem det present twice in input! " << detId;
+      nclus = -1;
+      return;
+    }
     edmNew::DetSetVector<SiPixelCluster>::FastFiller spc(*outputClusters, detId);
+    std::cout << ">> DetId: " << detId << std::endl;
     auto layer = (DetId(detId).subdetId() == 1) ? ttopo.pxbLayer(detId) : 0;
     auto clusterThreshold = clusterThresholds_.getThresholdForLayerOnCondition(layer == 1);
     for (int32_t ic = 0; ic < nclus + 1; ++ic) {
+      std::cout << ">>>> clusId: " << ic << std::endl;
       auto const& acluster = aclusters[ic];
       // in any case we cannot  go out of sync with gpu...
       if (acluster.charge < clusterThreshold and !isPhase2_)
@@ -143,6 +155,9 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
       spc.abort();
   };
 
+  std::cout << "===========================================================================" << std::endl;
+  std::cout << moduleType_ << ":" << moduleLabel_ << " START" << std::endl;
+  std::cout << "===========================================================================" << std::endl;
   for (uint32_t i = 0; i < nDigis; i++) {
     // check for uninitialized digis
     if (digis.rawIdArr(i) == 0)
@@ -172,6 +187,7 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
         }
       }
     }
+    std::cout << ">> DetId digi: " << detId << " " << i << std::endl;
     PixelDigi dig(digis.pdigi(i));
     if (storeDigis_)
       (*detDigis).data.emplace_back(dig);
@@ -186,7 +202,9 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
     SiPixelCluster::PixelPos pix(row, col);
     aclusters[digis.clus(i)].add(pix, digis.adc(i));
   }
-
+  std::cout << "===========================================================================" << std::endl;
+  std::cout << moduleType_ << ":" << moduleLabel_ << " END" << std::endl;
+  std::cout << "===========================================================================" << std::endl;
   // fill final clusters
   if (detId > 0)
     fillClusters(detId);

The root cause of the problem therefore seems to be somewhere else, presumably in either SiPixelDigisSoAFromCUDA or SiPixelRawToClusterCUDA. @VinInn or @AdrianoDee might have a better idea of what the cause could be.

ferencek avatar Aug 14 '22 19:08 ferencek
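
The failure mode found above can be mimicked outside CMSSW with a few lines of plain C++: a loop that flushes its output whenever the DetId changes, writing into a container that, like the FastFiller above, accepts each DetId only once. The sketch below is a simplified analogy, not the actual CMSSW types; the OneShotOutput name and the hard-coded DetIds are made up for illustration (the DetIds follow the pattern in the printout).

#include <cstdint>
#include <iostream>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the one-shot filling of edmNew::DetSetVector through FastFiller:
// each detId may be "opened" at most once, otherwise an exception is thrown.
struct OneShotOutput {
  std::set<uint32_t> filled;
  void fill(uint32_t detId) {
    if (!filled.insert(detId).second)
      throw std::runtime_error("index already in collection; index value: " + std::to_string(detId));
  }
};

int main() {
  // detIds of consecutive digis in stream order, non-contiguous as in the printout
  std::vector<uint32_t> digiDetIds = {344659972, 344794116, 344795140, 344794116};
  OneShotOutput out;
  uint32_t current = 0;
  try {
    for (uint32_t id : digiDetIds) {
      if (id != current) {  // detId changed: flush the clusters of the previous module
        if (current != 0)
          out.fill(current);
        current = id;
      }
    }
    if (current != 0)
      out.fill(current);  // flush the last module
  } catch (std::exception const& e) {
    std::cout << e.what() << std::endl;  // "... index value: 344794116"
  }
}

Making each module contiguous in the digi stream (for example by sorting, as proposed further down) removes the possibility of the duplicate flush.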

For reference, this type of crash occurred online again in runs 357759 and 357778.

(If it helps, we can get the problematic events from those runs, although there is already a recipe to reproduce the crash.)

missirol avatar Aug 22 '22 17:08 missirol

Hi, I was doing some investigation on top of what @ferencek did, and I've found that something weird happens for the detIDs that are failing. By modifying SiPixelRawToClusterGPUKernel.cu with

--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -332,6 +332,23 @@ namespace pixelgpudetails {
     //if (threadIdx.x==0) printf("Event: %u blockIdx.x: %u start: %u end: %u\n", eventno, blockIdx.x, begin, end);

     int32_t first = threadIdx.x + blockIdx.x * blockDim.x;
+    if(first==0)
+    {
+      for (uint32_t i = 0; i < wordCounter; i++) {
+        uint32_t ww = word[i];
+        uint8_t fedId = fedIds[i / 2];  // +1200;
+
+
+        uint32_t link = sipixelconstants::getLink(ww);  // Extract link
+        uint32_t roc = sipixelconstants::getROC(ww);    // Extract Roc in link
+        pixelgpudetails::DetIdGPU detId = getRawId(cablingMap, fedId, link, roc);
+
+        uint32_t rawId = detId.rawId;
+
+        printf("wordGPU %d %d %d \n",i, ww, rawId);
+
+      }
+    }
     for (int32_t iloop = first, nend = wordCounter; iloop < nend; iloop += blockDim.x * gridDim.x) {
       auto gIndex = iloop;
       xx[gIndex] = 0;

@@ -559,6 +577,10 @@ namespace pixelgpudetails {

       cudaCheck(
           cudaMemcpyAsync(word_d.get(), wordFed.word(), wordCounter * sizeof(uint32_t), cudaMemcpyDefault, stream));
+
+        for (unsigned int j = 0; j < wordCounter; j++) {
+          std::cout << "words "<< j << " - " << wordFed.word()[j] << std::endl;
+        }
       cudaCheck(cudaMemcpyAsync(
           fedId_d.get(), wordFed.fedId(), wordCounter * sizeof(uint8_t) / 2, cudaMemcpyDefault, stream));

what I see (I'm truncating the output here) is that the words on the CPU are OK, e.g.:

words 39691 - 3159063610
words 39692 - 3159063098
words 39693 - 3159062586
words 39694 - 3159062058
words 39695 - 3159630376
words 39696 - 3170333486

while on the GPU (the last number being the raw detID, and there is the offending one, 344794116):

wordGPU 39691 -1602121006 344659972
wordGPU 39692 -1602121573 344659972
wordGPU 39693 -1602120904 344659972
wordGPU 39694 -1537898446 344794116
wordGPU 39695 -1533446856 344794116
wordGPU 39696 -1524583238 344794116

where it seems that going above 2^31 is causing an overflow (even though these are uint32_t):

words 41699 - 2098796897 -> below
words 41700 - 2150718614 -> above
wordGPU 41699 2098796897 344258564 OK
wordGPU 41700 -2144248682 344258564 Overflow

It isn't normal, is it?

I'm trying to understand why this happens (it seems this could be the cause of the crashes, or at least be correlated with it), and I'm posting it here in case somebody sees something I don't.

AdrianoDee avatar Aug 24 '22 08:08 AdrianoDee

+        printf("wordGPU %d %d %d \n",i, ww, rawId);

Shouldn't that be

    printf("wordGPU %ud %ud %ud \n",i, ww, rawId);

I'm surprised the compiler didn't warn, modern compilers are generally pretty good at spotting implicit type conversions in printf.

dan131riley avatar Aug 24 '22 12:08 dan131riley
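
Stripped of the CUDA context, this is just the reinterpretation of a 32-bit unsigned value above 2^31 when it is printed through a signed %d conversion. A minimal standalone sketch (assuming a typical two's-complement platform; the value is the one from word 41700 above) reproduces the numbers quoted earlier:

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t word = 2150718614u;           // > 2^31, as for word 41700 above
  std::printf("with %%d: %d\n", word);   // typically prints -2144248682
  std::printf("with %%u: %u\n", word);   // prints 2150718614
}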

You are right. Sorry for the noise. As for the warning, I imagine that's either nvcc not being that smart, or me not noticing it.

AdrianoDee avatar Aug 24 '22 12:08 AdrianoDee

Using @AdrianoDee's code and printing the rawId from thread 0, you get something like:

wordGPU 34049 2816475237 344795140  <-- first appearance
wordGPU 34050 2760128559 344794116 
wordGPU 34051 2760522087 344794116 
wordGPU 34052 2760522751 344794116 
wordGPU 34053 2760967176 344794116 
wordGPU 34054 2760392292 344794116 
wordGPU 34055 2760032913 344794116 
wordGPU 34056 2760096912 344794116 
wordGPU 34057 2761623844 344794116 
wordGPU 34058 2761030128 344794116 
wordGPU 34059 2814380042 344795140  <-- second appearance
wordGPU 34060 2823233889 344794116 
wordGPU 34061 2823298389 344794116 
wordGPU 34062 2823826783 344794116 
wordGPU 34063 2824308583 344794116 
wordGPU 34064 2830242124 344794116 
wordGPU 34065 2830307924 344794116 
wordGPU 34066 2832440512 344794116 
wordGPU 34067 2832439930 344794116 
wordGPU 34068 2832440128 344794116 
wordGPU 34069 2832440769 344794116 
wordGPU 34070 2881488896 344921092 
wordGPU 34071 2897088372 344795140  <-- last appearance 
wordGPU 34072 2897153667 344795140 
wordGPU 34073 2897153312 344795140 
wordGPU 34074 2897153945 344795140 
wordGPU 34075 2897219168 344795140 
wordGPU 34076 2897218658 344795140 
wordGPU 34077 2897219009 344795140

The ~~third~~ fourth column being the rawId, correct?

Since this is printed sequentially inside the RawToDigi_kernel, shouldn't the rawIds already be contiguous? If so, where is this sorting done? For example, 344795140 is the first in the list above, then 344794116 comes in, and 344795140 re-appears at the end.

Excuse my ignorance, I haven't found any documentation on where this sorting is done.

nothingface0 avatar Sep 01 '22 16:09 nothingface0

I don't think there is any explicit sorting in the CUDA code path. The body of SiPixelRawToClusterCUDA::acquire() just reformats the incoming raw data for GPU consumption.

Could the out-of-order appearance of DetId's be a feature of the raw data itself? The legacy raw-to-digi code seems to handle that case https://github.com/cms-sw/cmssw/blob/73bb0f634d47e4f2477cf3c021d1ebac66725a56/EventFilter/SiPixelRawToDigi/src/PixelDataFormatter.cc#L205-L211

SiPixelDigisClustersFromSoA also has similar code https://github.com/cms-sw/cmssw/blob/73bb0f634d47e4f2477cf3c021d1ebac66725a56/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L166-L172 (is that warning message visible?) for digis (which use edm::DetSetVector). For clusters (which use edmNew::DetSetVector), I think the exception gets thrown here https://github.com/cms-sw/cmssw/blob/73bb0f634d47e4f2477cf3c021d1ebac66725a56/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L114 in case the detId already exists in the edmNew::DetSetVector.

So for the legacy digi-cluster chain, it seems to me the sorting of digis happens implicitly, by first filling the edm::DetSetVector for digis and then using that as the input data for the clusterizer.

makortel avatar Sep 01 '22 17:09 makortel

For the record, we continue to see this crash online sporadically. Runs where this happened:

357721
357759
357778
359718
359812

(I haven't investigated recent crashes, I hope the original reproducer is sufficient)

FYI: @cms-sw/hlt-l2

missirol avatar Oct 06 '22 07:10 missirol

I tested the configuration posted by @missirol above with a dummy fix in which we sort the digis in SiPixelDigisClustersFromSoA by the raw id:

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
index d36c345ecf0..d5d2ae8a0c6 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
@@ -17,6 +17,8 @@
 #include "Geometry/Records/interface/TrackerTopologyRcd.h"
 #include "Geometry/CommonTopologies/interface/SimplePixelTopology.h"
 
+#include <numeric>
+
 // local include(s)
 #include "PixelClusterizerBase.h"
 #include "SiPixelClusterThresholds.h"
@@ -143,7 +145,13 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
       spc.abort();
   };
 
-  for (uint32_t i = 0; i < nDigis; i++) {
+  std::vector<uint32_t> sortIdxs(nDigis);
+  std::iota(sortIdxs.begin(), sortIdxs.end(), 0);
+  std::sort(
+      sortIdxs.begin(), sortIdxs.end(), [&](int32_t const i, int32_t const j) { return digis.rawIdArr(i) > digis.rawIdArr(j); });
+
+  for (uint32_t id = 0; id < nDigis; id++) {
+    auto i = sortIdxs[id];

Everything ran smoothly. I don't see any obvious drawback (the sorting itself takes $0.71 \pm 0.12\,\text{ms}$ on a GPU machine at P5). If this makes sense I may quickly open a PR.

AdrianoDee avatar Oct 07 '22 10:10 AdrianoDee
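
The fix above is an index sort (an argsort): rather than moving the digi payload, it sorts an array of indices by the corresponding rawId and then iterates through that array, so all digis of a module become contiguous regardless of the sort direction. A standalone sketch of the pattern, with made-up values:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // rawId per digi, in stream order, with one module split into two runs
  std::vector<uint32_t> rawIdArr = {344659972, 344794116, 344795140, 344794116};

  // Fill an index array 0..N-1 and sort it by the rawId it points to.
  std::vector<uint32_t> sortIdxs(rawIdArr.size());
  std::iota(sortIdxs.begin(), sortIdxs.end(), 0);
  std::sort(sortIdxs.begin(), sortIdxs.end(),
            [&](uint32_t i, uint32_t j) { return rawIdArr[i] < rawIdArr[j]; });

  // Iterating via the sorted indices now visits each module contiguously.
  for (uint32_t i : sortIdxs)
    std::cout << i << " -> " << rawIdArr[i] << std::endl;
}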

@AdrianoDee the one thing that is not clear to me is whether there is anything that makes use of the digis being sorted on the GPU?

fwyzard avatar Oct 07 '22 12:10 fwyzard

The GPU code expects data from a single module to be contiguous; it does not expect the rawIds to be sorted. The sequence printed by @ferencek and @nothingface0 should give problems on the GPU: see https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h#68. So in the case at hand one should find module 344795140 three times and module 344794116 at least twice (as the printout by @ferencek shows): that is NOT sane.

VinInn avatar Oct 07 '22 12:10 VinInn

OK, then the sorting (at least by module id) should be added to the CPU code?

fwyzard avatar Oct 07 '22 13:10 fwyzard

Not sure what to do. The input format is not what is expected (duplicated modules?). It's a bit like duplicated pixels. The numbering of the clusters is not consistent in any case: there will be clusters with the same GPU index in those duplicated modules. In the event above, the first two occurrences of 344795140 seem suspicious, but they split 344794116 in two (or three). Clustering cannot be "perfect" in such a case (to say the least). I do not think any code on the GPU checks for duplicated modules, so, as long as the clusters are there, the rest of the reco should be OK. Still, I'm surprised that after the sorting there is no mess in the code that builds the legacy clusters (one needs to check how the index-in-module is converted into an index-in-event).

VinInn avatar Oct 07 '22 13:10 VinInn

Here https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h#386 something for sure goes "wrong": the last one to arrive is served...

VinInn avatar Oct 07 '22 13:10 VinInn

In my opinion, if one switches on GPU_DEBUG, some assert will fail. As is, the RecHits are "maybe" messed up in the affected modules (some missing, some "added"). A crash on the GPU is unlikely, though not excluded.

VinInn avatar Oct 07 '22 13:10 VinInn

Actually, no, I do not get an assertion failure:

decoding 42236 digis.
CUDA countModules kernel launch with 165 blocks of 256 threads
CUDA findClus kernel launch with 1856 blocks of 384 threads
start clusterizer for module 701 in block 58
start clusterizer for module 401 in block 54
histo size 18
histo size 3
start clusterizer for module 1101 in block 307
start clusterizer for module 901 in block 337
# loops 3
# loops 5
histo size 11
start clusterizer for module 1401 in block 529
histo size 5
start clusterizer for module 801 in block 556
1 clusters in module 401
5 clusters in module 701
# loops 3
# loops 3
histo size 8
start clusterizer for module 201 in block 764
histo size 11
start clusterizer for module 1301 in block 832
4 clusters in module 1101
2 clusters in module 901
# loops 3
# loops 3
histo size 24
start clusterizer for module 501 in block 997
start clusterizer for module 301 in block 1033
histo size 18
2 clusters in module 1401
1 clusters in module 801
start clusterizer for module 1801 in block 1137
start clusterizer for module 1601 in block 1154
# loops 3
start clusterizer for module 1201 in block 1194
histo size 20
# loops 3
histo size 13
start clusterizer for module 101 in block 1259
5 clusters in module 201
histo size 21
start clusterizer for module 1001 in block 1350
histo size 21
14 clusters in module 1301
# loops 3
histo size 21
# loops 3
start clusterizer for module 1 in block 1466
histo size 50
5 clusters in module 501
# loops 5
histo size 14
6 clusters in module 301
# loops 3
# loops 11
start clusterizer for module 601 in block 1597
start clusterizer for module 1501 in block 1594
start clusterizer for module 1701 in block 1640
histo size 129
# loops 5
7 clusters in module 1801
# loops 3
9 clusters in module 1201
7 clusters in module 1601
histo size 9
histo size 10
histo size 19
10 clusters in module 101
# loops 7
4 clusters in module 1001
# loops 3
# loops 3
# loops 3
19 clusters in module 1
2 clusters in module 601
5 clusters in module 1501
10 clusters in module 1701
start cluster charge cut for module 701 in block 58
start cluster charge cut for module 401 in block 54
start cluster charge cut for module 1101 in block 307
start cluster charge cut for module 901 in block 337
start cluster charge cut for module 1401 in block 529
start cluster charge cut for module 801 in block 556
start cluster charge cut for module 201 in block 764
start cluster charge cut for module 1301 in block 832
start cluster charge cut for module 501 in block 997
start cluster charge cut for module 301 in block 1033
start cluster charge cut for module 1801 in block 1137
start cluster charge cut for module 1601 in block 1154
start cluster charge cut for module 1201 in block 1194
start cluster charge cut for module 101 in block 1259
start cluster charge cut for module 1001 in block 1350
start cluster charge cut for module 1 in block 1466
start cluster charge cut for module 1501 in block 1594
start cluster charge cut for module 601 in block 1597
start cluster charge cut for module 1701 in block 1640
moduleStart 1856 9956
moduleStart 96 2136
moduleStart 1184 5806
launching getHits kernel for 1709 blocks
hitbuilder: 1 clusters in module 401. will write at 3884
hitbuilder: 5 clusters in module 701. will write at 4842
hitbuilder: 4 clusters in module 1101. will write at 5653
hitbuilder: 2 clusters in module 901. will write at 5222
hitbuilder: 1 clusters in module 801. will write at 5045
hitbuilder: 2 clusters in module 1401. will write at 7184
hitbuilder: 14 clusters in module 1301. will write at 6489
hitbuilder: 5 clusters in module 201. will write at 2763
hitbuilder: 5 clusters in module 501. will write at 4185
hitbuilder: 6 clusters in module 301. will write at 3419
hitbuilder: 6 clusters in module 1801. will write at 9605
hitbuilder: 7 clusters in module 1601. will write at 8449
hitbuilder: 10 clusters in module 101. will write at 2156
hitbuilder: 9 clusters in module 1201. will write at 5959
hitbuilder: 4 clusters in module 1001. will write at 5428
hitbuilder: 17 clusters in module 1. will write at 13
hitbuilder: 2 clusters in module 601. will write at 4541
hitbuilder: 4 clusters in module 1501. will write at 7865
hitbuilder: 9 clusters in module 1701. will write at 9014
LayerStart 0/10 at module 0: 0
LayerStart 1/10 at module 96: 2136
LayerStart 2/10 at module 320: 3548
LayerStart 3/10 at module 672: 4787
LayerStart 4/10 at module 1184: 5806
LayerStart 5/10 at module 1296: 6433
LayerStart 6/10 at module 1408: 7209
LayerStart 7/10 at module 1520: 7928
LayerStart 8/10 at module 1632: 8539
LayerStart 9/10 at module 1744: 9225
LayerStart 10/10 at module 1856: 9956
Allocation for tuple building. N hits 9956
finished building pixel tracks on GPU

only the failure in the CPU conversion:

%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  07-Oct-2022 17:39:59 CEST Run: 357271 Event: 2627443577
cluster below charge Threshold Layer/DetId/clusId 0/344794116/1 size/charge 1/3741
%MSG
%MSG-w SiPixelFrameConverter:  SiPixelRawToDigi:hltSiPixelDigisRegForDisplaced  07-Oct-2022 17:39:59 CEST Run: 357271 Event: 2627443577
Map shows no fed=1326, link=41, roc=9
%MSG
%MSG-w SiPixelFrameConverter:  SiPixelRawToDigi:hltSiPixelDigisRegForDisplaced  07-Oct-2022 17:39:59 CEST Run: 357271 Event: 2627443577
Map shows no fed=1326, link=41, roc=10
%MSG
%MSG-w SiPixelFrameConverter:  SiPixelRawToDigi:hltSiPixelDigisRegForDisplaced  07-Oct-2022 17:39:59 CEST Run: 357271 Event: 2627443577
Map shows no fed=1326, link=41, roc=12
%MSG
%MSG-w SiPixelFrameConverter:  SiPixelRawToDigi:hltSiPixelDigisRegForDisplaced  07-Oct-2022 17:39:59 CEST Run: 357271 Event: 2627443577
Map shows no fed=1326, link=41, roc=16
%MSG
%MSG-w SiPixelFrameConverter:  SiPixelRawToDigi:hltSiPixelDigisRegForDisplaced  07-Oct-2022 17:39:59 CEST Run: 357271 Event: 2627443577
Map shows no fed=1326, link=41, roc=17
%MSG
----- Begin Fatal Exception 07-Oct-2022 17:39:59 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 357271 lumi: 1351 event: 2627443577 stream: 0
   [1] Running path 'DQM_PixelReconstruction_v4'
   [2] Calling method for module SiPixelDigisClustersFromSoA/'hltSiPixelClustersFromSoA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 344794116
----- End Fatal Exception -------------------------------------------------

fwyzard avatar Oct 07 '22 16:10 fwyzard

OK, let's protect the conversion from SoA and go on (the rechits module most probably did not run yet).

VinInn avatar Oct 07 '22 16:10 VinInn

Adding the sort, I don't get any other messages.

fwyzard avatar Oct 07 '22 17:10 fwyzard

@ferencek @nothingface0 while sorting the data technically fixes the crash, it is not clear to us whether the resulting reconstruction is correct, and in fact whether the input data are valid or not. Could you comment on whether it is expected that some modules are split across the data stream?

fwyzard avatar Oct 09 '22 18:10 fwyzard

assign trk-dpg

fwyzard avatar Oct 09 '22 18:10 fwyzard

New categories assigned: trk-dpg

@connorpa,@sroychow you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Oct 09 '22 18:10 cmsbuild

Hi @fwyzard @VinInn and all. Do you still need raw files from recent runs where this error appeared, for testing purposes? At the moment you can find run 359812 on Hilton: /store/error_stream/run359812. We had another crash from SiPixelDigisClustersFromSoA in run 360019 yesterday, which we haven't copied over yet. Would that still be useful (and the same for future runs)?

trocino avatar Oct 09 '22 18:10 trocino

Hi @trocino, I think it would be useful to keep these files, as they can help us determine whether the data is unexpected but good, or the result of some data corruption (at least until the Pixel experts weigh in, one way or the other).

fwyzard avatar Oct 09 '22 19:10 fwyzard

@ferencek @nothingface0 while sorting the data technically fixes the crash, it is not clear to us whether the resulting reconstruction is correct, and in fact whether the input data are valid or not. Could you comment on whether it is expected that some modules are split across the data stream?

I would naively think there is something wrong with the input data and that those non-contiguous module IDs are due to spurious pixel hits (possibly the mechanism by which they appear is similar to the one causing duplicate pixels within the same module). However, as @makortel pointed out, digis and clusters are stored in different types of DetSetVector collection, and the one used for digis (edm::DetSetVector) is immune to non-contiguous module IDs through its find_or_insert(rawId) method. So I wonder if there was a reason for edm::DetSetVector to be able to handle non-contiguous module IDs when being filled (internally, digis are stored in a contiguous fashion, though). On the other hand, I can see why the version used for clusters (edmNew::DetSetVector) cannot handle non-contiguous module IDs: clustering is done module by module when looping over the digi collection, so a single module is processed by the clusterizer only once and there is simply no need to support non-contiguous module IDs. However, this now causes problems for SiPixelDigisClustersFromSoA, where both collections are filled in parallel (in the legacy code the digis always come before the clusters).

In summary, if there is some problem with the input data, the legacy code is completely oblivious to it. At the same time, I do not see a robust way to detect and fix the problem. The problem is sufficiently rare that it has essentially no impact on physics, so it's probably just a matter of handling it gracefully in the code. Maybe @dkotlins has some additional insight.

Regarding the proposed sorting, wouldn't it be better to do it somewhere upstream, perhaps already in SiPixelRawToClusterCUDA?

ferencek avatar Oct 09 '22 22:10 ferencek
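
The tolerant behaviour of the digi path can be pictured with a simplified analogy in plain C++ (not the actual CMSSW containers): a find_or_insert-style fill looks up the entry for a rawId, creates it if needed, and appends to it, so a module that re-appears later in the stream is silently merged rather than rejected.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

int main() {
  // (rawId, adc) pairs in stream order; rawId 344794116 appears in two separate runs
  std::vector<std::pair<uint32_t, int>> digis = {
      {344659972, 10}, {344794116, 20}, {344795140, 30}, {344794116, 40}};

  // find_or_insert-style filling: the entry for a rawId is created on first use
  // and simply extended on any later appearance, with no complaint.
  std::map<uint32_t, std::vector<int>> digisByModule;
  for (auto const& [rawId, adc] : digis)
    digisByModule[rawId].push_back(adc);

  std::cout << "digis stored for 344794116: " << digisByModule[344794116].size() << std::endl;  // 2
}

A one-shot fill of the kind used for the clusters, in contrast, has no second chance for a rawId it has already closed, which is where the exception above comes from.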

Detecting this on the GPU is trivial: it is enough to count the number of appearances of a given module at https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h#87

What to do next is unclear to me. One could invalidate "single" spurious pixels and re-count (the tricky part is the definition of "spurious"). Sorting on the GPU is tough.

VinInn avatar Oct 10 '22 07:10 VinInn
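
A host-side sketch of that counting idea (purely illustrative; the real check would have to live in the GPU clustering code, e.g. alongside the countModules-style counting referenced above): count, for each rawId, how many contiguous runs it forms in the digi stream, and flag any id that forms more than one.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

int main() {
  // rawIds of consecutive digis in stream order, following the pattern seen in the printouts
  std::vector<uint32_t> rawIds = {344659972, 344794116, 344795140, 344794116, 344795140};

  // Count, per rawId, how many contiguous runs it forms in the stream.
  std::map<uint32_t, int> runsPerModule;
  uint32_t previous = 0;
  for (uint32_t id : rawIds) {
    if (id != previous)  // a new contiguous run starts here
      ++runsPerModule[id];
    previous = id;
  }

  // Any module with more than one run is split across the input stream.
  for (auto const& [id, runs] : runsPerModule)
    if (runs > 1)
      std::cout << "module " << id << " present " << runs << " times in the input" << std::endl;
}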