cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

[RNTUPLE_X] RelVal 139.005: object of class vector<edm::StoredProductProvenance> read too few bytes

Open iarspider opened this issue 6 months ago • 5 comments

In CMSSW_15_1_RNTUPLE_X_2025-07-24-1100, RelVal 139.005 failed on step 2:

Begin processing the 29987th record. Run 346512, Event 5233680, LumiSection 6 on stream 0 at 24-Jun-2025 22:04:59.787 CEST
----- Begin Fatal Exception 24-Jun-2025 22:04:59 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Reading branch EventProductProvenance
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<edm::StoredProductProvenance> read too few bytes: 8 instead of 288

----- End Fatal Exception -------------------------------------------------

full log

iarspider avatar Jun 25 '25 09:06 iarspider

cms-bot internal usage

cmsbuild avatar Jun 25 '25 09:06 cmsbuild

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Jun 25 '25 09:06 cmsbuild

assign core

iarspider avatar Jun 25 '25 09:06 iarspider

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Jun 25 '25 09:06 cmsbuild

With -j1 step2 passes.

iarspider avatar Jun 25 '25 09:06 iarspider

In CMSSW_15_1_RNTUPLE_X_2025, two relvals failed with this error: Relval 139.005 step 2, Relval 1049.0 step 2

iarspider avatar Jun 27 '25 08:06 iarspider

With -j1 step2 passes.

And with -j2, surprisingly. But -j3, -j4 etc fails.

iarspider avatar Jun 27 '25 11:06 iarspider

This is almost certainly a threading problem within root

Dr15Jones avatar Jun 27 '25 12:06 Dr15Jones

assign root

Dr15Jones avatar Jun 27 '25 12:06 Dr15Jones

type root

Dr15Jones avatar Jun 27 '25 12:06 Dr15Jones

@pcanal Any thoughts? It looks like a threading problem (doesn't reproduce on 1 thread), and on 4 threads in all the past 5 RNTUPLE_X IBs we have logs for the job failed after processing 21k - 370k events (depending on the day).

makortel avatar Jul 07 '25 20:07 makortel

The meta description does indeed point to a thread problem. The low level symptom (wrong bytecount) unfortunately does not help (it could be that the writing was wrong or that the ROOT meta-data was wrong during reading or that the TFile was accessed 'wrongly').

Is there a reproducer with a debug build of ROOT>

pcanal avatar Jul 07 '25 20:07 pcanal

a debug build of ROOT

We need to build one now that the RNTUPLE_X IB is built with production setup

makortel avatar Jul 07 '25 21:07 makortel

The meta description does indeed point to a thread problem. The low level symptom (wrong bytecount) unfortunately does not help (it could be that the writing was wrong or that the ROOT meta-data was wrong during reading or that the TFile was accessed 'wrongly').

Is there a reproducer with a debug build of ROOT>

@pcanal Here is a recipe on cmsdev42

/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9965/47122/install.sh
cd CMSSW_15_1_RNTUPLE_X_2025-07-06-2300/src
cmsenv
cmsRun /build/mkortela/debug/issue48400/CMSSW_15_1_RNTUPLE_X_2025-07-06-2300/src/reproducer_cfg.py

(although for me the ROOT debug build didn't reproduce the error, possibly because of being slower than the production build and thus hiding the problem)

makortel avatar Jul 09 '25 18:07 makortel

Indeed, I can't reproduce it with the debug build :(

pcanal avatar Jul 11 '25 18:07 pcanal

Can I have a recipe with the build/config that reproduces it?

pcanal avatar Jul 11 '25 18:07 pcanal

Can I have a recipe with the build/config that reproduces it?

Still on cmsdev42

cmsrel CMSSW_15_1_RNTUPLE_X_2025-07-06-2300
cd CMSSW_15_1_RNTUPLE_X_2025-07-06-2300/src
cmsenv
cmsRun /build/mkortela/debug/issue48400/CMSSW_15_1_RNTUPLE_X_2025-07-06-2300/src/reproducer_cfg.py

makortel avatar Jul 11 '25 20:07 makortel

We could also rebuild the debug build with RelWithDebInfo instead of Debug if that would help.

makortel avatar Jul 11 '25 20:07 makortel

The bug is hiding from me :(. No luck so far seeing it in a debugger with the second recipe either.

pcanal avatar Jul 15 '25 21:07 pcanal

Still happens: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_15_1_RNTUPLE_X_2025-08-13-2300/pyRelValMatrixLogs/run/139.005_AlCaPhiSym2021/step2_AlCaPhiSym2021.log#/377844-377844

----- Begin Fatal Exception 14-Aug-2025 06:18:56 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Reading branch EventProductProvenance
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<edm::StoredProductProvenance> read too few bytes: 8 instead of 288

----- End Fatal Exception -------------------------------------------------

iarspider avatar Aug 14 '25 09:08 iarspider

Still happens: http://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc13/CMSSW_16_0_RNTUPLE_X_2025-10-08-2300/pyRelValMatrixLogs/run/139.005_AlCaPhiSym2021/step2_AlCaPhiSym2021.log#/102585-102585

----- Begin Fatal Exception 09-Oct-2025 04:05:31 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Reading branch EventProductProvenance
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<edm::StoredProductProvenance> read too few bytes: 8 instead of 288

----- End Fatal Exception -------------------------------------------------

iarspider avatar Oct 10 '25 07:10 iarspider