
Broken streaming of vector of enum with underlying type other than int

Open ktf opened this issue 1 year ago • 22 comments

Check duplicate issues.

  • [x] Checked for duplicates

Description

I need help understanding an issue we see when running on Linux on ARM and reading a file that was serialised on x86. Note that this platform is peculiar because plain char (without a signedness specifier) is unsigned rather than signed (the signedness of char is implementation-defined in the standard).

This matters because mPadSubset, which appears below, is an enum PadSubset : char. Running under valgrind, the issue appears as dumped below.

What puzzles me and what I think is the culprit of the segmentation fault is the line:

[1965517:tpc-tracker]:    i= 2, mPadSubset      type= 23, offset= 56, len=2, method=0 [optimized]

as I would have expected it to be len=1. Can you explain to me what is going on?

[1965517:tpc-tracker]: ====>Rebuilding TStreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version: 1
[1965517:tpc-tracker]: Creating StreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version: 2
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: StreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version=2, checksum=0x93700773
[1965517:tpc-tracker]:   string         mName           offset=  0 type=300 ,stl=365, ctype=365, name of the object
[1965517:tpc-tracker]:   vector<o2::tpc::CalArray<o2::tpc::PadFlags> > mData           offset= 32 type=300 ,stl=1, ctype=61, internal CalArrays
[1965517:tpc-tracker]:   o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Pad subset granularity
[1965517:tpc-tracker]:    i= 0, mName           type=300, offset=  0, len=1, method=0
[1965517:tpc-tracker]:    i= 1, mData           type=300, offset= 32, len=1, method=0
[1965517:tpc-tracker]:    i= 2, mPadSubset      type=  3, offset= 56, len=1, method=0
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: StreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version=1, checksum=0x93700773
[1965517:tpc-tracker]:   string         mName           offset=  0 type=300 ,stl=365, ctype=365, name of the object
[1965517:tpc-tracker]:   vector<o2::tpc::CalArray<o2::tpc::PadFlags> > mData           offset= 32 type=300 ,stl=1, ctype=61, internal CalArrays
[1965517:tpc-tracker]:   o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Pad subset granularity
[1965517:tpc-tracker]:    i= 0, mName           type=300, offset=  0, len=1, method=0
[1965517:tpc-tracker]:    i= 1, mData           type=300, offset= 32, len=1, method=0
[1965517:tpc-tracker]:    i= 2, mPadSubset      type=  3, offset= 56, len=1, method=0
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: ====>Rebuilding TStreamerInfo for class: o2::tpc::CalArray<o2::tpc::PadFlags>, version: 1
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: StreamerInfo for class: o2::tpc::CalArray<o2::tpc::PadFlags>, version=1, checksum=0xb03d18c2
[1965517:tpc-tracker]:   string         mName           offset=  0 type=300 ,stl=365, ctype=365,
[1965517:tpc-tracker]:   vector<o2::tpc::PadFlags> mData           offset= 32 type=300 ,stl=1, ctype=3, calibration data
[1965517:tpc-tracker]:   o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Subset type
[1965517:tpc-tracker]:   int            mPadSubsetNumber offset= 60 type= 3 Number of the pad subset, e.g. ROC 0 is IROC A00
[1965517:tpc-tracker]:    i= 0, mName           type=300, offset=  0, len=1, method=0
[1965517:tpc-tracker]:    i= 1, mData           type=300, offset= 32, len=1, method=0
[1965517:tpc-tracker]:    i= 2, mPadSubset      type= 23, offset= 56, len=2, method=0 [optimized]
[1965517:tpc-tracker]: ==1965517== Invalid write of size 1
[1965517:tpc-tracker]: ==1965517==    at 0xF36E7A0: frombuf (Bytes.h:313)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7A0: frombuf (Bytes.h:442)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7A0: ReadFastArray (TBufferFile.cxx:1338)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7A0: TBufferFile::ReadFastArray(int*, int) (TBufferFile.cxx:1327)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E580B: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1183)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) (TBufferFile.cxx:1616)
[1965517:tpc-tracker]: ==1965517==    by 0xF58C84B: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (TStreamerInfoReadBuffer.cxx:1297)
[1965517:tpc-tracker]: ==1965517==    by 0xF45B81F: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1883)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: operator() (TStreamerInfoActions.h:131)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (TBufferFile.cxx:3736)
[1965517:tpc-tracker]: ==1965517==    by 0xF482A0F: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short) (TStreamerInfoActions.cxx:1155)
[1965517:tpc-tracker]: ==1965517==    by 0xF482C4F: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1405)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: operator() (TStreamerInfoActions.h:123)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: ApplySequence (TBufferFile.cxx:3670)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) (TBufferFile.cxx:3661)
[1965517:tpc-tracker]: ==1965517==    by 0xF376CEB: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (TBufferFile.cxx:3598)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: TKey::ReadObjectAny(TClass const*) (TKey.cxx:1120)
[1965517:tpc-tracker]: ==1965517==    by 0xF3B82E3: TDirectoryFile::GetObjectChecked(char const*, TClass const*) (TDirectoryFile.cxx:1111)
[1965517:tpc-tracker]: ==1965517==  Address 0x153fbb80 is 0 bytes after a block of size 1,440 alloc'd
[1965517:tpc-tracker]: ==1965517==    at 0x4868908: operator new(unsigned long) (vg_replace_malloc.c:483)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (new_allocator.h:137)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (allocator.h:188)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (alloc_traits.h:464)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:378)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:375)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: std::vector<o2::tpc::PadFlags, std::allocator<o2::tpc::PadFlags> >::_M_default_append(unsigned long) (vector.tcc:650)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E5797: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1176)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) (TBufferFile.cxx:1616)
[1965517:tpc-tracker]: ==1965517==    by 0xF58C84B: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (TStreamerInfoReadBuffer.cxx:1297)
[1965517:tpc-tracker]: ==1965517==    by 0xF45B81F: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1883)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: operator() (TStreamerInfoActions.h:131)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (TBufferFile.cxx:3736)
[1965517:tpc-tracker]: ==1965517==    by 0xF482A0F: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short) (TStreamerInfoActions.cxx:1155)
[1965517:tpc-tracker]: ==1965517==    by 0xF482C4F: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1405)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: operator() (TStreamerInfoActions.h:123)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: ApplySequence (TBufferFile.cxx:3670)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) (TBufferFile.cxx:3661)
[1965517:tpc-tracker]: ==1965517==    by 0xF376CEB: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (TBufferFile.cxx:3598)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: TKey::ReadObjectAny(TClass const*) (TKey.cxx:1120)
[1965517:tpc-tracker]: ==1965517==
[1965517:tpc-tracker]: ==1965517== Invalid write of size 1
[1965517:tpc-tracker]: ==1965517==    at 0xF36E7AC: frombuf (Bytes.h:314)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7AC: frombuf (Bytes.h:442)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7AC: ReadFastArray (TBufferFile.cxx:1338)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7AC: TBufferFile::ReadFastArray(int*, int) (TBufferFile.cxx:1327)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E580B: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1183)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) (TBufferFile.cxx:1616)
[1965517:tpc-tracker]: ==1965517==    by 0xF58C84B: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (TStreamerInfoReadBuffer.cxx:1297)
[1965517:tpc-tracker]: ==1965517==    by 0xF45B81F: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1883)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: operator() (TStreamerInfoActions.h:131)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (TBufferFile.cxx:3736)
[1965517:tpc-tracker]: ==1965517==    by 0xF482A0F: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short) (TStreamerInfoActions.cxx:1155)
[1965517:tpc-tracker]: ==1965517==    by 0xF482C4F: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1405)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: operator() (TStreamerInfoActions.h:123)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: ApplySequence (TBufferFile.cxx:3670)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) (TBufferFile.cxx:3661)
[1965517:tpc-tracker]: ==1965517==    by 0xF376CEB: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (TBufferFile.cxx:3598)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: TKey::ReadObjectAny(TClass const*) (TKey.cxx:1120)
[1965517:tpc-tracker]: ==1965517==    by 0xF3B82E3: TDirectoryFile::GetObjectChecked(char const*, TClass const*) (TDirectoryFile.cxx:1111)
[1965517:tpc-tracker]: ==1965517==  Address 0x153fbb81 is 1 bytes after a block of size 1,440 alloc'd
[1965517:tpc-tracker]: ==1965517==    at 0x4868908: operator new(unsigned long) (vg_replace_malloc.c:483)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (new_allocator.h:137)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (allocator.h:188)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (alloc_traits.h:464)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:378)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:375)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: std::vector<o2::tpc::PadFlags, std::allocator<o2::tpc::PadFlags> >::_M_default_append(unsigned long) (vector.tcc:650)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E5797: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1176)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)

Reproducer

I do not have one which does not involve running ALICE reconstruction on ARM.

ROOT version

6.32.02.

Installation method

aliBuild

Operating system

ALMA Linux 9 on ARM64 (Ampere Altra)

Additional context

No response

ktf avatar Aug 26 '24 20:08 ktf

Can you give us a bit more information? What would be useful, if possible:

  • The stacktrace from the segfault
  • A description on how to set up the corresponding ALICE environment so that we can look at the dictionaries and headers
  • The ROOT file that caused the crash

Is it confirmed that the same data serialized on ARM does not cause a crash?

jblomer avatar Aug 27 '24 09:08 jblomer

For the file:

https://cernbox.cern.ch/s/MXkLwJLm61rckhj

I cannot confirm if the same data serialised on ARM does not cause a crash.

ktf avatar Aug 27 '24 09:08 ktf

[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: handle_crash(int)
[1064949:tpc-tracker]:     linux-vdso.so.1:     ?? ??:0
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ReadFastArray(int*, int)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TKey::ReadObjectAny(TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TDirectoryFile::GetObjectChecked(char const*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataRefUtils::decodeCCDB(o2::framework::DataRef const&, std::type_info const&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: decltype(auto) o2::framework::InputRecord::get<o2::tpc::CalDet<o2::tpc::PadFlags>*, char const*>(char const*, int) const
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: bool o2::gpu::GPURecoWorkflowSpec::fetchCalibsCCDBTPC<o2::gpu::GPUCalibObjectsTemplate<o2::gpu::ConstPtr> >(o2::framework::ProcessingContext&, o2::gpu::GPUCalibObjectsTemplate<o2::gpu::ConstPtr>&, o2::gpu::GPURecoWorkflowSpec::calibObjectStruct&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: o2::gpu::GPURecoWorkflowSpec::doCalibUpdates(o2::framework::ProcessingContext&, o2::gpu::GPURecoWorkflowSpec::calibObjectStruct&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: o2::gpu::GPURecoWorkflowSpec::run(o2::framework::ProcessingContext&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so:     ?? ??:0
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::tryDispatchComputation(o2::framework::ServiceRegistryRef, std::vector<o2::framework::DataRelayer::RecordAction, std::allocator<o2::framework::DataRelayer::RecordAction> >&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::doRun(o2::framework::ServiceRegistryRef)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::run_callback(uv_work_s*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::Run()
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/FairMQ/v1.8.4-2/lib/libfairmq.so.1.8.4: fair::mq::Device::RunWrapper()
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/FairMQ/v1.8.4-2/lib/libfairmq.so.1.8.4: boost::detail::function::void_function_obj_invoker1<std::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(boost::detail::function::function_buffer&, fair::mq::State)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/FairMQ/v1.8.4-2/lib/libfairmq.so.1.8.4: boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State)

The above is one of the stacktraces. It actually dies in different ways; most likely there is some memory corruption going on...

ktf avatar Aug 27 '24 10:08 ktf

For the ALICE environment, the easiest is probably sitting together. It's on a custom machine in my private area.

ktf avatar Aug 27 '24 10:08 ktf

Thanks. I'm not at CERN today but getting started with the information.

jblomer avatar Aug 27 '24 11:08 jblomer

(Side note: MakeProject does not reconstruct the enums with the correct underlying type)

jblomer avatar Aug 27 '24 11:08 jblomer

Another stacktrace which seems to be related to this is:

[1500611:internal-dpl-ccdb-backend]: Executable is /root/src/sw/slc9_aarch64/O2/dev-local1/bin/o2-tpc-reco-workflow
[1500611:internal-dpl-ccdb-backend]:     linux-vdso.so.1:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     [0xfff3cae9b014]:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     [0xfff3cae9d7f0]:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: TCling::AutoParseImplRecurse(char const*, bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: TCling::AutoParse(char const*)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: TClingLookupHelper__AutoParse(char const*)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: ROOT::TMetaUtils::TClingLookupHelper::GetPartiallyDesugaredNameWithScopeHandling(std::__cxx11::
basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCore.so.6.32: TClassEdit::GetNormalizedName(std::__cxx11::basic_string<char, std::char_traits<char>, std:
:allocator<char> >&, std::basic_string_view<char, std::char_traits<char> >)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCore.so.6.32: TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TStreamerInfo::BuildCheck(TFile*, bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TFile::ReadStreamerInfo()
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TFile::Init(bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TMemFile::TMemFile(char const*, char*, long long, char const*, char const*, int, long long)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::loadFileToMemory(std::vector<char, boost::container::pmr::polymorphic_allocator<char
> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basi
c_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_s
tring<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >*) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::getFromSnapshot(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::
allocator<char> > const&, long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >,
 std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > con
st, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<char, boost::con
tainer::pmr::polymorphic_allocator<char> >&, int&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::navigateSourcesAndLoadFile(o2::ccdb::CcdbApi::RequestContext&, int&, unsigned long*)
 const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::vectoredLoadFileToMemory(std::vector<o2::ccdb::CcdbApi::RequestContext, std::allocat
or<o2::ccdb::CcdbApi::RequestContext> >&) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::loadFileToMemory(std::vector<char, boost::container::pmr::polymorphic_allocator<char
> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::bas$

ktf avatar Aug 27 '24 12:08 ktf

Interestingly enough, the actual array returned by backtrace can be decoded by GDB to:

$4 = {0xffffac196fb0 <handle_crash(int)+48>, 0xffffb2f727f0 <__kernel_rt_sigreturn>, 0xfff3ea6f5014, 0xfff3ea6f77f0,
  0xffff9e97b198 <(anonymous namespace)::GenericLLVMIRPlatformSupport::initialize(llvm::orc::JITDylib&)+2392>,
  0xffff9d4b0de0 <cling::IncrementalExecutor::runStaticInitializersOnce(cling::Transaction&)+272>, 0xffff9d435f78 <cling::Interpreter::executeTransaction(cling::Transaction&)+40>,
  0xffff9d4c0e30 <cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool)+768>,
  0xffff9d4c398c <cling::IncrementalParser::Compile(llvm::StringRef, cling::CompilationOptions const&)+108>,
  0xffff9d433d80 <cling::Interpreter::parseForModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+176>, 0xffff9d36b5f8
     <ExecAutoParse(char const*, Bool_t, cling::Interpreter*)+568>, 0xffff9d36cf48 <TCling::AutoParseImplRecurse(char const*, bool)+1400>, 0xffff9d374de4 <TCling::AutoParse(char const*)+340>,
  0xffff9d355204 <TClingLookupHelper__AutoParse(char const*)+36>, 0xffff9d2c8b44
     <ROOT::TMetaUtils::TClingLookupHelper::GetPartiallyDesugaredNameWithScopeHandling(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool)+116>, 0xffffa7acf42c
     <TClassEdit::GetNormalizedName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::basic_string_view<char, std::char_traits<char> >)+540>, 0xffffa7aeab58
     <TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long)+1144>, 0xffffa7f852b4 <TStreamerInfo::BuildCheck(TFile*, bool)+148>, 0xffffa7f4751c <TFile::ReadStreamerInfo()+700>,
  0xffffa7f4fc40 <TFile::Init(bool)+1056>, 0xffffa7f74a60 <TMemFile::TMemFile(char const*, char*, long long, char const*, char const*, int, long long)+268>, 0xffffac4515b4
     <o2::ccdb::CcdbApi::loadFileToMemory(std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >*) const+900>,
  0xffffac451f68 <o2::ccdb::CcdbApi::getFromSnapshot(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, int&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+936>,
  0xffffac452100 <o2::ccdb::CcdbApi::navigateSourcesAndLoadFile(o2::ccdb::CcdbApi::RequestContext&, int&, unsigned long*) const+192>,
  0xffffac4524d0 <o2::ccdb::CcdbApi::vectoredLoadFileToMemory(std::vector<o2::ccdb::CcdbApi::RequestContext, std::allocator<o2::ccdb::CcdbApi::RequestContext> >&) const+240>,

ktf avatar Aug 27 '24 12:08 ktf

Some more points gathered during a debug session:

  • The problem appears only on ARM/Linux, not on ARM/Mac
  • The streamer info output

    [1965517:tpc-tracker]:    i= 2, mPadSubset      type= 23, offset= 56, len=2, method=0 [optimized]

    does not seem to indicate a problem, because the same list of streamer elements also contains the expected

    o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Subset type
  • If the class o2::tpc::CalArray<o2::tpc::PadFlags> is added to the dictionaries (Linkdef), the stacktrace changes and the crash becomes reproducible. In this case, there is an error writing beyond vector boundaries.
  • The next step is to try to reproduce the crash with a debug build of ROOT

jblomer avatar Aug 28 '24 14:08 jblomer

Further debugging revealed a deeper issue that seems to surface on ARM/Linux only by chance:

Writing or reading a vector of enums goes through the collection proxy. The collection proxy will use WriteFastArray / ReadFastArray of kInt_t, neglecting the actual underlying type of the enum. At some point in the read/write chain, this causes memory reads/writes beyond the limits of a memory array.

jblomer avatar Aug 29 '24 14:08 jblomer

I think the cause is https://github.com/root-project/root/blob/master/io/io/src/TGenCollectionProxy.cxx#L404 (and similar lines further down), which hard-code the enum's underlying type to int.

When fixing, I think we need to take care of what happens to files already written out with the wrong enum width.

jblomer avatar Aug 29 '24 22:08 jblomer

Do I understand correctly this affects only scoped enums within a vector? Can I simply fix it on my side by moving to enum class Foo : int {}?

ktf avatar Aug 30 '24 07:08 ktf

Although: I'm not exactly sure whether existing files that were serialized with a shorter enum are read back correctly. I think yes, but that needs to be tested.

jblomer avatar Aug 30 '24 08:08 jblomer

> Although: I'm not exactly sure whether existing files that were serialized with a shorter enum are read back correctly. I think yes, but that needs to be tested.

This I can try on my side.

ktf avatar Aug 30 '24 08:08 ktf

I'm attaching a minimal reproducer.

minimalTestVectorOfEnums.tar.gz

This test returns (wrongly)

Size of PadFlags: 2
Enum underlying type: 12
mFlags size before writing: 2
mFlags size after reading: 4
0 0 23824 0

With a patch to TGenCollectionProxy::Value, the result is correct:

Size of PadFlags: 2
Enum underlying type: 12
mFlags size before writing: 2
mFlags size after reading: 2
0 0

I think the next steps should be discussed with @pcanal. In particular:

  • What about the cases when we only have an emulated enum? With this patch in place, we can no longer assume that this will be an int on disk.
  • In general, how do we correctly handle vectors of enums with underlying types other than int that are on disk, before and after the patch?

jblomer avatar Aug 30 '24 08:08 jblomer

AFAICT, neither TTree nor RNTuple I/O are affected by this issue.

jblomer avatar Aug 30 '24 09:08 jblomer

> [1965517:tpc-tracker]: i= 2, mPadSubset type= 23, offset= 56, len=2, method=0 [optimized]
>
> as I would have expected it to be len=1. Can you explain me what is going on?

If the next data member (which should not be listed right after it) is of the same type, TStreamerInfo will collate them (note the optimized part).
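
The collation can be pictured with a toy model. This is a sketch under stated assumptions and not ROOT's implementation: `Element`, `collate` and `kMergedTypeOffset` are hypothetical names, the member `mNextMember` in the usage note is invented for illustration, and the offset of 20 is simply chosen so that type 3 becomes type 23, as in the log line quoted above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for one entry of the compiled element list; not a ROOT class.
struct Element {
   std::string name;
   int type;       // basic type code, e.g. 3 in the log above
   int offset;     // byte offset of the first member covered by this entry
   int len;        // number of consecutive members covered by this entry
   bool optimized; // true once further members were merged into this entry
};

// Assumption: merged entries are marked by adding 20 to the type code (3 -> 23).
constexpr int kMergedTypeOffset = 20;

// Merge runs of consecutive members that share the same basic type code.
std::vector<Element> collate(const std::vector<Element>& members)
{
   std::vector<Element> out;
   for (const Element& m : members) {
      assert(m.len == 1 && !m.optimized); // the input describes individual members
      const bool isBasic = m.type > 0 && m.type < kMergedTypeOffset;
      if (isBasic && !out.empty()) {
         Element& prev = out.back();
         const int prevBase = prev.optimized ? prev.type - kMergedTypeOffset : prev.type;
         if (prevBase == m.type) {
            prev.type = m.type + kMergedTypeOffset;
            prev.len += 1;
            prev.optimized = true;
            continue;
         }
      }
      out.push_back(m);
   }
   return out;
}
```

Feeding in `mName` (type 300), `mData` (type 300), `mPadSubset` (type 3, offset 56) and a hypothetical `mNextMember` (type 3) yields three entries, the last one being `mPadSubset` with type 23, offset 56 and len 2.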

pcanal avatar Sep 05 '24 17:09 pcanal

We should be able to fix the usage in regular I/O and TTree (which is also broken) when using a dictionary. Proper support in bare ROOT might be harder: the underlying size information is a bit harder to find and in some cases (top-level vector of enums) might not be available (yet?).

pcanal avatar Sep 05 '24 20:09 pcanal

> In general, how do we correctly handle vectors of enums with underlying types different than int that are on disk, before and after the patch?

With dictionaries, it seems to work fine (for embedded vectors, probably not for standalone vectors) because the TStreamerInfo of the containing class records the underlying type and thus knows when a conversion is needed. The corollary is that a class version number must be updated (to allow schema evolution) if one of the enum types it uses changes its underlying type.

pcanal avatar Sep 05 '24 20:09 pcanal

For the record, as you might have seen in https://github.com/AliceO2Group/AliceO2/pull/13464, simply changing the types breaks reading back old files (i.e. two shorts are read into an int). Could you comment on when you expect to have a fix for this on your side that applies to 6.32.2, and whether it will allow old code to still read new data (and vice versa, new code / old data)?

ktf avatar Sep 06 '24 05:09 ktf

Side note for the record: the original valgrind report and crash happen in the case where the vector<EnumType> is itself held in a vector (of CalArray) held in an object (CalDet).

I have a workaround that solves the problem for the case in the minimal reproducer. It revolves around setting a read rule for the vector of enums:

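template <typename E>
void LoadEnumCollection(/* const */ std::vector<E> &onfile, std::vector<E> &enums)
{
   // Number of enum values that fit in one int.
   constexpr size_t delta = sizeof(int)/sizeof(E);
   // The vector read from the file has delta times too many entries; keep only the leading ones.
   const size_t nvalues = onfile.size() / delta;
   onfile.resize(nvalues);
   std::swap(onfile, enums);
}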
#pragma read sourceClass="Event" checksums="[0xa2558fd6]" targetClass="Event" source="std::vector<PadFlags> mFlags" target="mFlags" code="{ LoadEnumCollection(onfile.mFlags, mFlags); }"

However, it does not yet work for the actual/original problem :(. (In the minimal reproducer, the container ends up with double the size it should have but there is no over-write/crash, while in the original the container ends up with the right size but with an over-write and thus a crash.)
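
To make the effect of the read rule concrete, here is a small sketch that can be compiled without ROOT. It repeats the helper so that it is self-contained and applies it to the four values printed by the minimal reproducer (`0 0 23824 0`). The enum declaration and the function `fixReproducerValues` are assumptions made for this illustration; the real reproducer's definitions may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumption: a 2-byte enum, as reported by the reproducer ("Size of PadFlags: 2").
enum class PadFlags : unsigned short { kNone = 0 };

// Same logic as the read rule helper, repeated to keep the sketch self-contained.
template <typename E>
void LoadEnumCollection(std::vector<E>& onfile, std::vector<E>& enums)
{
   static_assert(sizeof(E) <= sizeof(int), "sketch assumes enums no wider than int");
   constexpr std::size_t delta = sizeof(int) / sizeof(E);
   assert(onfile.size() % delta == 0); // the inflated size is a multiple of delta
   onfile.resize(onfile.size() / delta);
   std::swap(onfile, enums);
}

// Hypothetical helper: rebuilds the inflated vector seen after reading and fixes it up.
std::vector<PadFlags> fixReproducerValues()
{
   std::vector<PadFlags> onfile{static_cast<PadFlags>(0), static_cast<PadFlags>(0),
                                static_cast<PadFlags>(23824), static_cast<PadFlags>(0)};
   std::vector<PadFlags> fixed;
   LoadEnumCollection(onfile, fixed);
   return fixed;
}
```

With a 4-byte int, `delta` is 2, so the four entries shrink to the two leading values and the trailing entries are dropped.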

pcanal avatar Sep 06 '24 20:09 pcanal

The following custom Streamer works around the issue:

template <typename Flags>
inline void CalArray<Flags>::Streamer(TBuffer &R__b)
{
   // Stream an object of class CalArray<PadFlags>.

   if (R__b.IsReading()) {
      UInt_t R__s, R__c;
      Version_t R__v = R__b.ReadVersion(&R__s, &R__c);
      if (R__v <= 3) {
         {
            UInt_t start, count;
            Version_t vers = R__b.ReadVersion(&start, &count);

            std::vector<int> R__stl;
            R__stl.clear();
            int R__n;
            R__b >> R__n;
            R__stl.reserve(R__n);
            for (int R__i = 0; R__i < R__n; R__i++) {
               Int_t readtemp;
               R__b >> readtemp;
               R__stl.push_back(readtemp);
            }
            R__b.CheckByteCount(start, count, "stl collection of enums");

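            mFlags.clear();
            // Reinterpret the ints read from the buffer as 2-byte values;
            // the first R__n of them are taken as the enum values.
            auto data = reinterpret_cast<unsigned short*>(R__stl.data());
            constexpr size_t delta = sizeof(int)/sizeof(Flags);
            for(int i = 0; i < R__n; ++i)
               mFlags.push_back(static_cast<PadFlags>( data[i] ));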
         }
         int tmp;
         R__b >> tmp;
         mPadSubset = static_cast<PadSubset>(tmp);

         R__b.CheckByteCount(R__s, R__c, CalArray::IsA());
      } else {
         R__b.ReadClassBuffer(CalArray<Flags>::Class(),this, R__v, R__s, R__c);
      }
   } else {
      R__b.WriteClassBuffer(CalArray<Flags>::Class(),this);
   }
}

[Call to ReadClassBuffer was corrected to add missing parameters]
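
The unpacking step of the Streamer above reads the int buffer through an `unsigned short*`. A possible variant of that one step, sketched here without ROOT, copies the bytes with `std::memcpy` instead of casting the pointer. The names `packAsInts` and `unpackEnums` and the enum values are hypothetical, and the sketch assumes that the buffer in memory holds the n enum values contiguously at its start, which is what the loop over `data[i]` relies on. It says nothing about byte order differences between the writing and the reading machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Assumption for illustration: a 2-byte enum with made-up values.
enum class PadFlags : unsigned short { kNone = 0, kDead = 1, kNoisy = 2 };

// Hypothetical helper: builds a buffer of n ints whose leading bytes hold the
// n enum values back to back, mimicking the layout the Streamer expects.
template <typename E>
std::vector<int> packAsInts(const std::vector<E>& values)
{
   static_assert(std::is_enum<E>::value && sizeof(E) <= sizeof(int),
                 "sketch assumes an enum no wider than int");
   std::vector<int> raw(values.size(), 0);
   if (!values.empty())
      std::memcpy(raw.data(), values.data(), values.size() * sizeof(E));
   return raw;
}

// Hypothetical helper: copies the first n enum values out of the int buffer.
template <typename E>
std::vector<E> unpackEnums(const std::vector<int>& raw, std::size_t n)
{
   static_assert(std::is_enum<E>::value && sizeof(E) <= sizeof(int),
                 "sketch assumes an enum no wider than int");
   assert(n <= raw.size()); // guarantees n * sizeof(E) bytes are available
   std::vector<E> out(n);
   if (n != 0)
      std::memcpy(out.data(), raw.data(), n * sizeof(E));
   return out;
}
```

Round-tripping a vector through `packAsInts` and `unpackEnums<PadFlags>` returns the original values, which is the property the loop in the Streamer depends on.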

pcanal avatar Sep 06 '24 23:09 pcanal

Any followup to the bug itself? Will we have a fix in ROOT which avoids a custom streamer?

ktf avatar Oct 31 '24 08:10 ktf

> Any followup to the bug itself? Will we have a fix in ROOT which avoids a custom streamer?

Yes. https://github.com/root-project/root/pull/17009 solves the problem: files produced with those changes can be written and read without any customization. Files that were written prior to those changes and contain enums with a non-default size were written incorrectly, and reading them will require explicit customization. The data layout in the file depends on what the enum size was at the time of writing, and this information is not recorded in the file, so manual intervention is required.

pcanal avatar Dec 06 '24 16:12 pcanal

@ktf The PR was merged into master. Please let us know if you encounter any (new) problems.

pcanal avatar Jan 30 '25 21:01 pcanal

Hi @pcanal, @jblomer, @dpiparo,

It appears this issue is closed, but wasn't yet added to a project. Please add upcoming versions that will include the fix, or 'not applicable' otherwise.

Sincerely, :robot:

github-actions[bot] avatar Feb 04 '25 06:02 github-actions[bot]