cmssw

Memory profiling causes rocmIsEnabled to segfault

Open iarspider opened this issue 10 months ago • 29 comments

Output of LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so gdb --args rocmIsEnabled (these libraries are preloaded if --maxmem_profile is passed to cmsDriver):

#0  0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator delete(void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/lib/el8_amd64_gcc12/libPerfToolsAllocMonitorPreload.so
#2  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#13 0x000015554af7fd6b in COMGR::AMDGPUCompiler::executeInProcessDriver(llvm::ArrayRef<char const*>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#14 0x000015554af81fdc in COMGR::AMDGPUCompiler::processFile(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#15 0x000015554af82618 in COMGR::AMDGPUCompiler::processFiles(amd_comgr_data_kind_s, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#16 0x000015554af9356d in amd_comgr_do_action () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#17 0x0000155553e8c205 in amd::device::Program::compileAndLinkExecutable(amd_comgr_data_set_s, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amd::option::Options*, char**, unsigned long*, amd::device::Program::file_type_t) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#18 0x0000155553e8ebd4 in amd::device::Program::linkImplLC(amd::option::Options*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#19 0x0000155553e8b141 in amd::device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#20 0x0000155553eb4b26 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#21 0x0000155553e85ded in amd::Device::BlitProgram::create(amd::Device*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#22 0x0000155553ec2edb in amd::roc::Device::createBlitProgram() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#23 0x0000155553f06aa8 in amd::roc::KernelBlitManager::createProgram(amd::roc::Device&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#24 0x0000155553edb4cd in amd::roc::VirtualGPU::create() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#25 0x0000155553ebce08 in amd::roc::Device::createVirtualDevice(amd::CommandQueue*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#26 0x0000155553ea9f74 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#27 0x0000155553e05e19 in hip::Stream::Stream(hip::Device*, hip::Stream::Priority, unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, hipStreamCaptureStatus) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#28 0x0000155553ca1c94 in hip::Device::NullStream(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#29 0x0000155553d4dba7 in hip::ihipMemset(void*, long, unsigned long, unsigned long, ihipStream_t*, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#30 0x0000155553d7ed2c in hip::hipMemset(void*, int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#31 0x00000000004012fd in isRocmDeviceSupported(int) ()
#32 0x0000000000401190 in main ()

iarspider avatar Feb 25 '25 14:02 iarspider

assign core,heterogeneous

iarspider avatar Feb 25 '25 14:02 iarspider

New categories assigned: core,heterogeneous

@Dr15Jones,@fwyzard,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Feb 25 '25 14:02 cmsbuild

cms-bot internal usage

cmsbuild avatar Feb 25 '25 14:02 cmsbuild

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Feb 25 '25 14:02 cmsbuild

This was discovered when preparing https://github.com/cms-sw/cms-bot/pull/2418.

iarspider avatar Feb 25 '25 14:02 iarspider

@iarspider what was the command you issued to start gdb? I do not understand what you meant when you wrote "if --maxmem_profile is passed to cmsRun" as cmsRun does not accept --maxmem_profile as a command line argument. I could believe a configuration file passed to cmsRun would take that argument.

Dr15Jones avatar Feb 25 '25 15:02 Dr15Jones

> I do not understand what you meant when you wrote "if --maxmem_profile is passed to cmsRun" as cmsRun does not accept --maxmem_profile as a command line argument. I could believe a configuration file passed to cmsRun would take that argument.

It seems to be the argument for cmsDriver.py. IIRC its impact is just the LD_PRELOAD.

makortel avatar Feb 25 '25 15:02 makortel

So I looked at the output of one of the failing RelVals in the PR in question. The log contains

Starting env LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so  cmsRun -j JobReport3.xml  step3_RAW2DIGI_RECO_DQM.py
Memory Report: total memory requested: 2227
Memory Report:  max memory used: 2280
Memory Report:  presently used: 0
Memory Report:  # allocations calls:   13
Memory Report:  # deallocations calls: 16
----- Begin Fatal Exception 24-Feb-2025 15:11:42 EET-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named step3_RAW2DIGI_RECO_DQM.py
Exception Message:
 unknown python problem occurred.
ValueError: -11 is not a valid PlatformStatus

At:
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/external/python3/3.9.14-ccc34bac15aa449b4c76ba24d02d2fd7/lib/python3.9/enum.py(713): __new__
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/external/python3/3.9.14-ccc34bac15aa449b4c76ba24d02d2fd7/lib/python3.9/enum.py(384): __call__
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/src/HeterogeneousCore/ROCmCore/python/ProcessAcceleratorROCm.py(19): enabledLabels
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/src/FWCore/ParameterSet/python/Config.py(1535): handleProcessAccelerators
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/src/FWCore/ParameterSet/python/Config.py(1468): fillProcessDesc
  <string>(2): <module>

----- End Fatal Exception -------------------------------------------------

So cmsRun starts up and, when running the Python code that determines which hardware accelerators are available, it tries to use the ProcessAcceleratorROCm Python module. It appears that when that is loaded it crashes. The call to enabledLabels seen in the Python stack has

https://github.com/cms-sw/cmssw/blob/aacbf30ec812748373088ee4b79b03e4c06bd3ea/HeterogeneousCore/ROCmCore/python/ProcessAcceleratorROCm.py#L19

so that is the origin of the call to the standalone binary rocmIsEnabled mentioned in the description of the issue.
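To illustrate why the log ends in "ValueError: -11 is not a valid PlatformStatus", here is a minimal sketch. It assumes the return code comes from Python's subprocess machinery, which reports a child killed by signal N as -N (11 is SIGSEGV on Linux). The enum below is a stand-in: its member names are hypothetical and are not copied from the CMSSW sources, and `describe_exit` is an illustrative helper, not part of ProcessAcceleratorROCm.

```python
import enum
import signal


class PlatformStatus(enum.IntEnum):
    # Hypothetical members; the real CMSSW enum may use other names and values.
    Success = 0
    PlatformNotAvailable = 1
    RuntimeNotAvailable = 2


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # subprocess convention on POSIX: -N means "terminated by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    try:
        return PlatformStatus(returncode).name
    except ValueError:
        return f"unexpected exit code {returncode}"
```

Converting the raw return code directly, as in `PlatformStatus(-11)`, raises the ValueError seen in the log, whereas checking for a negative value first makes the segfault explicit.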

Dr15Jones avatar Feb 25 '25 15:02 Dr15Jones

We have also seen this behavior before: https://github.com/cms-sw/cmssw/issues/45964#issuecomment-2433236456

makortel avatar Feb 25 '25 15:02 makortel

So I ran

LD_PRELOAD="libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so" rocmIsEnabled; echo $?

a dozen or so times at FNAL using a CMSSW_15_1_RNTUPLE_X_2025-02-16-2300 work area (since I had it handy) and I saw no problems.

What is needed to get a consistent (or at least a probable) crash?
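For an intermittent crash, repeating the command and tallying the exit codes can help estimate how often it fails. The following is only a sketch: `tally_exit_codes` is a hypothetical helper, and the command and preload list are whatever is being tested.

```python
import collections
import os
import subprocess


def tally_exit_codes(cmd, runs=12, extra_env=None):
    """Run cmd several times and count how often each return code occurs."""
    env = dict(os.environ)
    env.update(extra_env or {})
    counts = collections.Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # A negative return code means the child was killed by a signal.
        counts[result.returncode] += 1
    return counts
```

A possible invocation, assuming rocmIsEnabled is on the PATH, would be `tally_exit_codes(["rocmIsEnabled"], extra_env={"LD_PRELOAD": "libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so"})`.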

Dr15Jones avatar Feb 25 '25 15:02 Dr15Jones

@Dr15Jones I got this crash on a node (LUMI) with ROCm-enabled GPU using CMSSW_15_1_X_2025-02-23-0000 (but I think only the first part is important).

iarspider avatar Feb 25 '25 15:02 iarspider

It seems technically possible to avoid passing LD_PRELOAD (or to filter out only these libraries, if it contains something else) at the point where ProcessAcceleratorROCm calls rocmIsEnabled.
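A rough sketch of that idea, under the assumption that the helper binary is launched from Python with an explicit environment. The function name `env_without_preload` is hypothetical and this is not the code of any actual PR:

```python
import re


def env_without_preload(environ, drop=None):
    """Return a copy of environ with LD_PRELOAD removed or filtered.

    If drop is None, LD_PRELOAD is removed entirely. Otherwise only the
    libraries named in drop are removed and the rest are kept.
    """
    env = dict(environ)
    preload = env.pop("LD_PRELOAD", "")
    if drop is not None:
        # LD_PRELOAD entries may be separated by colons or spaces.
        kept = [lib for lib in re.split(r"[:\s]+", preload)
                if lib and lib not in drop]
        if kept:
            env["LD_PRELOAD"] = ":".join(kept)
    return env
```

The result could then be passed as `env=` to `subprocess.run(["rocmIsEnabled"], env=env_without_preload(os.environ))`, so that the preloaded libraries stay active for cmsRun itself but not for the helper process.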

makortel avatar Feb 25 '25 16:02 makortel

Of course avoiding the crash in rocmIsEnabled when doing the LD_PRELOAD in cmsRun is great, but is this just a canary in the coal mine where we will crash in cmsRun itself when we try to use the rocm based GPU?

Dr15Jones avatar Feb 25 '25 16:02 Dr15Jones

I vaguely recall the MaxMemoryPreload was supposed to be run only on select IB flavors (I can't find the discussion though, I did find the cms-bot PR adding the use of --maxmem_profile https://github.com/cms-sw/cms-bot/pull/2202).

makortel avatar Feb 25 '25 16:02 makortel

@gartung mentioned he also experienced a crash in rocmIsEnabled (on a machine without an AMD GPU) when preloading other (profiling) libraries. So maybe we should drop the LD_PRELOAD from the environment when calling rocmIsEnabled (to not cause problems on non-AMD-GPU machines)

makortel avatar Feb 25 '25 17:02 makortel

@gartung this actually works for me on a bare metal node, with a Radeon Pro W7800:

$ cd /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_0_pre3

$ cmsenv

$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmIsEnabled
Memory Report: total memory requested: 231066386
Memory Report:  max memory used: 14799096
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   732952
Memory Report:  # deallocations calls: 741971

$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmComputeCapabilities
   0     gfx1100    AMD Radeon PRO W7800
Memory Report: total memory requested: 231094400
Memory Report:  max memory used: 14799304
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   733031
Memory Report:  # deallocations calls: 743723

Update: on this node it also works within an Alma 8 or Alma 9 container:

Singularity> rocmIsEnabled

Singularity> LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmComputeCapabilities
   0     gfx1100    AMD Radeon Graphics
Memory Report: total memory requested: 231103916
Memory Report:  max memory used: 14799200
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   733223
Memory Report:  # deallocations calls: 743916
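To compare such reports across nodes or releases, the "Memory Report:" lines can be parsed into a dictionary. This is a sketch based only on the line format visible in this thread; `parse_memory_report` is a hypothetical helper and the format is not guaranteed to be stable.

```python
import re

# Matches lines such as "Memory Report:  max memory used: 14799096".
_REPORT_LINE = re.compile(r"^Memory Report:\s*(.*?):\s*(\d+)\s*$")


def parse_memory_report(text):
    """Extract the numeric fields of a Memory Report into a dict."""
    report = {}
    for line in text.splitlines():
        match = _REPORT_LINE.match(line.strip())
        if match:
            report[match.group(1)] = int(match.group(2))
    return report
```

The keys are the labels exactly as printed, for example "max memory used" or "# allocations calls".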

fwyzard avatar Feb 25 '25 17:02 fwyzard

I confirm that it does fail on LUMI, in an Alma 8 container:

$ rocmComputeCapabilities
   0    gfx90a:sramecc+:xnack-    AMD Instinct MI250X

$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmComputeCapabilities
Segmentation fault

fwyzard avatar Feb 25 '25 17:02 fwyzard

I was using export LD_PRELOAD=libprofiler.so and then running cmsRun config.py. The os.system('rocmIsEnabled') call returned -11 instead of the expected 0, 1, or 2.

gartung avatar Feb 25 '25 18:02 gartung

@fwyzard Would it be at all possible to use a DEBUG version of CMSSW on LUMI, use gdb and get a full backtrace?

Dr15Jones avatar Feb 25 '25 18:02 Dr15Jones

> So maybe we should drop the LD_PRELOAD from the environment when calling rocmIsEnabled (to not cause problems on non-AMD-GPU machines)

FWIW I opened a draft PR to do that https://github.com/cms-sw/cmssw/pull/47452

makortel avatar Feb 25 '25 21:02 makortel

@Dr15Jones

> Would it be at all possible to use a DEBUG version of CMSSW on LUMI, use gdb and get a full backtrace?

ehr... with a debug build (I used CMSSW_15_1_DBG_X_2025-02-19-2300) the LD_PRELOAD command does not crash:

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_DBG_X_2025-02-19-2300$ cmsenv

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_DBG_X_2025-02-19-2300$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmIsEnabled 
Memory Report: total memory requested: 248269565
Memory Report:  max memory used: 14760840
Memory Report:  presently used: 16
Memory Report:  # allocations calls:   860831
Memory Report:  # deallocations calls: 869850

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_DBG_X_2025-02-19-2300$ echo $?
0

fwyzard avatar Feb 25 '25 22:02 fwyzard

Actually it also works with the same non-DEBUG IB:

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-19-2300$ cmsenv

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-19-2300$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmIsEnabled 
Memory Report: total memory requested: 248052703
Memory Report:  max memory used: 14760776
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   860799
Memory Report:  # deallocations calls: 869818

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-19-2300$ echo $?
0

fwyzard avatar Feb 25 '25 22:02 fwyzard

It actually works on all releases earlier than CMSSW_15_1_X_2025-02-21-2300, and fails on that release and all more recent ones.
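Finding the first failing build in a chronologically ordered list can be done by bisection, assuming every build after the first failure also fails. The sketch below is illustrative: `first_failing` is a hypothetical helper, and the `fails` predicate stands in for actually running the preloaded command in each release area.

```python
def first_failing(builds, fails):
    """Return the earliest build for which fails(build) is true, or None.

    Assumes builds is ordered oldest to newest and that once a build
    fails, all later builds fail as well.
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(builds[mid]):
            hi = mid          # first failure is at mid or earlier
        else:
            lo = mid + 1      # first failure is after mid
    return builds[lo] if lo < len(builds) else None
```

With N builds this needs about log2(N) test runs instead of N, which matters when each test requires setting up a release area on a GPU node.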

fwyzard avatar Feb 25 '25 22:02 fwyzard

TBB (version v2022.0.0) was updated for CMSSW_15_1_X_2025-02-21-2300 and above

smuzaffar avatar Feb 25 '25 22:02 smuzaffar

Mhm 🤔

fwyzard avatar Feb 25 '25 22:02 fwyzard

Anyway, here is the GDB stack trace from rocmIsEnabled in CMSSW_15_1_X_2025-02-21-2300:

$ gdb -ex 'set pagination off' -ex 'set environment LD_PRELOAD libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so' -ex r -ex bt rocmIsEnabled

...

Thread 1 "rocmIsEnabled" received signal SIGSEGV, Segmentation fault.
0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#0  0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator delete(void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/lib/el8_amd64_gcc12/libPerfToolsAllocMonitorPreload.so
#2  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#13 0x000015554af7fd6b in COMGR::AMDGPUCompiler::executeInProcessDriver(llvm::ArrayRef<char const*>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#14 0x000015554af81fdc in COMGR::AMDGPUCompiler::processFile(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#15 0x000015554af82618 in COMGR::AMDGPUCompiler::processFiles(amd_comgr_data_kind_s, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#16 0x000015554af9356d in amd_comgr_do_action () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#17 0x0000155553e8c205 in amd::device::Program::compileAndLinkExecutable(amd_comgr_data_set_s, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amd::option::Options*, char**, unsigned long*, amd::device::Program::file_type_t) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#18 0x0000155553e8ebd4 in amd::device::Program::linkImplLC(amd::option::Options*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#19 0x0000155553e8b141 in amd::device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#20 0x0000155553eb4b26 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#21 0x0000155553e85ded in amd::Device::BlitProgram::create(amd::Device*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#22 0x0000155553ec2edb in amd::roc::Device::createBlitProgram() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#23 0x0000155553f06aa8 in amd::roc::KernelBlitManager::createProgram(amd::roc::Device&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#24 0x0000155553edb4cd in amd::roc::VirtualGPU::create() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#25 0x0000155553ebce08 in amd::roc::Device::createVirtualDevice(amd::CommandQueue*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#26 0x0000155553ea9f74 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#27 0x0000155553e05e19 in hip::Stream::Stream(hip::Device*, hip::Stream::Priority, unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, hipStreamCaptureStatus) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#28 0x0000155553ca1c94 in hip::Device::NullStream(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#29 0x0000155553d4dba7 in hip::ihipMemset(void*, long, unsigned long, unsigned long, ihipStream_t*, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#30 0x0000155553d7ed2c in hip::hipMemset(void*, int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#31 0x00000000004012fd in isRocmDeviceSupported(int) ()
#32 0x0000000000401190 in main ()

fwyzard avatar Feb 25 '25 22:02 fwyzard

Playing with breakpoints, I can get a similar stack trace with CMSSW_15_1_X_2025-02-21-1100:

#0  0x00001555538a82a0 in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator delete(void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/lib/el8_amd64_gcc12/libPerfToolsAllocMonitorPreload.so
#2  0x000015554f8dd1fb in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#13 0x000015554af7fd6b in COMGR::AMDGPUCompiler::executeInProcessDriver(llvm::ArrayRef<char const*>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#14 0x000015554af81fdc in COMGR::AMDGPUCompiler::processFile(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#15 0x000015554af82618 in COMGR::AMDGPUCompiler::processFiles(amd_comgr_data_kind_s, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#16 0x000015554af9356d in amd_comgr_do_action () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#17 0x0000155553e8c205 in amd::device::Program::compileAndLinkExecutable(amd_comgr_data_set_s, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amd::option::Options*, char**, unsigned long*, amd::device::Program::file_type_t) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#18 0x0000155553e8ebd4 in amd::device::Program::linkImplLC(amd::option::Options*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#19 0x0000155553e8b141 in amd::device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#20 0x0000155553eb4b26 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#21 0x0000155553e85ded in amd::Device::BlitProgram::create(amd::Device*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#22 0x0000155553ec2edb in amd::roc::Device::createBlitProgram() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#23 0x0000155553f06aa8 in amd::roc::KernelBlitManager::createProgram(amd::roc::Device&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#24 0x0000155553edb4cd in amd::roc::VirtualGPU::create() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#25 0x0000155553ebce08 in amd::roc::Device::createVirtualDevice(amd::CommandQueue*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#26 0x0000155553ea9f74 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#27 0x0000155553e05e19 in hip::Stream::Stream(hip::Device*, hip::Stream::Priority, unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, hipStreamCaptureStatus) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#28 0x0000155553ca1c94 in hip::Device::NullStream(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#29 0x0000155553d4dba7 in hip::ihipMemset(void*, long, unsigned long, unsigned long, ihipStream_t*, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#30 0x0000155553d7ed2c in hip::hipMemset(void*, int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#31 0x00000000004012fd in isRocmDeviceSupported(int) ()
#32 0x0000000000401190 in main ()

but then if I continue, it works:

(gdb) disable  
(gdb) c
Continuing.
[Thread 0x155443dff700 (LWP 5789) exited]
[Thread 0x1555493ff700 (LWP 5787) exited]
Memory Report: total memory requested: 248083365
Memory Report:  max memory used: 14764880
Memory Report:  presently used: 0
Memory Report:  # allocations calls:   860865
Memory Report:  # deallocations calls: 869884
[Inferior 1 (process 5786) exited normally]

fwyzard • Feb 25 '25 22:02

And here is the top (bottom?) of the stack trace with CMSSW_15_1_X_2025-02-21-2300, after rebuilding the relevant packages with debug symbols:

#0  0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator()<void*> (ptr=0xcf04c0, __closure=<synthetic pointer>) at src/PerfTools/AllocMonitorPreload/src/memory_proxies.cc:326
#2  cms::perftools::AllocMonitorRegistry::deallocCalled<operator delete(void*)::<lambda(auto:26)>, operator delete(void*)::<lambda(auto:27)> > (iDealloc=..., iGetActual=..., iPtr=0xcf04c0, this=0x1555555320c0 <cms::perftools::AllocMonitorRegistry::instance()::s_registry>) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/src/PerfTools/AllocMonitor/interface/AllocMonitorRegistry.h:133
#3  operator delete (ptr=0xcf04c0) at src/PerfTools/AllocMonitorPreload/src/memory_proxies.cc:326
#4  operator delete (ptr=0xcf04c0) at src/PerfTools/AllocMonitorPreload/src/memory_proxies.cc:318
#5  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
...
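
For readers unfamiliar with the pattern in frames #1 to #4: the preloaded `operator delete` hands the pointer to a registry together with two lambdas (one to query the actual block size, one to perform the real deallocation), and only then does the call reach `_int_free`. Below is a minimal sketch of that proxy pattern, not the code in `memory_proxies.cc`. `SketchRegistry`, `sketchRegistry()` and `sketchRoundTrip()` are hypothetical names, the argument order is my own, and the sketch assumes glibc (`malloc_usable_size`) and forwards straight to `std::malloc`/`std::free` rather than to whatever the real library resolves at load time.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <malloc.h>  // malloc_usable_size (glibc-specific assumption)

// Hypothetical stand-in for a monitoring registry; NOT the CMSSW class.
struct SketchRegistry {
  std::atomic<std::size_t> allocCalls{0};
  std::atomic<std::size_t> deallocCalls{0};
  std::atomic<std::size_t> bytesInUse{0};

  // Run the real allocation, then record the call and the actual block size.
  template <typename Alloc, typename GetActual>
  void* allocCalled(std::size_t requested, Alloc alloc, GetActual getActual) {
    void* ptr = alloc(requested);
    if (ptr != nullptr) {
      allocCalls.fetch_add(1);
      bytesInUse.fetch_add(getActual(ptr));
    }
    return ptr;
  }

  // Record the call while the block is still valid, then really free it.
  template <typename GetActual, typename Dealloc>
  void deallocCalled(void* ptr, GetActual getActual, Dealloc dealloc) {
    if (ptr != nullptr) {
      deallocCalls.fetch_add(1);
      bytesInUse.fetch_sub(getActual(ptr));
    }
    dealloc(ptr);
  }
};

// Function-local static holding only atomics, so it never allocates itself.
inline SketchRegistry& sketchRegistry() {
  static SketchRegistry registry;
  return registry;
}

// Replacement global operators: every new/delete in the process goes through
// the registry before reaching malloc/free.
void* operator new(std::size_t size) {
  void* ptr = sketchRegistry().allocCalled(
      size,
      [](std::size_t n) { return std::malloc(n != 0 ? n : 1); },
      [](void* p) { return malloc_usable_size(p); });
  if (ptr == nullptr) {
    throw std::bad_alloc();
  }
  return ptr;
}

void operator delete(void* ptr) noexcept {
  sketchRegistry().deallocCalled(
      ptr,
      [](void* p) { return malloc_usable_size(p); },
      [](void* p) { std::free(p); });
}

// Usage sketch: one allocate/free round trip as seen by the registry.
inline bool sketchRoundTrip() {
  SketchRegistry& reg = sketchRegistry();
  const std::size_t before = reg.deallocCalls.load();
  void* p = ::operator new(64);
  assert(p != nullptr);
  ::operator delete(p);
  return reg.deallocCalls.load() == before + 1;
}
```

The point of the sketch is only that the proxy queries the block and then passes the very same pointer on to `free`, which is why a pointer that the underlying allocator does not recognise would fault inside `_int_free` rather than in the proxy itself. Whether that is what happens here is not established by the trace.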

fwyzard • Feb 25 '25 22:02

This topic really deserves its own issue, but before opening one I'd like to check whether these frames of the stack trace shown in the issue description

#2  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2

are also expected to be called from cmsRun? I'd suppose so (since the call chain starts from hip::hipMemset()), but I'd like to be sure before opening the other issue.

makortel • May 30 '25 16:05