[runtime_cxxmodules] Enable on AArch64

Open hahnjo opened this issue 1 year ago • 7 comments

It was disabled in commit a67863d33a ("Disable modules on aarch64 due to ODR violation") in 2019. I cannot reproduce these problems on lxplus-arm, so this tries to turn it back on.

hahnjo avatar Sep 10 '24 12:09 hahnjo

Test Results

    18 files, 18 suites, 4d 2h 21m 3s ⏱️
    2 662 tests: 2 662 ✅, 0 💤, 0 ❌
    46 198 runs: 46 198 ✅, 0 💤, 0 ❌

Results for commit b0121698.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Sep 10 '24 14:09 github-actions[bot]

I propose to merge it once we have the ARM nodes online to be able to test immediately, would this be ok?

dpiparo avatar Sep 27 '24 14:09 dpiparo

Yes, I'm waiting for the AArch64 node in our CI so we can test there, and after that I'd still like to ask CMS to run their tests.

hahnjo avatar Sep 27 '24 14:09 hahnjo

@aandvalenzuela @smuzaffar if you have some cycles, can you test this change with CMSSW on AArch64? This should align the configuration with x86_64 by also enabling runtime_cxxmodules by default.

hahnjo avatar Oct 11 '24 08:10 hahnjo

https://github.com/cms-sw/root/pull/212 is running cmssw aarch64 tests

smuzaffar avatar Oct 11 '24 08:10 smuzaffar

Hi, most of the cmssw tests passed, but for a few relvals we get runtime errors like [a]

[a] https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7b6638/42123/runTheMatrix-results/140.063_RunZeroBias2022D/step3_RunZeroBias2022D.log

cling JIT session error: In graph cling-module-926-jitted-objectbuffer, section .text._ZNK4reco10HitPattern23numberOfLostTrackerHitsENS0_11HitCategoryE: relocation target "_ZN4reco10HitPattern16missingHitFilterEt" at address 0x4000968500f0 is out of range of Page21 fixup at 0x4001a7270114 (_ZNK4reco10HitPattern23numberOfLostTrackerHitsENS0_11HitCategoryE, 0x4001a727010c + 0x8)
----- Begin Fatal Exception 11-Oct-2024 15:08:51 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing  Event run: 357735 lumi: 53 event: 87840020 stream: 0
   [1] Running path 'dqmoffline_1_step'
   [2] Prefetching for module NanoAODDQM/'nanoDQM'
   [3] Prefetching for module SimplePATTauFlatTableProducer/'boostedTauTable'
   [4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
   [5] Prefetching for module PATMuonRefSelector/'finalMuons'
   [6] Prefetching for module PATMuonUserDataEmbedder/'slimmedMuonsWithUserData'
   [7] Calling method for module EvaluateMuonMVAID/'muonMVAID'
   Additional Info:
      [a] Fatal Root Error: @SUB=TClingCallFunc::make_wrapper
Failed to compile
  ==== SOURCE BEGIN ====
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wformat-security"
__attribute__((used)) __attribute__((annotate("__cling__ptrcheck(off)")))
extern "C" void __cf_365(void* obj, int nargs, void** args, void* ret)
{
   if (ret) {
      new (ret) (double) (((const reco::TrackBase*)obj)->validFraction());
      return;
   }
   else {
      (void)(((const reco::TrackBase*)obj)->validFraction());
      return;
   }
}
#pragma clang diagnostic pop
  ==== SOURCE END ====

----- End Fatal Exception -------------------------------------------------
Another exception was caught while trying to clean up files after the primary fatal exception.


smuzaffar avatar Oct 11 '24 14:10 smuzaffar

Hi, most of the cmssw tests passed, but for a few relvals we get runtime errors [...]

Thanks for testing! This needs debugging (likely after CHEP)...

hahnjo avatar Oct 14 '24 06:10 hahnjo

Revisiting this PR before the end of the year: Thanks to the clear error message and some guesswork, I managed to reproduce the issue in a standalone ROOT session:

root [0] struct A { static void f() {} void take_f(void (*fp)()) { fp(); } void pass_f() { take_f(f); } void call_f() { f(); } };
root [1] A::f()
root [2] #include <sys/mman.h>
root [3] for (int i = 0; i < 1024 * 1024; i++) { mmap(nullptr, 8192, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); }
root [4] A a;
root [5] a.call_f()
root [6] a.pass_f()
cling JIT session error: In graph cling-module-16-jitted-objectbuffer, section .text._ZN11__cling_N501A6pass_fEv: relocation target "_ZN11__cling_N501A1fEv" at address 0xffffa18a4034 is out of range of Page21 fixup at 0xfffd88b60044 (_ZN11__cling_N501A6pass_fEv, 0xfffd88b6003c + 0x8)

Currently still investigating why this happens with runtime_cxxmodules but not without...
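
For readers unfamiliar with the error: `Page21` is the JITLink fixup kind for the AArch64 `ADRP` instruction, which encodes a signed 21-bit delta counted in 4 KiB pages and therefore only reaches targets within roughly ±4 GiB of the instruction. The reproducer reserves 1024 * 1024 mappings of 8192 bytes each, i.e. 8 GiB of address space, which presumably is what pushes later JIT allocations out of reach of the earlier ones. The following is only a sketch of that range check, not code from cling or LLVM; the helper names `pageDelta` and `fitsPage21` are made up for illustration, and the relocation addend is assumed to be zero. It needs C++14 or later.

```cpp
#include <cstdint>

// Hypothetical helper: signed distance, in 4 KiB pages, from the page holding
// the fixup (the ADRP instruction) to the page holding the target symbol.
constexpr std::int64_t pageDelta(std::uint64_t fixup, std::uint64_t target)
{
   return static_cast<std::int64_t>(target >> 12) - static_cast<std::int64_t>(fixup >> 12);
}

// Hypothetical helper: true if the page delta fits the signed 21-bit ADRP
// immediate, i.e. lies in [-2^20, 2^20 - 1] pages.
constexpr bool fitsPage21(std::uint64_t fixup, std::uint64_t target)
{
   constexpr std::int64_t limit = std::int64_t{1} << 20;
   const std::int64_t delta = pageDelta(fixup, target);
   return delta >= -limit && delta < limit;
}

// Addresses copied from the standalone reproducer above: target about 8.4 GiB
// above the fixup, so the delta cannot be encoded.
static_assert(!fitsPage21(0xfffd88b60044ULL, 0xffffa18a4034ULL),
              "reproducer target is out of ADRP range");

// Addresses copied from the CMS relval log: target about 4.3 GiB below the fixup.
static_assert(!fitsPage21(0x4001a7270114ULL, 0x4000968500f0ULL),
              "CMS relval target is out of ADRP range");

// Sanity check: a target 2 GiB away is reachable.
static_assert(fitsPage21(0x10000000ULL, 0x10000000ULL + (1ULL << 31)),
              "a 2 GiB distance fits");
```

Both reported address pairs fail the check at compile time, which matches the "out of range of Page21 fixup" message; it says nothing about why the allocations end up that far apart with runtime_cxxmodules enabled.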

hahnjo avatar Dec 10 '24 15:12 hahnjo

Alright, the issue with the reproducer in https://github.com/root-project/root/pull/16401#issuecomment-2532084281 is understood and fixed. Let's hope that this also fixes the CMS relvals - @aandvalenzuela @smuzaffar could you maybe run the tests again? Thanks in advance for all your help!

hahnjo avatar Dec 11 '24 08:12 hahnjo

Alright, the issue with the reproducer in #16401 (comment) is understood and fixed. Let's hope that this also fixes the CMS relvals - @aandvalenzuela @smuzaffar could you maybe run the tests again? Thanks in advance for all your help!

cmssw tests started via https://github.com/cms-sw/root/pull/215

smuzaffar avatar Dec 11 '24 08:12 smuzaffar

cmssw tests for aarch64 look good

smuzaffar avatar Dec 11 '24 17:12 smuzaffar

cmssw tests for aarch64 look good

Fantastic, thank you! Unfortunately our macOS nodes are not happy, so I'll need to push a slightly fixed version. I don't think we strictly need to test with full CMSSW once more...

hahnjo avatar Dec 11 '24 17:12 hahnjo

@vgvassilev FYI you already approved this three months ago, and now that the issue in CMS is fixed I plan to land this soon

hahnjo avatar Dec 12 '24 08:12 hahnjo

Go ahead.

vgvassilev avatar Dec 12 '24 08:12 vgvassilev