Improve heterogeneous HCAL code

Open fwyzard opened this issue 3 weeks ago • 23 comments

PR description:

Synchronise between iterations over HCAL channels. This addresses a silent problem reported by compute-sanitizer --tool=racecheck.

Enable range checking in HCAL code.

PR validation:

Code runs.

fwyzard avatar Dec 04 '25 21:12 fwyzard

cms-bot internal usage

cmsbuild avatar Dec 04 '25 21:12 cmsbuild

enable gpu

fwyzard avatar Dec 04 '25 21:12 fwyzard

please test

fwyzard avatar Dec 04 '25 22:12 fwyzard

type bugfix

fwyzard avatar Dec 04 '25 22:12 fwyzard

type ngt

fwyzard avatar Dec 04 '25 22:12 fwyzard

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49554/47071

cmsbuild avatar Dec 04 '25 22:12 cmsbuild

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • CondFormats/HcalObjects (alca, db)
  • RecoLocalCalo/HcalRecProducers (reconstruction)

@Alejandro1400, @JanChyczynski, @arunhep, @atpathak, @francescobrivio, @jfernan2, @mandrenguyen, @perrotta, @srimanob can you please review it and eventually sign? Thanks.

@JanChyczynski, @PonIlya, @abdoulline, @apsallid, @bsunanda, @denizsun, @mariadalfonso, @mmusich, @rsreds, @salimcerci, @seemasharmafnal, @tocheng, @youyingli, @yuanchao this is something you requested to watch as well.

@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar Dec 04 '25 22:12 cmsbuild

The two unit tests that failed are

and they seem pretty much unrelated to these changes.

fwyzard avatar Dec 06 '25 08:12 fwyzard

ignore tests-rejected with manual-override

fwyzard avatar Dec 06 '25 08:12 fwyzard

@fwyzard Excuse my inexperience with GPU computing, but could you explain, mainly for my curiosity, why the order of [] operator and field access had to be changed?

JanChyczynski avatar Dec 08 '25 16:12 JanChyczynski

Hi @JanChyczynski, of course.

The reason lies in how the common Structure of Arrays data structure used in CMSSW is implemented and what functionality it provides.

When one writes

view.property()[i]

what happens is that view.property() returns a std::span over the column "property"; then the [] operator is applied to the span to return the i-th element.

Unfortunately std::span does not check that the i-th element is actually valid: one can (try to) access any element, and there is no check that the SoA has i elements or more.

So, if any bug slips through, the code may end up accessing random memory locations - which may result in a crash or in some data corruption.

When one writes

view[i].property()

the [] operator is applied to the SoA as a whole to return a proxy to the i-th row, and then property() returns the value of "property" for that element.

In this case the [] operator is implemented by the CMS SoA and does check that the i-th element is valid; if it is not, it prints a (hopefully) meaningful error message.

So, in case of bugs, it is much easier to figure out what is going wrong and where.
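The difference between the two access patterns can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ToyView`, `Row`, `UncheckedSpan` and the `energy` column are hypothetical names, not the CMSSW SoA implementation, and `UncheckedSpan` is only a stand-in that mimics the unchecked `operator[]` of std::span. The sketch throws an exception on a bad index; the real SoA may report the error differently.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for std::span: operator[] performs no range check.
template <typename T>
struct UncheckedSpan {
  T* data;
  std::size_t size;
  T& operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical one-column SoA view (illustration only, not the CMSSW code).
class ToyView {
public:
  // Proxy to a single row of the SoA.
  class Row {
  public:
    explicit Row(float* energy) : energy_(energy) {}
    float& energy() const { return *energy_; }

  private:
    float* energy_;
  };

  explicit ToyView(std::vector<float>& energy) : energy_(energy.data()), size_(energy.size()) {}

  // Column-first access, view.energy()[i]: the index is applied to the span, unchecked.
  UncheckedSpan<float> energy() const { return {energy_, size_}; }

  // Row-first access, view[i].energy(): the index is checked by the view itself.
  Row operator[](std::size_t i) const {
    if (i >= size_) {
      throw std::out_of_range("ToyView: index " + std::to_string(i) + " out of range, size is " +
                              std::to_string(size_));
    }
    return Row(energy_ + i);
  }

private:
  float* energy_;
  std::size_t size_;
};

// Returns true if row-first access to index i is rejected by the range check.
bool rowAccessIsRejected(const ToyView& view, std::size_t i) {
  try {
    (void)view[i].energy();
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}

// Usage: both forms agree for valid indices; only the row-first form catches a bad one.
void exampleUsage() {
  std::vector<float> energy{1.f, 2.f, 3.f};
  ToyView view(energy);
  assert(view.energy()[1] == 2.f);       // unchecked column-first access
  assert(view[1].energy() == 2.f);       // checked row-first access
  assert(rowAccessIsRejected(view, 3));  // index 3 is past the end of 3 elements
}
```

In this sketch `view.energy()[3]` would compile and silently read past the end of the column, which is the kind of bug the row-first form is meant to surface.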

fwyzard avatar Dec 08 '25 17:12 fwyzard

Thank you for a clear explanation!

I ran a quick regex search with Sourcegraph to check whether any occurrences of the .property()[i] syntax are left, and indeed there are over 20 occurrences in RecoLocalCalo/HcalRecProducers/plugins/alpaka/Mahi.dev.cc and some in HcalMahiConditionsESProducer.cc (which I don't know whether it is intended to be changed in this PR).

I just wanted to point it out and ask whether these occurrences are meant to stay with this syntax or should also be refactored.
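For reference, a search of this kind can be approximated with a short helper. This is only a sketch under assumptions: the pattern and the function name `countColumnFirstIndexing` are hypothetical and are not the actual Sourcegraph query. It counts places where a call with no arguments is immediately indexed, as in `.property()[i]`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts occurrences of the form `.name()[` in a piece of source text.
// The pattern is an assumption; it also matches unrelated code such as `vec.data()[i]`.
std::size_t countColumnFirstIndexing(const std::string& source) {
  static const std::regex pattern(R"(\.\s*\w+\s*\(\s*\)\s*\[)");
  auto begin = std::sregex_iterator(source.begin(), source.end(), pattern);
  auto end = std::sregex_iterator();
  return static_cast<std::size_t>(std::distance(begin, end));
}

// Usage: column-first indexing is counted, row-first indexing is not.
void exampleUsage() {
  assert(countColumnFirstIndexing("auto e = view.energy()[i];") == 1);
  assert(countColumnFirstIndexing("auto e = view[i].energy();") == 0);
}
```

Since the pattern is purely textual, each match still needs a manual look to confirm that it is an SoA column access before it is refactored.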

JanChyczynski avatar Dec 09 '25 16:12 JanChyczynski

It would be good to refactor them as well.

In this PR I did it only for those that were leading to some errors while debugging the behaviour on AMD GPUs, and the nearby ones.

fwyzard avatar Dec 09 '25 16:12 fwyzard

please test

Let's see if the MI300X tests are working now...

fwyzard avatar Dec 09 '25 16:12 fwyzard

-1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-912365/49864/summary.html
COMMIT: a1f2c7d1863d44371fba886cf51dbc3deb3ea59b
CMSSW: CMSSW_16_0_X_2025-12-09-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49554/49864/install.sh to create a dev area with all the needed externals and cmssw changes.

DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 5 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4273241
  • DQMHistoTests: Total failures: 67
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4273154
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially removed 8 lines from the logs
  • Reco comparison results: 251 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 29967
  • DQMHistoTests: Total nulls: 10
  • DQMHistoTests: Total successes: 118878
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially removed 6 lines from the logs
  • Reco comparison results: 222 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 38793
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 110049
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 248 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 30064
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 118780
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Dec 10 '25 20:12 cmsbuild

Please double check but errors still seem unrelated. Otherwise LGTM

JanChyczynski avatar Dec 11 '25 13:12 JanChyczynski

+1

JanChyczynski avatar Dec 11 '25 13:12 JanChyczynski

+1

jfernan2 avatar Dec 11 '25 14:12 jfernan2

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild avatar Dec 11 '25 14:12 cmsbuild

-1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-912365/49864/summary.html
COMMIT: a1f2c7d1863d44371fba886cf51dbc3deb3ea59b
CMSSW: CMSSW_16_0_X_2025-12-09-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49554/49864/install.sh to create a dev area with all the needed externals and cmssw changes.

DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 5 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4273241
  • DQMHistoTests: Total failures: 67
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4273154
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • Reco comparison results: 242 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 28041
  • DQMHistoTests: Total nulls: 12
  • DQMHistoTests: Total successes: 120802
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially removed 8 lines from the logs
  • Reco comparison results: 251 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 29967
  • DQMHistoTests: Total nulls: 10
  • DQMHistoTests: Total successes: 118878
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially removed 6 lines from the logs
  • Reco comparison results: 222 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 38793
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 110049
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 248 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148855
  • DQMHistoTests: Total failures: 30064
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 118780
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Dec 11 '25 15:12 cmsbuild

The latest test shows a failure on nvidia_t4. At first glance this doesn't seem to be the same failure for which the ignore tests-rejected command was issued. @fwyzard do you confirm that this PR can be merged?

mandrenguyen avatar Dec 11 '25 20:12 mandrenguyen

The error on the T4 machine seems unrelated, but let's wait until tomorrow and rerun the tests.

fwyzard avatar Dec 11 '25 22:12 fwyzard

test parameters:

  • enable = gpu
  • gpu = nvidia_t4

fwyzard avatar Dec 13 '25 03:12 fwyzard

@fwyzard should we give the tests another go, or should we merge directly?

mandrenguyen avatar Dec 16 '25 08:12 mandrenguyen

Let's rerun the tests, the T4 machine should be back by now.

fwyzard avatar Dec 16 '25 13:12 fwyzard

please test

fwyzard avatar Dec 16 '25 13:12 fwyzard

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-912365/49998/summary.html
COMMIT: a1f2c7d1863d44371fba886cf51dbc3deb3ea59b
CMSSW: CMSSW_16_0_X_2025-12-15-2300/el8_amd64_gcc13
Additional Tests: GPU,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49554/49998/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 6 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 9 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4280229
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4280200
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • Reco comparison results: 248 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 29809
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 119553
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Dec 17 '25 14:12 cmsbuild

+1

mandrenguyen avatar Dec 17 '25 15:12 mandrenguyen