Improve heterogeneous HCAL code
PR description:
Synchronise between iterations over HCAL channels.
This addresses a silent problem reported by compute-sanitizer --tool=racecheck.
Enable range checking in HCAL code.
PR validation:
Code runs.
cms-bot internal usage
enable gpu
please test
type bugfix
type ngt
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49554/47071
A new Pull Request was created by @fwyzard for master.
It involves the following packages:
- CondFormats/HcalObjects (alca, db)
- RecoLocalCalo/HcalRecProducers (reconstruction)
@Alejandro1400, @JanChyczynski, @arunhep, @atpathak, @francescobrivio, @jfernan2, @mandrenguyen, @perrotta, @srimanob can you please review it and eventually sign? Thanks. @JanChyczynski, @PonIlya, @abdoulline, @apsallid, @bsunanda, @denizsun, @mariadalfonso, @mmusich, @rsreds, @salimcerci, @seemasharmafnal, @tocheng, @youyingli, @yuanchao this is something you requested to watch as well. @ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here
The two unit tests that failed are
- src/IOPool/Input/test/TestIOPoolInputSchemaEvolution
- src/FWIO/RNTupleTempTests/test/TestRNTupleTempEventHistory
and they seem pretty much unrelated to these changes.
ignore tests-rejected with manual-override
@fwyzard Excuse my inexperience with GPU computing, but could you explain, mainly for my curiosity, why the order of [] operator and field access had to be changed?
Hi @JanChyczynski, of course.
The reason lies in how the common Structure of Arrays data structure used in CMSSW is implemented and what functionality it provides.
When one writes
view.property()[i]
what happens is that view.property() returns a std::span over the column "property"; then the [] operator is applied to the span to return the i-th element.
Unfortunately std::span does not check that the i-th element is actually valid: one can (try to) access any element, and there is no check that the SoA has i elements or more.
So, if any bug slips through, the code may end up accessing random memory locations - which may result in a crash or in some data corruption.
When one writes
view[i].property()
the [] operator is applied to the SoA as a whole to return a proxy to the i-th row, and then property() returns the value of "property" for that element.
In this case the [] is implemented by the CMS SoA and does check whether the i-th element is valid; if it is not, it prints a (hopefully) meaningful error message.
So, in case of bugs, it is much easier to figure out what is going wrong and where.
Thank you for a clear explanation!
I ran a quick regex search with sourcegraph to check whether any occurrences of the .property()[i] syntax are left, and indeed there are over 20 occurrences in RecoLocalCalo/HcalRecProducers/plugins/alpaka/Mahi.dev.cc and some in HcalMahiConditionsESProducer.cc (which I am not sure is meant to be changed in this PR).
I just wanted to point it out and ask whether these occurrences are meant to keep this syntax or should also be refactored.
It would be good to refactor them as well.
In this PR I did it only for those that were leading to some errors while debugging the behaviour on AMD GPUs, and the nearby ones.
please test
Let's see if the MI300X tests are working now...
-1
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-912365/49864/summary.html
COMMIT: a1f2c7d1863d44371fba886cf51dbc3deb3ea59b
CMSSW: CMSSW_16_0_X_2025-12-09-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49554/49864/install.sh to create a dev area with all the needed externals and cmssw changes.
DAS Queries: The DAS query tests failed, see the summary page for details.
Comparison Summary
Summary:
- You potentially removed 2 lines from the logs
- ROOTFileChecks: Some differences in event products or their sizes found
- Reco comparison results: 5 differences found in the comparisons
- Reco comparison had 4 failed jobs
- DQMHistoTests: Total files compared: 53
- DQMHistoTests: Total histograms compared: 4273241
- DQMHistoTests: Total failures: 67
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 4273154
- DQMHistoTests: Total skipped: 20
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
- Checked 227 log files, 198 edm output root files, 53 DQM output files
- TriggerResults: no differences found
AMD_W7900 Comparison Summary
Summary:
- You potentially removed 8 lines from the logs
- Reco comparison results: 251 differences found in the comparisons
- Reco comparison had 6 failed jobs
- DQMHistoTests: Total files compared: 11
- DQMHistoTests: Total histograms compared: 148855
- DQMHistoTests: Total failures: 29967
- DQMHistoTests: Total nulls: 10
- DQMHistoTests: Total successes: 118878
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
- Checked 42 log files, 45 edm output root files, 11 DQM output files
- TriggerResults: no differences found
NVIDIA_H100 Comparison Summary
Summary:
- You potentially removed 6 lines from the logs
- Reco comparison results: 222 differences found in the comparisons
- Reco comparison had 6 failed jobs
- DQMHistoTests: Total files compared: 11
- DQMHistoTests: Total histograms compared: 148855
- DQMHistoTests: Total failures: 38793
- DQMHistoTests: Total nulls: 13
- DQMHistoTests: Total successes: 110049
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
- Checked 42 log files, 45 edm output root files, 11 DQM output files
- TriggerResults: no differences found
NVIDIA_L40S Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 248 differences found in the comparisons
- Reco comparison had 6 failed jobs
- DQMHistoTests: Total files compared: 11
- DQMHistoTests: Total histograms compared: 148855
- DQMHistoTests: Total failures: 30064
- DQMHistoTests: Total nulls: 11
- DQMHistoTests: Total successes: 118780
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
- Checked 42 log files, 45 edm output root files, 11 DQM output files
- TriggerResults: no differences found
Please double check but errors still seem unrelated. Otherwise LGTM
+1
+1
This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)
-1
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-912365/49864/summary.html
COMMIT: a1f2c7d1863d44371fba886cf51dbc3deb3ea59b
CMSSW: CMSSW_16_0_X_2025-12-09-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49554/49864/install.sh to create a dev area with all the needed externals and cmssw changes.
DAS Queries: The DAS query tests failed, see the summary page for details.
AMD_MI300X Comparison Summary
Summary:
- You potentially removed 2 lines from the logs
- Reco comparison results: 242 differences found in the comparisons
- Reco comparison had 6 failed jobs
- DQMHistoTests: Total files compared: 11
- DQMHistoTests: Total histograms compared: 148855
- DQMHistoTests: Total failures: 28041
- DQMHistoTests: Total nulls: 12
- DQMHistoTests: Total successes: 120802
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
- Checked 42 log files, 45 edm output root files, 11 DQM output files
- TriggerResults: no differences found
The latest test shows a failure on nvidia_t4. At first glance this doesn't seem to be the same failure for which the ignore tests-rejected command was issued. @fwyzard do you confirm that this PR can be merged?
The error on the T4 machine seems unrelated, but let's wait until tomorrow and rerun the tests.
test parameters:
- enable = gpu
- gpu = nvidia_t4
@fwyzard should we give the tests another go, or should we merge directly?
Let's rerun the tests, the T4 machine should be back by now.
please test
+1
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-912365/49998/summary.html
COMMIT: a1f2c7d1863d44371fba886cf51dbc3deb3ea59b
CMSSW: CMSSW_16_0_X_2025-12-15-2300/el8_amd64_gcc13
Additional Tests: GPU,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49554/49998/install.sh to create a dev area with all the needed externals and cmssw changes.
Comparison Summary
Summary:
- You potentially added 6 lines to the logs
- ROOTFileChecks: Some differences in event products or their sizes found
- Reco comparison results: 9 differences found in the comparisons
- Reco comparison had 4 failed jobs
- DQMHistoTests: Total files compared: 53
- DQMHistoTests: Total histograms compared: 4280229
- DQMHistoTests: Total failures: 9
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 4280200
- DQMHistoTests: Total skipped: 20
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
- Checked 227 log files, 198 edm output root files, 53 DQM output files
- TriggerResults: no differences found
NVIDIA_T4 Comparison Summary
Summary:
- You potentially removed 1 lines from the logs
- Reco comparison results: 248 differences found in the comparisons
- Reco comparison had 6 failed jobs
- DQMHistoTests: Total files compared: 11
- DQMHistoTests: Total histograms compared: 149371
- DQMHistoTests: Total failures: 29809
- DQMHistoTests: Total nulls: 9
- DQMHistoTests: Total successes: 119553
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
- Checked 42 log files, 45 edm output root files, 11 DQM output files
- TriggerResults: no differences found
+1