[GPU] Multiple RelVals failing with memory allocation error
Hello,
There are multiple RelVals failing with the following exception in GPU IBs:
----- Begin Fatal Exception 05-Feb-2024 04:19:50 CET-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 366727 lumi: 89 event: 131642946 stream: 3
[1] Running path 'MC_Run3_PFScoutingPixelTracking_v22'
[2] Calling method for module HBHERecHitProducerGPU/'hltHbherecoGPU'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/5569e690981e3c5d49d7743adaadedca/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-02-04-2300/src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 489:
cudaCheck(error = cudaMalloc(&search_key.d_ptr, search_key.bytes));
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------
It seems to be caused by the modifications in https://github.com/cms-sw/cmssw/pull/43804.
FYI, @iarspider
Thanks, Andrea
A new Issue was created by @aandvalenzuela Andrea Valenzuela.
@Dr15Jones, @sextonkennedy, @rappoccio, @makortel, @antoniovilela, @smuzaffar can you please review it and eventually sign/assign? Thanks.
I do not see any mechanism that would make https://github.com/cms-sw/cmssw/pull/43804 interfere with RelVals...
I think these problems started to appear in CMSSW_14_0_X_2024-01-30-2300. And in that IB there were a few updates related to the HLT menu:
- #43758 from @mmasciov: Add non-diagonal errors to scouting vertices
- #43788 from @cms-tsg-storm: HLT menu development for 13_3_X (1/N) [14_0_X]
- #43294 from @PixelTracksAlpaka: Pixel Alpaka Migration: Configs and Fixes [VII]
There are also other exceptions observed, e.g.:
----- Begin Fatal Exception 31-Jan-2024 04:02:11 CET-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 2 event: 101 stream: 3
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/bc88fe327b7ccd90d4bda9e20e6ec926/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-30-2300/src/RecoTracker/PixelSeeding/plugins/CAHitNtupletGeneratorKernels.cu, line 64:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------
but they all seem to be related to cudaErrorMemoryAllocation: out of memory.
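For reference, a quick way to check how much memory is actually free on each GPU when one of these failures shows up would be something like the sketch below (an assumption on my side: that the pynvml bindings for the NVIDIA Management Library are available on the node; nvidia-smi gives the same information):

# Diagnostic sketch: print used/free/total memory for every visible GPU.
# Assumes the pynvml package is installed; not part of any CMSSW workflow.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: used {mem.used / 1024**2:.0f} MiB, "
              f"free {mem.free / 1024**2:.0f} MiB, "
              f"total {mem.total / 1024**2:.0f} MiB")
finally:
    pynvml.nvmlShutdown()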
assign heterogeneous, hlt
New categories assigned: heterogeneous,hlt
@Martin-Grunewald, @mmusich, @fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
I think these problems started to appear in CMSSW_14_0_X_2024-01-30-2300.
Indeed
- CMSSW_14_0_X_2024-01-29-2300 showed no failures
- CMSSW_14_0_X_2024-01-30-2300 showed 6 failures
- CMSSW_14_0_X_2024-01-31-2300 showed no failures
- CMSSW_14_0_X_2024-02-02-2300 showed 5 failures
- CMSSW_14_0_X_2024-02-04-2300 showed 10 failures
(although it is difficult to say what happened before 01-29)
type tracking (even though the association is not strong, it does look like it is related to the pixel tracking Alpaka migration)
@AdrianoDee @fwyzard Do I understand correctly that the workflows added in https://github.com/cms-sw/cmssw/pull/43294 (.402, .403, .404) are not run (yet?) in GPU IBs?
Did https://github.com/cms-sw/cmssw/pull/43788 change anything wrt. GPU modules (like more instances)? (on a cursory look I didn't catch any, but could have easily missed something subtle)
Did https://github.com/cms-sw/cmssw/pull/43788 change anything wrt. GPU modules (like more instances)?
It should not have.
@AdrianoDee @fwyzard Do I understand correctly that the workflows added in https://github.com/cms-sw/cmssw/pull/43294 (.402, .403, .404) are not run (yet?) in GPU IBs?
Yes, we haven't added them to relvals_gpu yet.
From the Opensearch history I see that workflow 12434.512 step2 first failed with StdException [a] for CMSSW_14_0_GPU_X_2024-01-17-2300. The job was running on one of the HLT nodes provided by @fwyzard. This was then fixed by @fwyzard by rebooting the node.
For the CMSSW_14_0_GPU_X_2024-02-02-2300 IB this workflow failed again with the error cudaErrorMemoryAllocation: out of memory [b]. The changes between this release and the previous GPU IB are https://github.com/cms-sw/cmssw/compare/CMSSW_14_0_GPU_X_2024-01-31-2300...CMSSW_14_0_GPU_X_2024-02-02-2300
[a]
An exception of category 'StdException' occurred while
[0] Constructing the EventProcessor
[1] Constructing service of type AlpakaServiceCudaAsync
[2] Constructing service of type CUDAService
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/5f6160ccac866104fd4106c72252358d/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-17-2300/src/HeterogeneousCore/CUDAServices/plugins/CUDAService.cc, line 193:
nvmlCheck(nvmlInitWithFlags(NVML_INIT_FLAG_NO_GPUS | NVML_INIT_FLAG_NO_ATTACH));
NVML Error 18: Driver/library version mismatch
[b]
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 2 event: 101 stream: 1
[1] Running path 'MC_ReducedIterativeTracking_v16'
[2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/bc88fe327b7ccd90d4bda9e20e6ec926/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-30-2300/src/RecoTracker/PixelSeeding/plugins/CAHitNtupletGeneratorKernels.cu, line 64:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 2 event: 102 stream: 3
[1] Running path 'MC_ReducedIterativeTracking_v16'
[2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/bc88fe327b7ccd90d4bda9e20e6ec926/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-30-2300/src/RecoTracker/PixelSeeding/plugins/CAHitNtupletGeneratorKernels.cu, line 64:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
Note that these workflows are also randomly failing for CMSSW_13_3_X and 13_2_X IBs, so I guess it could be related to some CUDA configuration on the HLT node.
It looks like workflows fail when they are run on the HLT node; all workflows running on HTCondor GPU nodes run fine.
All 5 IBs I listed in https://github.com/cms-sw/cmssw/issues/43866#issuecomment-1927297597 were run on a node that has two T4s. The out-of-the-box behavior of runTheMatrix.py + cmsDriver.py + cmsRun is that each cmsRun process uses both GPUs, which means the EventSetup data products are replicated to both devices.
If each cmsRun process were made to use only one of the GPUs, i.e. if runTheMatrix.py distributed the workflows between the two GPUs, the tests would use less GPU memory.
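A minimal sketch of that idea (the config names and GPU count below are hypothetical placeholders, not the actual runTheMatrix.py machinery): launch each cmsRun with CUDA_VISIBLE_DEVICES restricted to a single device, assigned round-robin.

# Sketch only: distribute jobs round-robin over the available GPUs by
# restricting CUDA_VISIBLE_DEVICES per launched cmsRun process.
# Config file names and the GPU count are hypothetical placeholders.
import os
import subprocess

configs = ["wf1_cfg.py", "wf2_cfg.py", "wf3_cfg.py"]  # hypothetical workflow configs
n_gpus = 2  # e.g. the two T4s on the HLT node

procs = []
for i, cfg in enumerate(configs):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % n_gpus))
    procs.append(subprocess.Popen(["cmsRun", cfg], env=env))

for p in procs:
    p.wait()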
Maybe we should also consider adding the CUDAMonitoringService for these tests, to report the GPU memory usage in a similar fashion as is done with SimpleMemoryCheck for CPU memory? At this stage I'd also consider extending the caching allocators to record the peak allocated memory throughout the job.
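Enabling the monitoring could be as simple as a customisation along these lines (a sketch only; I have not checked here whether CUDAMonitoringService needs any parameters, and the helper name is just for illustration):

# Sketch of a customisation enabling CUDAMonitoringService in an existing
# cmsRun configuration; the service's parameters (if any) are not checked.
import FWCore.ParameterSet.Config as cms

def customiseWithCUDAMonitoring(process):
    # add the service so that GPU memory usage gets reported during the job
    process.add_(cms.Service("CUDAMonitoringService"))
    return process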
It looks like workflows fail when they are run on the HLT node; all workflows running on HTCondor GPU nodes run fine.
What GPU is available on the HTCondor GPU nodes (the link gives a permission error for me)? I'd bet they have more memory than the T4s on the HLT node.
The HTCondor GPU nodes have a Tesla V100S-PCIE-32GB:
+ nvidia-smi
Mon Feb 5 03:34:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100S-PCIE-32GB Off | 00000000:06:00.0 Off | 0 |
| N/A 31C P0 25W / 250W | 36MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
GCC version: gcc version 11.4.1 20230605 (Red Hat 11.4.1-2) (GCC)
whereas the HLT node has:
+ nvidia-smi
Mon Feb 5 04:39:23 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:23:00.0 Off | 0 |
| N/A 41C P0 28W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:E2:00.0 Off | 0 |
| N/A 44C P0 29W / 70W | 2MiB / 15360MiB | 7% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.154.05 Thu Dec 28 15:37:48 UTC 2023
My understanding is that the normal mode of operations for the HLT nodes is to bind each cmsRun process to one socket and one GPU. Also, +1 to enabling CUDAMonitoringService.
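For completeness, the binding described above could look roughly like the sketch below (the NUMA node, GPU index, and config name are made up, and the production HLT has its own machinery for this):

# Sketch of the "one socket + one GPU per cmsRun" binding: numactl for the
# CPU/memory side, CUDA_VISIBLE_DEVICES for the GPU side.
# Indices and the config file name are hypothetical.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")      # first GPU only
cmd = ["numactl", "--cpunodebind=0", "--membind=0",   # first CPU socket only
       "cmsRun", "hlt_job_cfg.py"]
subprocess.run(cmd, env=env, check=True)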
I do not see any fix for these in cmsdist or in cmssw, but last night's GPU IB tests ran successfully on the HLT node. @fwyzard, is there any update on HLT node srv-b1b07-xx, or is it just luck?
The node should be just srv-b1b07-18-01, correct?
No, I'm not aware of any updates.
The node should be just srv-b1b07-18-01, correct?
Yes, that is correct.
Was there any further occurrence of this issue? Can it be closed?
I believe the testing infrastructure improvements outlined in https://github.com/cms-sw/cmssw/issues/43866#issuecomment-1927383777 are still to be addressed, so keeping an issue open for them would still be useful (of course those can be moved to another issue if desired).