[GPU] Multiple RelVals failing with memory allocation error

Hello,

There are multiple RelVals failing with the following exception in GPU IBs:

----- Begin Fatal Exception 05-Feb-2024 04:19:50 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 366727 lumi: 89 event: 131642946 stream: 3
   [1] Running path 'MC_Run3_PFScoutingPixelTracking_v22'
   [2] Calling method for module HBHERecHitProducerGPU/'hltHbherecoGPU'
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/5569e690981e3c5d49d7743adaadedca/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-02-04-2300/src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 489:
cudaCheck(error = cudaMalloc(&search_key.d_ptr, search_key.bytes));
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------

It seems to be caused by the modifications in https://github.com/cms-sw/cmssw/pull/43804.

FYI, @iarspider

Thanks, Andrea

aandvalenzuela avatar Feb 05 '24 11:02 aandvalenzuela

cms-bot internal usage

cmsbuild avatar Feb 05 '24 11:02 cmsbuild

A new Issue was created by @aandvalenzuela Andrea Valenzuela.

@Dr15Jones, @sextonkennedy, @rappoccio, @makortel, @antoniovilela, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Feb 05 '24 11:02 cmsbuild

I do not see any mechanism by which #43804 would interfere with RelVals...

VinInn avatar Feb 05 '24 12:02 VinInn

I do not see any mechanism by which https://github.com/cms-sw/cmssw/pull/43804 would interfere with RelVals...

I think these problems started to appear in CMSSW_14_0_X_2024-01-30-2300. And in that IB there were a few updates related to the HLT menu:

  • #43758 from @mmasciov: Add non-diagonal errors to scouting vertices
  • #43788 from @cms-tsg-storm: HLT menu development for 13_3_X (1/N) [14_0_X]
  • #43294 from @PixelTracksAlpaka: Pixel Alpaka Migration: Configs and Fixes [VII]

There are also other exceptions observed, e.g.:

----- Begin Fatal Exception 31-Jan-2024 04:02:11 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 2 event: 101 stream: 3
   [1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
   [2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/bc88fe327b7ccd90d4bda9e20e6ec926/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-30-2300/src/RecoTracker/PixelSeeding/plugins/CAHitNtupletGeneratorKernels.cu, line 64:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------

but they all seem to be related to cudaErrorMemoryAllocation: out of memory.

mmusich avatar Feb 05 '24 14:02 mmusich

assign heterogeneous, hlt

makortel avatar Feb 05 '24 15:02 makortel

New categories assigned: heterogeneous,hlt

@Martin-Grunewald, @mmusich, @fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks

cmsbuild avatar Feb 05 '24 15:02 cmsbuild

I think these problems started to appear in CMSSW_14_0_X_2024-01-30-2300.

Indeed

  • CMSSW_14_0_X_2024-01-29-2300 showed no failures
  • CMSSW_14_0_X_2024-01-30-2300 showed 6 failures
  • CMSSW_14_0_X_2024-01-31-2300 showed no failures
  • CMSSW_14_0_X_2024-02-02-2300 showed 5 failures
  • CMSSW_14_0_X_2024-02-04-2300 showed 10 failures

(although it is difficult to say what happened before 01-29)

makortel avatar Feb 05 '24 15:02 makortel

type tracking (even though the association is not strong; it does look related to the pixel tracking Alpaka migration)

slava77 avatar Feb 05 '24 15:02 slava77

@AdrianoDee @fwyzard Do I understand correctly that the workflows added in https://github.com/cms-sw/cmssw/pull/43294 (.402, .403, .404) are not run (yet?) in GPU IBs?

Did https://github.com/cms-sw/cmssw/pull/43788 change anything wrt. GPU modules (like more instances)? (On a cursory look I didn't catch anything, but I could have easily missed something subtle.)

makortel avatar Feb 05 '24 15:02 makortel

Did https://github.com/cms-sw/cmssw/pull/43788 change anything wrt. GPU modules (like more instances)?

It should not have.

mmusich avatar Feb 05 '24 15:02 mmusich

@AdrianoDee @fwyzard Do I understand correctly that the workflows added in https://github.com/cms-sw/cmssw/pull/43294 (.402, .403, .404) are not run (yet?) in GPU IBs?

Yes, we haven't added them to relvals_gpu yet.

AdrianoDee avatar Feb 05 '24 16:02 AdrianoDee

From the Opensearch history I see that workflow 12434.512 step2 first failed with StdException [a] for CMSSW_14_0_GPU_X_2024-01-17-2300. The job was running on one of the HLT nodes provided by @fwyzard. This was then fixed by @fwyzard by rebooting the node.

For the CMSSW_14_0_GPU_X_2024-02-02-2300 IB this workflow failed again with the error cudaErrorMemoryAllocation: out of memory [b]. The changes between this release and the previous GPU IB are listed at https://github.com/cms-sw/cmssw/compare/CMSSW_14_0_GPU_X_2024-01-31-2300...CMSSW_14_0_GPU_X_2024-02-02-2300

[a]

An exception of category 'StdException' occurred while
[0] Constructing the EventProcessor
[1] Constructing service of type AlpakaServiceCudaAsync
[2] Constructing service of type CUDAService
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/5f6160ccac866104fd4106c72252358d/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-17-2300/src/HeterogeneousCore/CUDAServices/plugins/CUDAService.cc, line 193:
nvmlCheck(nvmlInitWithFlags(NVML_INIT_FLAG_NO_GPUS | NVML_INIT_FLAG_NO_ATTACH));
NVML Error 18: Driver/library version mismatch

[b]

An exception of category 'StdException' occurred while
[0] Processing  Event run: 1 lumi: 2 event: 101 stream: 1
[1] Running path 'MC_ReducedIterativeTracking_v16'
[2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/bc88fe327b7ccd90d4bda9e20e6ec926/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-30-2300/src/RecoTracker/PixelSeeding/plugins/CAHitNtupletGeneratorKernels.cu, line 64:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
An exception of category 'StdException' occurred while
[0] Processing  Event run: 1 lumi: 2 event: 102 stream: 3
[1] Running path 'MC_ReducedIterativeTracking_v16'
[2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/bc88fe327b7ccd90d4bda9e20e6ec926/opt/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_GPU_X_2024-01-30-2300/src/RecoTracker/PixelSeeding/plugins/CAHitNtupletGeneratorKernels.cu, line 64:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory

smuzaffar avatar Feb 05 '24 16:02 smuzaffar

Note that these workflows are also randomly failing in the CMSSW_13_3_X and 13_2_X IBs, so I guess it could be related to some CUDA configuration on the HLT node.

smuzaffar avatar Feb 05 '24 16:02 smuzaffar

Looks like workflows fail when they are run on the HLT node. All workflows running on HTCondor GPU nodes are running fine.

smuzaffar avatar Feb 05 '24 16:02 smuzaffar

All 5 IBs I listed in https://github.com/cms-sw/cmssw/issues/43866#issuecomment-1927297597 were run on a node that had 2 T4s. Out of the box, the behavior of runTheMatrix.py + cmsDriver.py + cmsRun is for each cmsRun process to use both GPUs, which means the EventSetup data products are replicated to both devices.

If each cmsRun process were made to use only one of the GPUs, i.e. if runTheMatrix.py distributed the workflows across the two GPUs, the tests would use less GPU memory.
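
As a minimal sketch of what such per-process GPU assignment could look like, assuming a plain Python launcher around cmsRun (the wrapper, function name, and config file names below are purely illustrative and not part of runTheMatrix.py):

import os
import subprocess

def launch_pinned(config, workflow_index, n_gpus=2):
    """Launch cmsRun for one workflow, exposing only a single GPU to it."""
    env = os.environ.copy()
    # Round-robin the workflows over the available devices, so each process
    # allocates memory (and replicates EventSetup products) on one GPU only.
    env["CUDA_VISIBLE_DEVICES"] = str(workflow_index % n_gpus)
    return subprocess.Popen(["cmsRun", config], env=env)

# Example: two workflows end up on GPU 0 and GPU 1 respectively.
procs = [launch_pinned(cfg, i) for i, cfg in enumerate(["wf1_cfg.py", "wf2_cfg.py"])]
for p in procs:
    p.wait()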

Maybe we should also consider adding CUDAMonitoringService for these tests to report the GPU memory usage, in a similar fashion to what is done with SimpleMemoryCheck for CPU memory? At this stage I'd also consider extending the caching allocators to record the peak allocated memory throughout the job.
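
For reference, a minimal sketch of what enabling this in a cmsRun configuration might look like (the process name is a placeholder and the exact parameters accepted by CUDAMonitoringService may differ):

import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")  # placeholder process, for illustration only

# CPU memory reporting, as already used in the tests:
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck")

# GPU memory reporting, analogous to SimpleMemoryCheck:
process.CUDAMonitoringService = cms.Service("CUDAMonitoringService")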

makortel avatar Feb 05 '24 16:02 makortel

Looks like workflows fail when they are run on the HLT node. All workflows running on HTCondor GPU nodes are running fine.

What GPU is available on the HTCondor GPU nodes (the link gives a permission error for me)? I'd bet they have more memory than the T4 on the HLT node.

makortel avatar Feb 05 '24 18:02 makortel

The HTCondor GPU nodes have a Tesla V100S-PCIE-32GB:

+ nvidia-smi
Mon Feb  5 03:34:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100S-PCIE-32GB          Off | 00000000:06:00.0 Off |                    0 |
| N/A   31C    P0              25W / 250W |     36MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.129.03  Thu Oct 19 18:56:32 UTC 2023
GCC version:  gcc version 11.4.1 20230605 (Red Hat 11.4.1-2) (GCC)

whereas the HLT node has:

+ nvidia-smi
Mon Feb  5 04:39:23 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:23:00.0 Off |                    0 |
| N/A   41C    P0              28W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:E2:00.0 Off |                    0 |
| N/A   44C    P0              29W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.154.05  Thu Dec 28 15:37:48 UTC 2023

smuzaffar avatar Feb 05 '24 21:02 smuzaffar

My understanding is that the normal mode of operation for the HLT nodes is to bind each cmsRun process to one socket and one GPU. Also, +1 to enabling CUDAMonitoringService.
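
As an illustration only, a rough Python sketch of that kind of binding; the socket-to-core mapping below is an assumption and would have to be adapted to the actual node topology:

import os
import subprocess

SOCKET_CPUS = {0: range(0, 32), 1: range(32, 64)}  # assumed topology, adjust per node

def run_bound(config, socket_id, gpu_id):
    """Start cmsRun pinned to one CPU socket and exposed to a single GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # expose only one GPU to the process

    def bind():  # executed in the child process just before cmsRun starts
        os.sched_setaffinity(0, SOCKET_CPUS[socket_id])

    return subprocess.Popen(["cmsRun", config], env=env, preexec_fn=bind)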

dan131riley avatar Feb 06 '24 12:02 dan131riley

I do not see any fix for these in cmsdist or in cmssw, but last night's GPU IB tests ran successfully on the HLT node. @fwyzard, is there any update on the HLT node srv-b1b07-xx, or is it just luck?

smuzaffar avatar Feb 12 '24 08:02 smuzaffar

The node should be just srv-b1b07-18-01, correct? No, I'm not aware of any updates.

fwyzard avatar Feb 12 '24 09:02 fwyzard

The node should be just srv-b1b07-18-01, correct ?

Yes, that is correct.

smuzaffar avatar Feb 12 '24 09:02 smuzaffar

Was there any further occurrence of this issue? Can it be closed?

mmusich avatar May 17 '24 06:05 mmusich

I believe the testing infrastructure improvements outlined in https://github.com/cms-sw/cmssw/issues/43866#issuecomment-1927383777 are still to be addressed, so keeping an issue open for them would still be useful (but of course those can be moved to another issue if desired).

makortel avatar May 17 '24 18:05 makortel