
Limit GPU memory usage?

Open sef43 opened this issue 1 year ago • 12 comments

Hello,

When running on a GPU that might be doing something else I am sometimes seeing out of memory errors: CUDA Error of GINTint2e_jk_kernel: out of memory

Is it possible to specify a hard limit on the amount of memory used by these kernels?

sef43 avatar Jan 09 '24 13:01 sef43

GPU memory is mostly allocated via CuPy. You can set a memory limit through CuPy if you want the GPU to remain available for other work: https://docs.cupy.dev/en/stable/user_guide/memory.html#limiting-gpu-memory-usage
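For reference, a minimal sketch of the CuPy memory cap described in the linked docs (the 4 GiB figure is just an example value):

```python
import os

# Cap CuPy's memory pool at 4 GiB. The variable must be set before CuPy is
# first imported; the value is in bytes, or a string like "50%" of total memory.
os.environ["CUPY_GPU_MEMORY_LIMIT"] = str(4 * 1024**3)

# With CuPy importable and a GPU present, the same limit can be set in code:
# import cupy as cp
# cp.get_default_memory_pool().set_limit(size=4 * 1024**3)
```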

Although the GINT* kernels do not allocate global memory explicitly, they allocate a lot of local memory for high angular momenta. That local memory is ultimately backed by global memory, so for high angular momenta you will probably still hit the 'out of memory' issue.

wxj6000 avatar Jan 09 '24 17:01 wxj6000

thank you for the explanation

sef43 avatar Jan 09 '24 20:01 sef43

Hello, I am reopening this issue.

I have found that if I enable CUDA MPS and limit the number of active threads with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50, a calculation that usually fails with CUDA Error of GINTint2e_jk_kernel: out of memory will succeed (taking only about 1.5x longer, not 2x).

My understanding is that this reduces the amount of local/shared memory in use at once, preventing the errors at the expense of runtime.

Is it possible to make a similar modification in the code, either at runtime or at compile time?

Maybe these values? https://github.com/pyscf/gpu4pyscf/blob/6474b413259a37dde1e37f7ab86dee76036698ea/gpu4pyscf/lib/gint/gint.h#L77-L81
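For anyone else trying this, my job wrapper looks roughly like the following (the script name is a placeholder, and the MPS control daemon must already be running):

```shell
# Rough job wrapper. Assumes the MPS daemon was started beforehand with
# `nvidia-cuda-mps-control -d`; my_scan.py stands in for the real script.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
echo "active thread cap: ${CUDA_MPS_ACTIVE_THREAD_PERCENTAGE}%"
# python my_scan.py
```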

sef43 avatar Oct 02 '24 14:10 sef43

This is a good suggestion. If some threads are turned off, there is no need to allocate local memory for them. We can consider it as one possible solution.

wxj6000 avatar Oct 03 '24 00:10 wxj6000

Hi,

First of all, I’m absolutely blown away by the performance of GPU4PySCF—thank you for this amazing tool!

I have a beginner question regarding an issue I encountered. I’m running a torsional scan similar to the provided example, and it generally works well for several iterations. However, at some point, I get the following error:

CUDA Error of GINTint2e_jk_kernel: out of memory

This happens on our cluster with an A100 40GB GPU. Since my molecule isn’t very large (24 atoms) and it runs fine for multiple iterations before failing, I’m a bit confused. Is there a way to free up memory between iterations to prevent this issue?
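In the meantime I am experimenting with manually releasing CuPy's cached memory between scan points; a best-effort sketch (it assumes CuPy is the allocator, and silently does nothing if CuPy or a GPU is unavailable):

```python
def free_gpu_cache():
    """Best-effort release of CuPy's cached device/pinned memory between iterations."""
    try:
        import cupy as cp
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
    except Exception:
        pass  # no CuPy / no GPU in this process: nothing to free
```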

Full code:

import time

import pyscf
from pyscf import lib
from pyscf.geomopt.geometric_solver import optimize

from gpu4pyscf.dft import rks

atom = '''
  C       0.724002      1.135021     -0.907355
  O      -0.356123      0.965447     -0.024473
  C      -0.744599     -0.386333      0.152087
  C       0.396187     -1.157444      0.792032
  O       0.011790     -2.507030      0.899684
  C       1.644622     -1.020028     -0.053519
  C       1.948759      0.441464     -0.321131
  N       3.069963      0.492310     -1.261050
  O       2.695457     -1.654767      0.636744
  C      -1.987816     -0.375699      1.005567
  O      -3.055730      0.286385      0.366128
  O       0.929643      2.485897     -1.104082
  H      -0.977695     -0.828245     -0.823532
  H       0.596763     -0.736811      1.783422
  H       1.468667     -1.522896     -1.009604
  H       2.212953      0.934306      0.618699
  H       0.481425      0.707737     -1.884177
  H       1.156487      2.903388     -0.265770
  H       3.435645     -1.785156      0.038141
  H       0.756639     -3.006876      1.245376
  H      -1.757549      0.093793      1.965767
  H      -2.306214     -1.397909      1.189633
  H      -2.790237      1.193944      0.194295
  N       3.757659      1.504438     -1.211762
  N       4.455331      2.386355     -1.246447
'''

xc = 'B3LYP'
bas = '6-311++G(2d,2p)'

scf_tol = 1e-10
max_scf_cycles = 200
screen_tol = 1e-14
grids_level = 3
mol = pyscf.M(atom=atom, basis=bas, max_memory=120000)

mol.verbose = 1
mf_GPU = rks.RKS(mol, xc=xc).density_fit()
mf_GPU.grids.level = grids_level

mf_GPU.conv_tol = scf_tol
mf_GPU.max_cycle = max_scf_cycles
mf_GPU.screen_tol = screen_tol

gradients = []

start_time = time.time()
# Content of geometric_scan.txt:
# $scan
# dihedral 1 7 8 24 90 -240 20
mol_eq = optimize(
    mf_GPU,
    maxsteps=500000000,
    constraints='geometric_scan.txt',  # atom index is 1-based in this file
)
print("Optimized coordinate:")
print(mol_eq.atom_coords())
print(time.time() - start_time)

Tillsten avatar Jan 30 '25 13:01 Tillsten

This shows the memory usage of a run: [image attached]

Tillsten avatar Jan 30 '25 14:01 Tillsten

@Tillsten Thank you for the feedback!

The geometry optimization converged in 10 iterations on my side. It took about 80 seconds on a V100-32GB. I was using the constraints commented in your script; I assume you were using the same.

Most GPU memory is released between optimization iterations. As shown in the figure above, GPU memory usage is almost constant over the first few iterations, but it blows up at 14:13:30, probably due to a failure in the optimization. Can you share the GeomeTRIC log?

> === End Optimization Info ===
/usr/local/lib/python3.9/dist-packages/pyscf/dft/libxc.py:512: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
  warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
Step    0 : Gradient = 5.172e-03/1.051e-02 (rms/max) Energy = -775.8066209104
Hessian Eigenvalues: 2.30000e-02 2.30000e-02 2.30000e-02 ... 5.52941e-01 9.33540e-01 1.53696e+00
Step    1 : Displace = 3.914e-02/1.087e-01 (rms/max) Trust = 1.000e-01 (=) Grad = 1.998e-03/3.784e-03 (rms/max) E (change) = -775.8082899754 (-1.669e-03) Quality = 0.908
Hessian Eigenvalues: 2.12970e-02 2.30000e-02 2.30000e-02 ... 5.52878e-01 9.31823e-01 1.52869e+00
Step    2 : Displace = 1.072e-02/2.532e-02 (rms/max) Trust = 1.414e-01 (+) Grad = 9.165e-04/1.778e-03 (rms/max) E (change) = -775.8085107051 (-2.207e-04) Quality = 1.433
Hessian Eigenvalues: 1.05437e-02 2.30000e-02 2.30000e-02 ... 5.53047e-01 9.37131e-01 1.54934e+00
Step    3 : Displace = 1.745e-02/4.571e-02 (rms/max) Trust = 2.000e-01 (+) Grad = 9.438e-04/2.310e-03 (rms/max) E (change) = -775.8086666702 (-1.560e-04) Quality = 1.273
Hessian Eigenvalues: 5.85061e-03 2.29966e-02 2.30000e-02 ... 5.53238e-01 9.35494e-01 1.55559e+00
Step    4 : Displace = 1.261e-02/3.691e-02 (rms/max) Trust = 2.828e-01 (+) Grad = 8.230e-04/1.613e-03 (rms/max) E (change) = -775.8087314393 (-6.477e-05) Quality = 1.324
Hessian Eigenvalues: 4.13613e-03 2.29893e-02 2.30000e-02 ... 5.53154e-01 9.37752e-01 1.53192e+00
Step    5 : Displace = 7.848e-03/2.525e-02 (rms/max) Trust = 3.000e-01 (+) Grad = 4.205e-04/9.524e-04 (rms/max) E (change) = -775.8087625458 (-3.111e-05) Quality = 1.283
Hessian Eigenvalues: 3.88309e-03 2.22879e-02 2.30000e-02 ... 5.53159e-01 9.39416e-01 1.54060e+00
Step    6 : Displace = 3.049e-03/8.701e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 2.010e-04/5.087e-04 (rms/max) E (change) = -775.8087720689 (-9.523e-06) Quality = 1.448
Hessian Eigenvalues: 3.88031e-03 1.49010e-02 2.29995e-02 ... 5.53376e-01 9.34101e-01 1.55097e+00
Step    7 : Displace = 2.711e-03/5.102e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 1.335e-04/2.866e-04 (rms/max) E (change) = -775.8087761481 (-4.079e-06) Quality = 1.594
Hessian Eigenvalues: 3.83858e-03 8.80474e-03 2.29987e-02 ... 5.53288e-01 9.35053e-01 1.53620e+00
Step    8 : Displace = 2.335e-03/5.234e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 1.060e-04/2.560e-04 (rms/max) E (change) = -775.8087779904 (-1.842e-06) Quality = 1.651
Hessian Eigenvalues: 3.72867e-03 6.39808e-03 2.29962e-02 ... 5.53298e-01 9.40494e-01 1.54148e+00
Step    9 : Displace = 1.578e-03/3.654e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 6.898e-05/1.693e-04 (rms/max) E (change) = -775.8087786849 (-6.945e-07) Quality = 1.355
Hessian Eigenvalues: 3.61099e-03 5.70565e-03 2.19036e-02 ... 5.53399e-01 9.36808e-01 1.55393e+00
Step   10 : Displace = 7.862e-04/1.669e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 3.845e-05/9.810e-05 (rms/max) E (change) = -775.8087787514 (-6.648e-08) Quality = 0.291
Hessian Eigenvalues: 3.61099e-03 5.70565e-03 2.19036e-02 ... 5.53399e-01 9.36808e-01 1.55393e+00
Converged! =D

wxj6000 avatar Jan 31 '25 08:01 wxj6000

I attached a log from a run: slurm-950146.log

Tillsten avatar Jan 31 '25 11:01 Tillsten

@Tillsten We made major improvements to the density-fitting modules. The OOM issue should be largely resolved in the latest release (v1.4.0). Here is the memory usage for the example above. Enjoy!

[memory usage plot attached]

wxj6000 avatar Apr 22 '25 02:04 wxj6000

Hi. I am trying to do a frequency calculation for this palladium complex using DFT but am getting a CUDA out-of-memory error. Can the developers suggest something here? Thanks.

Input file run.py

machine:

ubuntu@pka-tj:~$ nvidia-smi
Wed Oct 29 02:09:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   32C    P0             27W /  250W |       1MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

coordinates:


Pd    -0.196540    1.139105    0.114712
N      0.802749    2.793938    0.876151
P     -1.067789   -0.913774   -0.605682
C      0.193026   -2.022554   -1.504026
C      0.488060   -1.454047   -2.904804
C     -2.440858   -1.137185   -1.910818
C     -1.902944   -0.625169   -3.261951
H     -2.704557   -0.685221   -4.004814
H     -1.569202    0.414044   -3.210827
C     -1.552196   -1.919464    0.883363
C     -1.925595   -3.283444    0.770532
H     -0.555720   -0.992599   -4.721353
H     -0.830116   -2.483253   -4.148174
H      0.713403   -4.241109   -0.047574
H      2.487655   -0.521926   -1.228064
H      2.802224   -4.862000    1.129414
C     -2.172423   -4.056051    1.899720
H     -2.434797   -5.102667    1.811834
C     -2.094725   -3.479631    3.167344
H     -2.307960   -4.089937    4.035986
C     -1.773675   -2.133316    3.301512
O     -1.765027   -1.480869    4.502410
H      1.973517   -0.409255    4.110858
H     -3.534777   -0.152988    1.117625
H     -3.968582    1.609963    1.445179
H      0.734487   -1.582184    3.091476
C     -1.469515   -1.338105    2.164681
C     -1.098673    0.094705    2.455584
C      0.036537    0.378354    3.293487
N      0.969914   -0.596246    3.623194
H     -3.875307    0.999143   -2.975504
C     -5.958780   -1.049207   -0.448548
H     -7.594597    0.261451   -0.973263
H     -6.224940    1.578180   -2.583608
C     -4.622035   -1.373425   -0.667696
C     -4.441883    0.420933   -2.255972
C     -5.782975    0.748413   -2.039501
C     -3.838016   -0.648266   -1.579715
C      0.183259    1.672619    3.825405
H      1.075381    1.927878    4.380708
C     -0.758702    2.656932    3.581091
C     -1.904509    2.385143    2.842524
H     -2.653061    3.153954    2.701437
C     -2.099442    1.109383    2.305928
N     -3.331393    0.768424    1.667027
C      1.559225   -3.559298   -0.048607
H      6.023291    2.010518    0.872604
C      1.450617   -2.335599   -0.726362
C      2.552383   -1.470112   -0.714600
C      3.735941   -1.816653   -0.062691
C      3.831808   -3.040341    0.602375
H     -6.544755   -1.636079    0.253447
H     -4.181361   -2.211471   -0.139805
C      0.357804    1.929871   -1.683092
C     -0.553952    2.676731   -2.454585
C     -1.950120    2.966663   -1.961799
C     -0.129152    3.208019   -3.682859
H     -0.839227    3.774150   -4.282212
C      1.177651    3.040996   -4.139222
C      2.092418    2.353342   -3.344685
H      3.126431    2.242070   -3.659581
C      1.680577    1.814361   -2.122922
H      0.515464    3.637939    0.389674
H      1.479640    3.459645   -5.094843
C      4.970155    2.226777    0.794070
C      4.128545    1.320211    1.448953
C      2.765824    1.544011    1.482897
C      2.174597    2.675868    0.863563
C      3.045678    3.621237    0.240912
C      4.416767    3.370769    0.211268
H     -2.141573   -3.849112   -0.223590
H      4.545238    0.434121    1.917932
C      2.472930    4.837882   -0.440679
H      5.071890    4.069838   -0.299553
C      2.736521   -3.908251    0.613015
H      4.575519   -1.128198   -0.077253
H      4.752737   -3.316787    1.107624
H     -0.600945    3.653017    3.983858
H      1.251230   -2.078801   -3.380736
H      0.877081   -0.435547   -2.866640
C     -0.743591   -1.460649   -3.802012
H     -2.588344    3.368921   -2.755498
H      2.421530    1.342031   -1.488414
H     -2.423718    2.070694   -1.557253
H     -1.925653    3.702451   -1.148286
H      1.910220    5.470639    0.258382
H      3.263180    5.453621   -0.878222
H      1.787990    4.549315   -1.247994
H      2.116295    0.835483    1.973425
H     -2.517919   -2.222514   -1.992871
H     -0.339813   -2.961540   -1.652061
C     -6.550164    0.013674   -1.137426
H     -1.734850   -2.262013    5.261837

tejender-acog avatar Oct 28 '25 16:10 tejender-acog

Can someone please suggest something here? Thanks in advance.

tejender-acog avatar Nov 01 '25 08:11 tejender-acog

Is cutensor installed on your system? Without cutensor, the density-fitting code can sometimes cause OOM.

For molecules of this size, you can simply use the normal rks.RKS(mol) and its Hessian implementation without density fitting. That implementation does not have this memory usage issue.
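A sketch of what I mean (the basis set and functional here are placeholders; adjust for your system):

```python
# Plain RKS (no .density_fit()) plus its analytic Hessian; the Hessian is the
# input to a frequency/thermochemistry analysis. Wrapped in a function so the
# GPU-dependent imports only run when it is called.
def run_frequencies(atom, basis='def2-svp', xc='B3LYP'):
    import pyscf
    from gpu4pyscf.dft import rks

    mol = pyscf.M(atom=atom, basis=basis)
    mf = rks.RKS(mol, xc=xc)      # note: no .density_fit()
    mf.kernel()                   # SCF with J/K built directly
    return mf.Hessian().kernel()  # analytic Hessian for the frequency analysis
```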

sunqm avatar Nov 06 '25 18:11 sunqm