Limit GPU memory usage?
Hello,
When running on a GPU that may be shared with other work, I sometimes see out-of-memory errors:
CUDA Error of GINTint2e_jk_kernel: out of memory
Is it possible to specify a hard limit on the amount of memory used by these kernels?
GPU memory is mostly allocated via CuPy. If the GPU needs to stay available for other work, you can set a memory limit through CuPy: https://docs.cupy.dev/en/stable/user_guide/memory.html#limiting-gpu-memory-usage
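As a minimal sketch of the CuPy-side limit described in the linked guide, the cap can be set either through the CUPY_GPU_MEMORY_LIMIT environment variable or on the default memory pool. The 16 GiB figure and the gib_to_bytes helper are illustrative assumptions, and the limit only covers memory that goes through CuPy's pool.

```python
import os


def gib_to_bytes(gib: float) -> int:
    """Convert GiB to bytes (hypothetical helper for this example)."""
    return int(gib * 1024**3)


LIMIT_BYTES = gib_to_bytes(16)  # assumed cap; pick a value that fits your GPU

# Option 1: environment variable. Set it before CuPy is imported to be safe.
os.environ["CUPY_GPU_MEMORY_LIMIT"] = str(LIMIT_BYTES)

# Option 2: set the limit on CuPy's default memory pool directly.
try:
    import cupy

    cupy.get_default_memory_pool().set_limit(size=LIMIT_BYTES)
except Exception:
    # CuPy is not installed or no GPU is visible; nothing to limit here.
    pass
```

Allocations through the pool that would exceed the cap then fail inside CuPy instead of taking memory from other processes on the GPU.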
Although the GINT* kernels do not allocate global memory explicitly, they use a lot of local memory for high angular momenta, and that local memory is ultimately backed by global memory. So for high angular momenta, you will probably still hit the 'out of memory' error.
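A rough back-of-the-envelope sketch of why per-thread local memory adds up. It assumes the driver reserves local memory for every thread that could be resident on the device at once, and the 8 KiB per-thread figure is hypothetical (the real value depends on the kernel and the angular momentum). The V100 figures used are 80 SMs with up to 2048 resident threads each.

```python
def local_memory_reservation(local_bytes_per_thread: int,
                             num_sms: int,
                             max_threads_per_sm: int) -> int:
    """Estimate the global memory backing per-thread local memory.

    Assumes the reservation covers every thread that can be resident
    on the device simultaneously.
    """
    return local_bytes_per_thread * num_sms * max_threads_per_sm


# Hypothetical kernel needing 8 KiB of local memory per thread on a V100.
reserved = local_memory_reservation(8 * 1024, num_sms=80, max_threads_per_sm=2048)
print(f"{reserved / 1024**3:.2f} GiB")  # prints "1.25 GiB"
```

Under these assumptions the reservation is independent of the CuPy pool limit, which is consistent with the error still appearing after the pool is capped.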
thank you for the explanation
Hello, I am reopening this issue.
I have found that if I turn on CUDA MPS and limit the number of active threads with this setting:
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
a calculation that usually fails with "CUDA Error of GINTint2e_jk_kernel: out of memory" succeeds (taking only 1.5x longer, not 2x longer).
My understanding is that this reduces the local/shared memory in use at once, stopping the errors, at the expense of runtime.
Is it possible to do a similar modification at runtime, or compile time, in the code?
Maybe by changing these values? https://github.com/pyscf/gpu4pyscf/blob/6474b413259a37dde1e37f7ab86dee76036698ea/gpu4pyscf/lib/gint/gint.h#L77-L81
This is a good suggestion. If some threads are turned off, no local memory needs to be allocated for them. We can consider this as one of the possible solutions.
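Until something like this exists in the code, the MPS workaround above can be scripted. The sketch below launches a job in a subprocess with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE set. It assumes an MPS control daemon is already running on the node; the helper names and the run.py script name are hypothetical.

```python
import os
import subprocess
import sys


def mps_env(thread_percentage: int, base_env=None) -> dict:
    """Return a copy of the environment with the MPS thread limit set."""
    if not 0 < thread_percentage <= 100:
        raise ValueError("thread_percentage must be in (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_percentage)
    return env


def run_with_mps_limit(script: str, thread_percentage: int = 50) -> int:
    """Run a Python script under the given MPS thread limit; return its exit code."""
    result = subprocess.run([sys.executable, script],
                            env=mps_env(thread_percentage),
                            check=False)
    return result.returncode


# Example (hypothetical script name):
# exit_code = run_with_mps_limit("run.py", thread_percentage=50)
```

Setting the variable per subprocess keeps the limit local to the one job instead of applying it to the whole shell session.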
Hi,
First of all, I’m absolutely blown away by the performance of GPU4PySCF—thank you for this amazing tool!
I have a beginner question regarding an issue I encountered. I’m running a torsional scan similar to the provided example, and it generally works well for several iterations. However, at some point, I get the following error:
CUDA Error of GINTint2e_jk_kernel: out of memory
This happens on our cluster with an A100 40GB GPU. Since my molecule isn’t very large (24 atoms) and it runs fine for multiple iterations before failing, I’m a bit confused. Is there a way to free up memory between iterations to prevent this issue?
Full code:
import time
import pyscf
from pyscf import lib
from pyscf.geomopt.geometric_solver import optimize
from gpu4pyscf.dft import rks
atom = '''
C 0.724002 1.135021 -0.907355
O -0.356123 0.965447 -0.024473
C -0.744599 -0.386333 0.152087
C 0.396187 -1.157444 0.792032
O 0.011790 -2.507030 0.899684
C 1.644622 -1.020028 -0.053519
C 1.948759 0.441464 -0.321131
N 3.069963 0.492310 -1.261050
O 2.695457 -1.654767 0.636744
C -1.987816 -0.375699 1.005567
O -3.055730 0.286385 0.366128
O 0.929643 2.485897 -1.104082
H -0.977695 -0.828245 -0.823532
H 0.596763 -0.736811 1.783422
H 1.468667 -1.522896 -1.009604
H 2.212953 0.934306 0.618699
H 0.481425 0.707737 -1.884177
H 1.156487 2.903388 -0.265770
H 3.435645 -1.785156 0.038141
H 0.756639 -3.006876 1.245376
H -1.757549 0.093793 1.965767
H -2.306214 -1.397909 1.189633
H -2.790237 1.193944 0.194295
N 3.757659 1.504438 -1.211762
N 4.455331 2.386355 -1.246447
'''
xc = 'B3LYP'
bas = '6-311++G(2d,2p)'
scf_tol = 1e-10
max_scf_cycles = 200
screen_tol = 1e-14
grids_level = 3
mol = pyscf.M(atom=atom, basis=bas, max_memory=120000)
mol.verbose = 1
mf_GPU = rks.RKS(mol, xc=xc).density_fit()
mf_GPU.grids.level = grids_level
mf_GPU.conv_tol = scf_tol
mf_GPU.max_cycle = max_scf_cycles
mf_GPU.screen_tol = screen_tol
gradients = []
start_time = time.time()
# Content of geometric_scan.txt:
# $scan
# dihedral 1 7 8 24 90 -240 20
mol_eq = optimize(
    mf_GPU,
    maxsteps=500000000,
    constraints='geometric_scan.txt',  # atom index is 1-based in this file
)
print("Optimized coordinate:")
print(mol_eq.atom_coords())
print(time.time() - start_time)
This shows the memory usage of a run:
@Tillsten Thank you for the feedback!
The geometry optimization converged in 10 iterations on my side and took about 80 seconds on a V100-32GB. I used the constraints that are commented in your script, assuming you used the same ones.
Most GPU memory is released between optimization iterations. As shown in the above figure, the GPU memory usage is almost constant over the first few iterations, but it blew up at 14:13:30. This is probably due to a failure of the optimization. Can you share the geomeTRIC log?
> === End Optimization Info ===
/usr/local/lib/python3.9/dist-packages/pyscf/dft/libxc.py:512: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
Step 0 : Gradient = 5.172e-03/1.051e-02 (rms/max) Energy = -775.8066209104
Hessian Eigenvalues: 2.30000e-02 2.30000e-02 2.30000e-02 ... 5.52941e-01 9.33540e-01 1.53696e+00
Step 1 : Displace = 3.914e-02/1.087e-01 (rms/max) Trust = 1.000e-01 (=) Grad = 1.998e-03/3.784e-03 (rms/max) E (change) = -775.8082899754 (-1.669e-03) Quality = 0.908
Hessian Eigenvalues: 2.12970e-02 2.30000e-02 2.30000e-02 ... 5.52878e-01 9.31823e-01 1.52869e+00
Step 2 : Displace = 1.072e-02/2.532e-02 (rms/max) Trust = 1.414e-01 (+) Grad = 9.165e-04/1.778e-03 (rms/max) E (change) = -775.8085107051 (-2.207e-04) Quality = 1.433
Hessian Eigenvalues: 1.05437e-02 2.30000e-02 2.30000e-02 ... 5.53047e-01 9.37131e-01 1.54934e+00
Step 3 : Displace = 1.745e-02/4.571e-02 (rms/max) Trust = 2.000e-01 (+) Grad = 9.438e-04/2.310e-03 (rms/max) E (change) = -775.8086666702 (-1.560e-04) Quality = 1.273
Hessian Eigenvalues: 5.85061e-03 2.29966e-02 2.30000e-02 ... 5.53238e-01 9.35494e-01 1.55559e+00
Step 4 : Displace = 1.261e-02/3.691e-02 (rms/max) Trust = 2.828e-01 (+) Grad = 8.230e-04/1.613e-03 (rms/max) E (change) = -775.8087314393 (-6.477e-05) Quality = 1.324
Hessian Eigenvalues: 4.13613e-03 2.29893e-02 2.30000e-02 ... 5.53154e-01 9.37752e-01 1.53192e+00
Step 5 : Displace = 7.848e-03/2.525e-02 (rms/max) Trust = 3.000e-01 (+) Grad = 4.205e-04/9.524e-04 (rms/max) E (change) = -775.8087625458 (-3.111e-05) Quality = 1.283
Hessian Eigenvalues: 3.88309e-03 2.22879e-02 2.30000e-02 ... 5.53159e-01 9.39416e-01 1.54060e+00
Step 6 : Displace = 3.049e-03/8.701e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 2.010e-04/5.087e-04 (rms/max) E (change) = -775.8087720689 (-9.523e-06) Quality = 1.448
Hessian Eigenvalues: 3.88031e-03 1.49010e-02 2.29995e-02 ... 5.53376e-01 9.34101e-01 1.55097e+00
Step 7 : Displace = 2.711e-03/5.102e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 1.335e-04/2.866e-04 (rms/max) E (change) = -775.8087761481 (-4.079e-06) Quality = 1.594
Hessian Eigenvalues: 3.83858e-03 8.80474e-03 2.29987e-02 ... 5.53288e-01 9.35053e-01 1.53620e+00
Step 8 : Displace = 2.335e-03/5.234e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 1.060e-04/2.560e-04 (rms/max) E (change) = -775.8087779904 (-1.842e-06) Quality = 1.651
Hessian Eigenvalues: 3.72867e-03 6.39808e-03 2.29962e-02 ... 5.53298e-01 9.40494e-01 1.54148e+00
Step 9 : Displace = 1.578e-03/3.654e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 6.898e-05/1.693e-04 (rms/max) E (change) = -775.8087786849 (-6.945e-07) Quality = 1.355
Hessian Eigenvalues: 3.61099e-03 5.70565e-03 2.19036e-02 ... 5.53399e-01 9.36808e-01 1.55393e+00
Step 10 : Displace = 7.862e-04/1.669e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 3.845e-05/9.810e-05 (rms/max) E (change) = -775.8087787514 (-6.648e-08) Quality = 0.291
Hessian Eigenvalues: 3.61099e-03 5.70565e-03 2.19036e-02 ... 5.53399e-01 9.36808e-01 1.55393e+00
Converged! =D
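On the earlier question about freeing memory between iterations, here is a minimal sketch that returns unused blocks from CuPy's memory pools to the driver. It assumes the PySCF geomeTRIC wrapper accepts a callback keyword that is called once per step with a dict of local variables; the function name is hypothetical. It only releases blocks that are no longer in use, so it does not shrink the local memory that the kernels themselves reserve.

```python
def release_gpu_memory(envs=None):
    """Free unused CuPy pool blocks; return pool usage before freeing.

    Returns None when CuPy is not installed or no GPU is visible.
    The envs argument is ignored; it is accepted so the function can
    be used as a per-step callback.
    """
    try:
        import cupy

        pool = cupy.get_default_memory_pool()
        used, held = pool.used_bytes(), pool.total_bytes()
        pool.free_all_blocks()  # only frees blocks that are not in use
        cupy.get_default_pinned_memory_pool().free_all_blocks()
    except Exception:
        return None
    return {"used_bytes": used, "held_bytes": held}


# Assumed usage with the script above (callback keyword is an assumption):
# mol_eq = optimize(mf_GPU, constraints='geometric_scan.txt',
#                   callback=release_gpu_memory)
```

Logging the returned numbers per step also gives a cheap way to see at which geometry the usage starts to grow.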
I attached a log from a run: slurm-950146.log
@Tillsten We made a major improvement to the density fitting modules. The OOM issue should be largely resolved in the latest release (v1.4.0). Here is the memory usage of the above example. Enjoy!
Hi. I am trying to do a frequency calculation for this palladium complex using DFT, but I am getting a CUDA out-of-memory error. Can the developers suggest something here? Thanks.
Input file run.py
machine:
ubuntu@pka-tj:~$ nvidia-smi
Wed Oct 29 02:09:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   32C    P0             27W /  250W |       1MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
coordinates:
Pd -0.196540 1.139105 0.114712
N 0.802749 2.793938 0.876151
P -1.067789 -0.913774 -0.605682
C 0.193026 -2.022554 -1.504026
C 0.488060 -1.454047 -2.904804
C -2.440858 -1.137185 -1.910818
C -1.902944 -0.625169 -3.261951
H -2.704557 -0.685221 -4.004814
H -1.569202 0.414044 -3.210827
C -1.552196 -1.919464 0.883363
C -1.925595 -3.283444 0.770532
H -0.555720 -0.992599 -4.721353
H -0.830116 -2.483253 -4.148174
H 0.713403 -4.241109 -0.047574
H 2.487655 -0.521926 -1.228064
H 2.802224 -4.862000 1.129414
C -2.172423 -4.056051 1.899720
H -2.434797 -5.102667 1.811834
C -2.094725 -3.479631 3.167344
H -2.307960 -4.089937 4.035986
C -1.773675 -2.133316 3.301512
O -1.765027 -1.480869 4.502410
H 1.973517 -0.409255 4.110858
H -3.534777 -0.152988 1.117625
H -3.968582 1.609963 1.445179
H 0.734487 -1.582184 3.091476
C -1.469515 -1.338105 2.164681
C -1.098673 0.094705 2.455584
C 0.036537 0.378354 3.293487
N 0.969914 -0.596246 3.623194
H -3.875307 0.999143 -2.975504
C -5.958780 -1.049207 -0.448548
H -7.594597 0.261451 -0.973263
H -6.224940 1.578180 -2.583608
C -4.622035 -1.373425 -0.667696
C -4.441883 0.420933 -2.255972
C -5.782975 0.748413 -2.039501
C -3.838016 -0.648266 -1.579715
C 0.183259 1.672619 3.825405
H 1.075381 1.927878 4.380708
C -0.758702 2.656932 3.581091
C -1.904509 2.385143 2.842524
H -2.653061 3.153954 2.701437
C -2.099442 1.109383 2.305928
N -3.331393 0.768424 1.667027
C 1.559225 -3.559298 -0.048607
H 6.023291 2.010518 0.872604
C 1.450617 -2.335599 -0.726362
C 2.552383 -1.470112 -0.714600
C 3.735941 -1.816653 -0.062691
C 3.831808 -3.040341 0.602375
H -6.544755 -1.636079 0.253447
H -4.181361 -2.211471 -0.139805
C 0.357804 1.929871 -1.683092
C -0.553952 2.676731 -2.454585
C -1.950120 2.966663 -1.961799
C -0.129152 3.208019 -3.682859
H -0.839227 3.774150 -4.282212
C 1.177651 3.040996 -4.139222
C 2.092418 2.353342 -3.344685
H 3.126431 2.242070 -3.659581
C 1.680577 1.814361 -2.122922
H 0.515464 3.637939 0.389674
H 1.479640 3.459645 -5.094843
C 4.970155 2.226777 0.794070
C 4.128545 1.320211 1.448953
C 2.765824 1.544011 1.482897
C 2.174597 2.675868 0.863563
C 3.045678 3.621237 0.240912
C 4.416767 3.370769 0.211268
H -2.141573 -3.849112 -0.223590
H 4.545238 0.434121 1.917932
C 2.472930 4.837882 -0.440679
H 5.071890 4.069838 -0.299553
C 2.736521 -3.908251 0.613015
H 4.575519 -1.128198 -0.077253
H 4.752737 -3.316787 1.107624
H -0.600945 3.653017 3.983858
H 1.251230 -2.078801 -3.380736
H 0.877081 -0.435547 -2.866640
C -0.743591 -1.460649 -3.802012
H -2.588344 3.368921 -2.755498
H 2.421530 1.342031 -1.488414
H -2.423718 2.070694 -1.557253
H -1.925653 3.702451 -1.148286
H 1.910220 5.470639 0.258382
H 3.263180 5.453621 -0.878222
H 1.787990 4.549315 -1.247994
H 2.116295 0.835483 1.973425
H -2.517919 -2.222514 -1.992871
H -0.339813 -2.961540 -1.652061
C -6.550164 0.013674 -1.137426
H -1.734850 -2.262013 5.261837
Can someone please suggest something here? Thanks in advance.
Is cuTENSOR installed on your system? Without cuTENSOR, the density fitting code can sometimes run out of memory.
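One quick way to check is to try importing CuPy's cuTENSOR backend. This is a sketch: the module path cupy_backends.cuda.libs.cutensor is an assumption about how CuPy exposes the binding and may differ between CuPy versions. Running cupy.show_config() and looking for a cuTENSOR version line is an alternative.

```python
import importlib


def cutensor_available() -> bool:
    """Return True if CuPy's cuTENSOR binding can be imported.

    The module path below is an assumption; adjust it if your CuPy
    version organizes its backends differently.
    """
    try:
        importlib.import_module("cupy_backends.cuda.libs.cutensor")
    except Exception:
        return False
    return True


print("cuTENSOR available:", cutensor_available())
```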
For molecules of this size, you can just use the normal rks.RKS(mol) and its Hessian implementation, without density fitting. That implementation does not have the memory usage issue.
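A minimal sketch of that suggestion, assuming the usual GPU4PySCF pattern of mf.Hessian().kernel() followed by PySCF's thermo.harmonic_analysis. The function name is hypothetical, and the functional, basis set and any ECP for Pd are left as arguments because the original input file is not shown.

```python
def hessian_without_density_fitting(atom, basis, xc, **mol_kwargs):
    """Run RKS and its Hessian on the GPU without density fitting.

    atom, basis, and xc are supplied by the caller; extra keyword
    arguments (for example ecp=... or charge=...) go to pyscf.M.
    Returns the dict produced by thermo.harmonic_analysis.
    """
    # Imports are kept inside the function so the module loads
    # even on machines without a GPU stack installed.
    import pyscf
    from gpu4pyscf.dft import rks
    from pyscf.hessian import thermo

    mol = pyscf.M(atom=atom, basis=basis, **mol_kwargs)
    mf = rks.RKS(mol, xc=xc)  # note: no .density_fit() here
    mf.kernel()
    hess = mf.Hessian().kernel()
    return thermo.harmonic_analysis(mol, hess)


# Example call (basis, ECP and functional are placeholders, not from the thread):
# results = hessian_without_density_fitting(coords, "def2-svp", "b3lyp", ecp="def2-svp")
```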