DESC
Reduce storage of big arrays in memory during optimization
Reuses the J variable in optimizers to reduce the number of Jacobian-sized matrices held in memory. This can lead to tangible memory savings (for an L=M=N=16 fixed-boundary solve with default grids, J alone is 4 GB), observable after the large memory jump at the first Jacobian calculation.
- Updates docs to give better memory usage insight
- Fixes some benchmark run conditions. This should make the memory profiler run on fork pull requests, and also run the benchmarks when the `run_benchmarks` label is added.
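The `J` reuse amounts to overwriting the existing Jacobian buffer instead of allocating a second Jacobian-sized array. A minimal NumPy sketch of the idea (function names are hypothetical, not DESC's API):

```python
import numpy as np

def scale_jacobian_copy(J, d):
    # Allocates a second Jacobian-sized array: peak memory ~2x size of J.
    return J * d

def scale_jacobian_inplace(J, d):
    # Reuses J's buffer: no extra Jacobian-sized allocation.
    J *= d
    return J

J = np.ones((4, 3))
d = np.arange(3, dtype=float)
out = scale_jacobian_inplace(J, d)
# out is the very same buffer as J, now holding the scaled values
```

For a 4 GB Jacobian, avoiding even one extra copy of this kind is the difference between fitting in memory or not.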
Some VRAM profiling on my laptop GPU with 12 GB memory. The script used:

```python
import os
import sys

sys.path.insert(0, os.path.abspath("."))
sys.path.append(os.path.abspath("../../"))
# Use the platform allocator so reported memory reflects actual usage.
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

from desc import set_device

# must be called before importing desc.backend
set_device("gpu")

from desc.backend import print_backend_info
from desc.examples import get
from desc.objectives import ForceBalance, ObjectiveFunction

print_backend_info()

N = 14
eq = get("precise_QA")
eq.change_resolution(L=N, M=N, N=N, L_grid=2 * N, M_grid=2 * N, N_grid=2 * N)
eq.resolution_summary()
eq.set_initial_guess()

obj = ObjectiveFunction(ForceBalance(eq), jac_chunk_size=500, deriv_mode="batched")
obj.build()
print(f"Objective function deriv mode: {obj._deriv_mode}")
print(f"Objective function chunk size: {obj._jac_chunk_size}")

# Jacobian size is 4375x25650, which is ~0.8 GB
eq.solve(
    objective=obj,
    constraints=None,
    optimizer="lsq-exact",
    ftol=1e-4,
    xtol=1e-6,
    gtol=1e-6,
    maxiter=5,
    x_scale="auto",
    verbose=2,
    copy=False,
)
```
[Plot: memory usage over time on master]
[Plot: memory usage over time on this PR]
For clarity, I deleted the previous result that used only the `J = J * d` trick.
| benchmark_name | dt(%) | dt(s) | t_new(s) | t_old(s) |
| -------------------------------------- | ---------------------- | ---------------------- | ---------------------- | ---------------------- |
| test_build_transform_fft_lowres | +4.12 +/- 4.44 | +2.39e-02 +/- 2.58e-02 | 6.04e-01 +/- 2.1e-02 | 5.80e-01 +/- 1.5e-02 |
| test_equilibrium_init_medres | +3.59 +/- 5.19 | +1.66e-01 +/- 2.40e-01 | 4.78e+00 +/- 1.9e-01 | 4.62e+00 +/- 1.5e-01 |
| test_equilibrium_init_highres | +2.21 +/- 3.42 | +1.15e-01 +/- 1.78e-01 | 5.30e+00 +/- 1.4e-01 | 5.19e+00 +/- 1.1e-01 |
| test_objective_compile_dshape_current | +1.52 +/- 1.00 | +5.49e-02 +/- 3.62e-02 | 3.67e+00 +/- 2.8e-02 | 3.61e+00 +/- 2.3e-02 |
| test_objective_compute_dshape_current | -0.30 +/- 2.54 | -1.05e-05 +/- 8.79e-05 | 3.45e-03 +/- 7.2e-05 | 3.46e-03 +/- 5.1e-05 |
| test_objective_jac_dshape_current | -0.43 +/- 14.99 | -1.34e-04 +/- 4.68e-03 | 3.11e-02 +/- 3.4e-03 | 3.12e-02 +/- 3.2e-03 |
| test_perturb_2 | +1.54 +/- 3.37 | +2.79e-01 +/- 6.13e-01 | 1.84e+01 +/- 5.8e-01 | 1.82e+01 +/- 1.9e-01 |
| test_proximal_jac_atf_with_eq_update | +0.40 +/- 0.75 | +6.04e-02 +/- 1.12e-01 | 1.51e+01 +/- 9.0e-02 | 1.51e+01 +/- 6.7e-02 |
| test_proximal_freeb_jac | +2.92 +/- 9.20 | +1.43e-01 +/- 4.51e-01 | 5.04e+00 +/- 3.1e-01 | 4.90e+00 +/- 3.2e-01 |
| test_solve_fixed_iter_compiled | -1.30 +/- 2.53 | -2.37e-01 +/- 4.61e-01 | 1.80e+01 +/- 1.2e-01 | 1.82e+01 +/- 4.5e-01 |
| test_LinearConstraintProjection_build | -0.67 +/- 2.68 | -5.86e-02 +/- 2.34e-01 | 8.69e+00 +/- 2.2e-01 | 8.75e+00 +/- 7.9e-02 |
| test_objective_compute_ripple_spline | -1.90 +/- 5.08 | -6.22e-03 +/- 1.66e-02 | 3.22e-01 +/- 9.1e-03 | 3.28e-01 +/- 1.4e-02 |
| test_objective_grad_ripple_spline | -0.44 +/- 3.85 | -5.52e-03 +/- 4.79e-02 | 1.24e+00 +/- 3.9e-02 | 1.25e+00 +/- 2.7e-02 |
| test_build_transform_fft_midres | +2.40 +/- 4.85 | +1.69e-02 +/- 3.42e-02 | 7.21e-01 +/- 2.1e-02 | 7.04e-01 +/- 2.7e-02 |
| test_build_transform_fft_highres | +0.67 +/- 4.47 | +6.52e-03 +/- 4.38e-02 | 9.86e-01 +/- 4.0e-02 | 9.80e-01 +/- 1.8e-02 |
| test_equilibrium_init_lowres | +3.83 +/- 2.48 | +1.57e-01 +/- 1.01e-01 | 4.24e+00 +/- 8.8e-02 | 4.08e+00 +/- 5.0e-02 |
| test_objective_compile_atf | -1.22 +/- 4.60 | -7.79e-02 +/- 2.94e-01 | 6.32e+00 +/- 1.5e-01 | 6.40e+00 +/- 2.5e-01 |
| test_objective_compute_atf | +6.08 +/- 14.24 | +5.13e-04 +/- 1.20e-03 | 8.95e-03 +/- 1.2e-03 | 8.44e-03 +/- 1.4e-04 |
| test_objective_jac_atf | +2.44 +/- 3.44 | +3.74e-02 +/- 5.27e-02 | 1.57e+00 +/- 4.0e-02 | 1.53e+00 +/- 3.4e-02 |
| test_perturb_1 | +3.21 +/- 5.48 | +4.56e-01 +/- 7.77e-01 | 1.46e+01 +/- 6.9e-01 | 1.42e+01 +/- 3.6e-01 |
| test_proximal_jac_atf | -1.38 +/- 3.07 | -1.06e-01 +/- 2.37e-01 | 7.59e+00 +/- 1.1e-01 | 7.70e+00 +/- 2.1e-01 |
| test_proximal_freeb_compute | +0.59 +/- 3.42 | +1.05e-03 +/- 6.12e-03 | 1.80e-01 +/- 5.3e-03 | 1.79e-01 +/- 3.0e-03 |
| test_solve_fixed_iter | -1.10 +/- 3.52 | -3.33e-01 +/- 1.07e+00 | 3.01e+01 +/- 3.7e-01 | 3.04e+01 +/- 1.0e+00 |
| test_objective_compute_ripple | +0.09 +/- 1.23 | +2.36e-03 +/- 3.37e-02 | 2.75e+00 +/- 2.7e-02 | 2.75e+00 +/- 2.1e-02 |
| test_objective_grad_ripple | +1.07 +/- 2.49 | +5.36e-02 +/- 1.24e-01 | 5.05e+00 +/- 5.5e-02 | 4.99e+00 +/- 1.1e-01 |
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 95.79%. Comparing base (f580c7a) to head (7a461e2).
:warning: Report is 273 commits behind head on master.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #1688      +/-   ##
==========================================
+ Coverage   95.67%   95.79%   +0.12%
==========================================
  Files         101      101
  Lines       26731    26753      +22
==========================================
+ Hits        25575    25629      +54
+ Misses       1156     1124      -32
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| desc/optimize/aug_lagrangian.py | 97.05% <100.00%> (+0.07%) :arrow_up: | |
| desc/optimize/aug_lagrangian_ls.py | 95.83% <100.00%> (+0.09%) :arrow_up: | |
| desc/optimize/fmin_scalar.py | 98.18% <100.00%> (+0.06%) :arrow_up: | |
| desc/optimize/least_squares.py | 99.36% <100.00%> (+0.02%) :arrow_up: | |
To be clear though, this doesn't resolve the memory leak issue #1686, but it does prevent storing multiple big arrays at the same time.
Yeah, I don't see how this actually fixes the underlying issue in #1686. From the plots above, it looks like this doesn't get rid of those spikes, which are likely the cause of the OOM after a few iterations; we should figure out what's actually causing that.
Also, the plots only show a reduction of ~500 MB; shouldn't it be a lot more?
@f0uriest The last plot is with the most recent changes, and there the memory decrease is ~1.8 GB. The peaks are the Jacobian evaluations, and those can be flattened out with `jac_chunk_size=1` (for my test case it was 500).
I think #1686 was right at the edge of the memory limit, and a small leak of 20-40 MB (according to @dpanici's further analysis) caused the OOM. We noticed these changes while looking at that issue, but this PR is not meant to solve it.
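The peak-flattening effect of chunking can be illustrated with a toy example: compute the Jacobian a few rows at a time so only a small block is materialized per step, then stack the blocks. This is only a sketch of the idea; DESC's actual `jac_chunk_size` machinery (which batches the derivative computation) differs in detail.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Toy residual function: R^3 -> R^6.
    return jnp.sin(jnp.outer(jnp.arange(1.0, 7.0), x)).sum(axis=1)

def chunked_jacobian(fun, x, chunk_size):
    """Compute the Jacobian in row blocks instead of all at once."""
    m = fun(x).shape[0]
    chunks = []
    for start in range(0, m, chunk_size):
        stop = min(start + chunk_size, m)
        # Only a (chunk_size x n_params) block is materialized per step.
        chunk_fun = lambda z, s=start, e=stop: fun(z)[s:e]
        chunks.append(jax.jacrev(chunk_fun)(x))
    return jnp.concatenate(chunks, axis=0)

x = jnp.array([0.1, 0.2, 0.3])
J_full = jax.jacfwd(f)(x)
J_chunked = chunked_jacobian(f, x, chunk_size=2)
```

With `chunk_size=1` the peak per step is a single Jacobian row, which is why the spikes flatten out at the cost of more (slower) passes.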
Do it in place; garbage collect manually in between steps.
Check for `fmintr` as well.
I shared some profiling results: https://github.com/PlasmaControl/DESC/issues/1686#issuecomment-2814372946 https://github.com/PlasmaControl/DESC/issues/1686#issuecomment-2814409983
I ran the profiling again before and after garbage collection, and it looks like there is no change during the optimization (the difference stays almost constant at the memory difference from time t=0). Also, `at[].set()` doesn't change anything; I think it is never in-place (except under jit), see https://docs.jax.dev/en/latest/_autosummary/jax.numpy.ndarray.at.html.
It looks like garbage collection has some speed effect at lower resolutions; maybe we can just run it once before the optimization?
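The out-of-place behavior of `at[].set()` is easy to demonstrate: outside of jit it always returns a new array and leaves the original untouched (inside jit, XLA may fuse the update into an in-place one as an optimization).

```python
import jax.numpy as jnp

x = jnp.zeros(4)
y = x.at[0].set(1.0)
# Functional update: x is unchanged, y is a brand-new array.
print(x)  # [0. 0. 0. 0.]
print(y)  # [1. 0. 0. 0.]
```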
Note: I should mention again that this was done on my personal laptop. Although GPU profiling is fairly isolated (normal apps don't use NVIDIA VRAM), CPU memory is shared with other apps.
Yeah, maybe we skip the gc if it seems to have a negligible effect. If the in-place stuff doesn't affect speed, we can keep it in, and when #1669 is eventually done, we will actually see the benefits.
Memory benchmark result
| Test Name | %Δ | Master (MB) | PR (MB) | Δ (MB) | Time PR (s) | Time Master (s) |
| -------------------------------------- | ------------ | ------------------ | ------------------ | ------------ | ------------------ | ------------------ |
| test_objective_jac_w7x | 8.31 % | 3.762e+03 | 4.075e+03 | 312.57 | 29.94 | 28.52 |
| test_proximal_jac_w7x_with_eq_update | -0.97 % | 6.925e+03 | 6.857e+03 | -67.33 | 172.99 | 170.98 |
| test_proximal_freeb_jac | 0.06 % | 1.317e+04 | 1.318e+04 | 7.62 | 71.69 | 74.39 |
| test_proximal_freeb_jac_blocked | -0.41 % | 1.320e+04 | 1.314e+04 | -54.11 | 70.18 | 69.37 |
| test_proximal_freeb_jac_batched | -0.97 % | 7.558e+03 | 7.485e+03 | -73.59 | 107.44 | 106.93 |
| test_proximal_jac_ripple | 0.22 % | 7.286e+03 | 7.303e+03 | 16.30 | 68.49 | 69.02 |
| test_proximal_jac_ripple_spline | -0.17 % | 3.826e+03 | 3.820e+03 | -6.64 | 75.41 | 75.93 |
| + test_eq_solve | -11.84 % | 2.249e+03 | 1.983e+03 | -266.44 | 122.44 | 122.35 |
For the memory plots, go to the summary of the Memory Benchmarks workflow and download the artifact.
Memory benchmark result
| Test Name | %Δ | Master (MB) | PR (MB) | Δ (MB) | Time PR (s) | Time Master (s) |
| -------------------------------------- | ------------ | ------------------ | ------------------ | ------------ | ------------------ | ------------------ |
| test_objective_jac_w7x | -0.15 % | 3.945e+03 | 3.939e+03 | -6.04 | 31.84 | 28.83 |
| test_proximal_jac_w7x_with_eq_update | -0.34 % | 6.902e+03 | 6.879e+03 | -23.60 | 174.22 | 172.89 |
| test_proximal_freeb_jac | 0.02 % | 1.317e+04 | 1.318e+04 | 3.01 | 71.00 | 70.69 |
| test_proximal_freeb_jac_blocked | 0.54 % | 1.319e+04 | 1.327e+04 | 71.57 | 69.99 | 71.58 |
| test_proximal_freeb_jac_batched | -2.12 % | 7.644e+03 | 7.482e+03 | -161.99 | 109.36 | 108.75 |
| test_proximal_jac_ripple | -0.07 % | 7.328e+03 | 7.323e+03 | -5.43 | 69.62 | 69.91 |
| test_proximal_jac_ripple_spline | 2.08 % | 3.831e+03 | 3.911e+03 | 79.84 | 77.24 | 77.45 |

For the memory plots, go to the summary of the Memory Benchmarks workflow and download the artifact.
These benchmarks don't have any optimization, so this shouldn't have any effect. If you want, I can add an optimization test with only 1 step.
Yeah, that might be useful. Just for `lsq-exact`, I'd say.
I will add a small tip on deciding which `jac_chunk_size` will make a difference. One can use the stdout at the beginning of the optimization (`Number of parameters: ...`) to decide the minimum chunk size that will help reduce memory.
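As a rough rule of thumb (my own sketch, not DESC output): the memory held by one Jacobian chunk scales as chunk size times the other Jacobian dimension, so chunking only helps once `jac_chunk_size` is well below the full dimension being chunked. Assuming the chunking is over the parameter (column) dimension:

```python
def jac_chunk_gb(chunk_size, n_rows, bytes_per_elem=8):
    # Memory held by one float64 Jacobian chunk of shape (n_rows, chunk_size).
    return chunk_size * n_rows * bytes_per_elem / 1024**3

# With the 4375 x 25650 Jacobian from the profiling script above:
full = jac_chunk_gb(25650, 4375)  # whole Jacobian, ~0.84 GB
chunk = jac_chunk_gb(500, 4375)   # one chunk at jac_chunk_size=500, ~0.016 GB
```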
@ddudt @rahulgaur104
I approve, but I wrote half of it so I can't actually approve.
Same for me @f0uriest @ddudt