
sjSDM - memory issues of anova()

Open • florianhartig opened this issue 1 year ago • 6 comments

Via email:

I'm trying to run anova() on an sjSDM model, as shown on the GitHub page:

an = anova(model)

Note that I'm using a remote server and the model was fitted on a GPU. Here is the error I get:

Error in py_call_impl(callable, call_args$unnamed, call_args$named) : torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 59.78 GiB. GPU 0 has a total capacity of 44.35 GiB of which 43.83 GiB is free. Including non-PyTorch memory, this process has 526.00 MiB memory in use. Of the allocated memory 179.71 MiB is allocated by PyTorch, and 24.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I opened a terminal and tried:

set 'PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True'

But it didn't change anything. The GPU has 48 GB of memory.
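A variable set in a separate terminal is not visible to an R session that is already running, and PyTorch reads PYTORCH_CUDA_ALLOC_CONF when it initialises its CUDA allocator. A minimal sketch, assuming the variable is set from within R at the top of a fresh session, before sjSDM (via reticulate, which embeds Python in the R process) starts torch and touches the GPU. Even when set correctly, expandable_segments only reduces fragmentation; it cannot help when a single request (59.78 GiB here) exceeds the card's total capacity (44.35 GiB).

```r
# Set the allocator option before Python/PyTorch is started in this R session.
# Assumption: this runs in a fresh session, before library(sjSDM) initialises torch.
Sys.setenv(PYTORCH_CUDA_ALLOC_CONF = "expandable_segments:True")

# Confirm that the variable is visible to this process (and the embedded Python)
print(Sys.getenv("PYTORCH_CUDA_ALLOC_CONF"))

# library(sjSDM)   # load afterwards, then fit the model as usual
```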

florianhartig • Apr 17 '24 08:04

Hi,

The model tried to allocate 60 GB, but your GPU has only 48 GB. Possible ways to reduce the memory consumption of your model:

  • Decrease the step_size
  • Decrease sampling
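Both knobs reduce how much is held in memory at once. As a back-of-the-envelope sketch, and assuming (not documented in this thread) that the memory of the Monte Carlo likelihood grows roughly in proportion to step_size × sampling, the relative saving of a new setting can be estimated with a hypothetical helper, relative_memory(), that is not part of sjSDM:

```r
# Hypothetical helper (not part of sjSDM): relative memory of a new setting
# versus an old one, ASSUMING memory grows proportionally with step_size * sampling.
# The number of species cancels out because it is the same in both settings.
relative_memory <- function(step_size_new, sampling_new,
                            step_size_old, sampling_old) {
  (step_size_new * sampling_new) / (step_size_old * sampling_old)
}

# Example: halve the batch size and halve the number of MC samples
print(relative_memory(step_size_new = 50, sampling_new = 50,
                      step_size_old = 100, sampling_old = 100))
```

Under that assumption, halving both settings cuts the estimate to a quarter; the real footprint also depends on model internals not shown here.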

What are the dimensions of your data? And could you please share your code? There may be another problem.

Best, Max

MaximilianPi • Apr 17 '24 08:04

Hi, thanks for the quick answer. The data comprise 31,000 rows and 2,000 species, and I included only 3 variables to check how the model runs with a smaller set of variables. It ran in 7 minutes. Here is the code I used:

model <- sjSDM(Y = RLS_matrix,
               env = linear(data = as.matrix(all_cov_selection),
                            formula = ~ Sand + Rock + seagrass),
               spatial = linear(data = as.matrix(coords),
                                formula = ~ 0 + longitude:latitude),
               se = FALSE,
               family = binomial("logit"),
               sampling = 100L,
               device = "gpu")

Thanks

Loic-sanchez • Apr 17 '24 11:04

By default, step_size (the batch size) is set to 10% of the rows in your data, which can consume a lot of memory for large datasets, so you might want to set it explicitly when fitting your full data:

model <- sjSDM(Y = RLS_matrix,
               env = linear(data = as.matrix(all_cov_selection),
                            formula = ~ .),      # all columns of all_cov_selection
               spatial = linear(data = as.matrix(coords),
                                formula = ~ 0 + longitude:latitude),
               se = FALSE,
               step_size = 100L,                 # 100 rows per batch instead of the 10% default
               family = binomial("logit"),
               sampling = 100L,
               device = "gpu")
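For the data in this thread (31,000 rows), a 10% default means batches of 3,100 rows; with step_size = 100L each batch is 31 times smaller, at the cost of more batches per epoch. The arithmetic as a base R sketch (the row count is taken from the thread; rounding the 10% to a whole number of rows is an assumption about how the default is computed):

```r
n_rows <- 31000L                          # rows in RLS_matrix, as reported above

default_step <- round(0.1 * n_rows)       # default step_size: 10% of the rows
custom_step  <- 100L                      # step_size used in the call above

print(default_step)                       # rows per batch by default
print(n_rows / custom_step)               # batches per epoch with step_size = 100L
print(default_step / custom_step)         # how many times smaller each batch is
```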

MaximilianPi • Apr 17 '24 12:04

Thanks, it worked: there was no error at the end of the command. However, when I ran

an = anova(model)

Although the progress bar reaches the end in 7 minutes, the process never finishes and the 'an' object never appears in the environment. I had the same issue when I ran the model on my own machine (with device = "cpu"): the anova took hours to run, and in the end the object was not created. Any clues?

Loic-sanchez • Apr 18 '24 09:04

Hi @Loic-sanchez, the anova can be very slow. The default number of Monte Carlo (MC) samples in the anova is 5000 (samples = 5000L), so maybe rerun it with only 100 to 500 samples:

an = anova(model, samples = 500L)
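Fewer samples trade precision for speed. Assuming the anova's MC samples feed a standard Monte Carlo average, whose standard error shrinks with the square root of the number of samples, going from 5000L to 500L makes the MC error roughly √10 ≈ 3.2 times larger. A small base R simulation of that scaling, using a generic MC mean of normal draws as a stand-in rather than sjSDM's internals (the helper mc_se() is hypothetical):

```r
set.seed(42)

# Hypothetical helper: empirical standard error of a Monte Carlo mean
# estimated from n draws, measured over many repeated estimates.
mc_se <- function(n, reps = 2000) {
  estimates <- replicate(reps, mean(rnorm(n)))
  sd(estimates)
}

se_5000 <- mc_se(5000)
se_500  <- mc_se(500)

print(se_500 / se_5000)   # close to sqrt(5000 / 500) = sqrt(10), about 3.16
```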

MaximilianPi • Apr 18 '24 13:04

Hello, the same thing happens: the anova runs quickly (~7 minutes), but the process never seems to end, and no object is created.

I ran the function with both 500 and 100 samples.

Loic-sanchez • Apr 22 '24 09:04