
CUDA out of memory

Open MyDum-bsu opened this issue 9 months ago • 4 comments

Following the openwebtext example, I have run into errors like this several times.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 784.00 MiB. GPU 2 has a total capacity of 79.15 GiB of which 97.88 MiB is free. Process 2358418 has 5.68 GiB memory in use. Process 207602 has 16.28 GiB memory in use. Process 2212797 has 16.80 GiB memory in use. Process 1399451 has 16.28 GiB memory in use. Process 2825982 has 16.80 GiB memory in use. Including non-PyTorch memory, this process has 7.18 GiB memory in use. Of the allocated memory 5.21 GiB is allocated by PyTorch, and 312.71 MiB is reserved by PyTorch but unallocated.

At different times I use different numbers of NVIDIA H100 80GB GPUs (1-8). Note that not all of them are always fully free (but they have enough free memory, according to nvidia-smi). This happens with both 8B and 124M models (less often with the latter).

The purpose of this issue is to clear up the confusion about how much memory is needed, both in general and at particular stages of the algorithm. Thank you.

MyDum-bsu avatar Mar 31 '25 21:03 MyDum-bsu

The main issue can be considered closed; all the necessary steps are described in the documentation. But I have two new questions:

  1. In Appendix A, you talk about the block-diagonal approximation of matrices A and S, but I can't find its implementation in the code. Does it exist in this repository, and if not, how do you perform this approximation?

  2. Does the evaluation parameter query_gradient_accumulation_steps affect the accuracy or only the speed of calculations?

MyDum-bsu avatar Apr 16 '25 19:04 MyDum-bsu

Sorry for the delay in my responses!

  1. Unfortunately, we don't have the block-diagonal approximation implemented. If you are running into memory issues, you could increase `module_partitions` so that large matrices are fitted iteratively (see the sketch after this list). Implementing the block-diagonal approximation would require several changes to the codebase (covariance, Lambda). I'm happy to give specific pointers if you are interested in this.
  2. It should not affect the accuracy (I believe there is a test for this). If you want the average influence over many queries (rather than individual queries), increasing the accumulation steps is the right thing to do.
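
Here is a minimal configuration sketch of the partitioning workaround. The parameter names follow the ones discussed in this thread; their exact spelling and placement in `FactorArguments`/`ScoreArguments` should be checked against the current kronfluence documentation.

```python
# Sketch: raise partition counts so large matrices are fitted/used in chunks
# rather than all at once. Field names follow this thread and may need to be
# adjusted to match the current API.
from kronfluence.arguments import FactorArguments, ScoreArguments

factor_args = FactorArguments(
    covariance_module_partitions=4,  # fit covariance factors for a subset of modules at a time
    lambda_module_partitions=4,      # likewise for the Lambda (eigenvalue) factors
)
score_args = ScoreArguments(
    module_partitions=4,  # compute scores over groups of modules to cap peak memory
    data_partitions=2,    # also split the training data into chunks
)
```

Partitioning does not change the approximation itself; it trades extra passes (compute and disk I/O) for lower peak GPU memory.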

pomonam avatar Apr 16 '25 19:04 pomonam

Thanks!

  1. For now I will use the existing functionality, but in the future I would like to look into how the block-diagonal approximation works. I would greatly appreciate your recommendations.

  2. However, it is important for me to get more accurate values for specific queries. Apparently this parameter does not need to be enabled; I did not fully understand its purpose, even with the example from the documentation (:

  • query_gradient_accumulation_steps: Number of query gradients to accumulate over. For example, when query_gradient_accumulation_steps=2 with query_batch_size=16, a total of 32 query gradients will be stored in memory when computing dot products with training gradients.
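
My current reading of that description as a configuration sketch (assuming the parameter is passed through `ScoreArguments`, which is my guess beyond the quoted name):

```python
# Sketch of the quoted setting; only query_gradient_accumulation_steps comes from
# the documentation above, the rest is assumed for illustration.
from kronfluence.arguments import ScoreArguments

score_args = ScoreArguments(
    query_gradient_accumulation_steps=2,  # accumulate 2 query batches of gradients
)
# With a per-device query batch size of 16, 2 * 16 = 32 query gradients are held
# in memory at once while computing dot products against the training gradients,
# which, as noted above, should change memory/speed rather than the resulting scores.
```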

I'll add more questions.

  1. Please tell me, do the data_partitions and module_partitions parameters affect the accuracy? And in what cases is it useful to enable the `aggregate_query_gradients` / `aggregate_train_gradients` parameters?

  2. Anthropic published an article not long ago in which the model is replaced by a "replacement model", and linear-layer gradients are also required to train it. Do you think it is possible to integrate the training of this model into the Hessian-approximation stage?

MyDum-bsu avatar Apr 17 '25 06:04 MyDum-bsu

  1. If you would like to implement the block-diagonal approximation, the easiest way would be to keep track of a list of covariances (e.g., https://github.com/pomonam/kronfluence/blob/main/kronfluence/module/tracked_module.py#L131) and modify the code accordingly to make use of the block-diagonal structure (doing eigendecompositions for each block instead of on the full matrix). I expect this to be a large change to the code; a standalone sketch of the block-wise eigendecomposition appears after this list.
  2. Hmm, that's odd. The results should be the same with different steps (although they might not be bit-identical due to precision). Conceptually, it just uses a different batch size for computing the query gradients.
  3. The results should be the same; it is mostly for memory vs. compute trade-off.
  4. Sorry, I don't have much context on the article and didn't understand your question.
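
To make point 1 concrete, here is a small standalone PyTorch sketch of the block-wise eigendecomposition idea. It is not part of kronfluence; the block count and the even-split assumption are illustrative only.

```python
# Block-diagonal approximation: eigendecompose each diagonal block of a covariance
# matrix instead of the full matrix, discarding the off-diagonal blocks.
import torch


def blockwise_eigendecomposition(cov: torch.Tensor, num_blocks: int):
    """Return per-block eigenvalues/eigenvectors of the diagonal blocks of `cov`."""
    d = cov.shape[0]
    assert d % num_blocks == 0, "sketch assumes the dimension splits evenly"
    block = d // num_blocks
    eigenvalues, eigenvectors = [], []
    for i in range(num_blocks):
        s, e = i * block, (i + 1) * block
        # Only the diagonal block is decomposed; off-diagonal entries are ignored.
        evals, evecs = torch.linalg.eigh(cov[s:e, s:e])
        eigenvalues.append(evals)
        eigenvectors.append(evecs)
    return eigenvalues, eigenvectors


# Example: a random symmetric PSD matrix standing in for an accumulated covariance.
x = torch.randn(512, 1024)
cov = x.T @ x / x.shape[0]
evals, evecs = blockwise_eigendecomposition(cov, num_blocks=4)
```

Each block decomposition is on a (d/num_blocks) x (d/num_blocks) matrix, so both the memory and the cubic eigendecomposition cost drop substantially, at the price of a coarser approximation of A and S.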

pomonam avatar Apr 19 '25 08:04 pomonam