GPU memory usage seems abnormal
In the paper, Section 4.1.4 "Hardware Environment and Hyperparameters" states: "Unless otherwise specified, for each experiment, we use an NVIDIA GeForce RTX 4090 24GB card for GPU computation, two Intel(R) Xeon(R) Gold 6330 CPUs for K-Means clustering, 500GB CPU memory, and PCI-e 1.0 (x16) for communication."
My hardware environment is 2x RTX 3090 with 128GB CPU memory. When I run vq_pred.py, it uses both GPUs, and the memory usage on each GPU exceeds 23GB. Do you know the reason?
I have tried llama-3.1-8B and llama-3.2-1B; the same problem occurs with both.
By default our script uses 2 GPUs for evaluation, but you can modify this line to run on a single GPU: https://github.com/HugoZHL/PQCache/blob/778c904e16eb577fb37b94b5a714b7f39f7db91d/run_llama.sh#L8
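As an alternative to editing run_llama.sh, a minimal sketch of the same idea is to restrict which devices the process can see via `CUDA_VISIBLE_DEVICES` before PyTorch initializes; the exact variable edited in run_llama.sh may differ, but the environment-variable approach has the same effect:

```python
import os

# Expose only GPU 0 to this process; must happen before torch touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # expected to print 1
```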
The high memory usage can be attributed to PyTorch's caching memory allocator, which keeps freed blocks in an internal pool for reuse instead of immediately invoking cudaFree, so the per-GPU usage reported by nvidia-smi can be much higher than the memory actually in use.
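A small illustration of this behavior (not PQCache-specific): after a tensor is deleted, `memory_allocated` drops but `memory_reserved` stays high until the cache is explicitly released.

```python
import torch

# Allocate ~4 GiB of fp32 on the GPU.
x = torch.empty(1024, 1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 2**30)  # ~4 GiB in use by tensors

del x
print(torch.cuda.memory_allocated() / 2**30)  # ~0 GiB in use by tensors
print(torch.cuda.memory_reserved() / 2**30)   # still ~4 GiB held in PyTorch's cache

torch.cuda.empty_cache()                      # return cached blocks to the driver
print(torch.cuda.memory_reserved() / 2**30)   # now close to 0
```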