GPU memory usage seems abnormal
In the paper, Section 4.1.4 "Hardware Environment and Hyperparameters" states: "Unless otherwise specified, for each experiment, we use an NVIDIA GeForce RTX 4090 24GB card for GPU computation, two Intel(R) Xeon(R) Gold 6330 CPUs for K-Means clustering, 500GB CPU memory, and PCI-e 1.0 (x16) for communication."
My hardware environment is 2x RTX 3090 with 128GB CPU memory. When I run vq_pred.py, it uses both GPUs, and the memory usage on each GPU exceeds 23GB. Do you know the reason?
I have tried llama-3.1-8B and llama-3.2-1B; the same problem occurs with both.
By default our script uses 2 GPUs for evaluation, but you can modify this line to run on a single GPU: https://github.com/HugoZHL/PQCache/blob/778c904e16eb577fb37b94b5a714b7f39f7db91d/run_llama.sh#L8
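As an alternative to editing run_llama.sh, a minimal sketch of the same idea is to restrict which devices the process can see via `CUDA_VISIBLE_DEVICES` before PyTorch initializes; the exact variable edited in run_llama.sh may differ, but the environment-variable approach has the same effect:

```python
import os

# Expose only GPU 0 to this process; must happen before torch touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # expected to print 1
```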
The high memory usage can be attributed to PyTorch's caching memory allocator, which keeps freed blocks in an internal pool for reuse instead of immediately invoking cudaFree, so the per-GPU usage reported by nvidia-smi can be much higher than the memory actually in use.
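A small illustration of this behavior (not PQCache-specific): after a tensor is deleted, `memory_allocated` drops but `memory_reserved` stays high until the cache is explicitly released.

```python
import torch

# Allocate ~4 GiB of fp32 on the GPU.
x = torch.empty(1024, 1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 2**30)  # ~4 GiB in use by tensors

del x
print(torch.cuda.memory_allocated() / 2**30)  # ~0 GiB in use by tensors
print(torch.cuda.memory_reserved() / 2**30)   # still ~4 GiB held in PyTorch's cache

torch.cuda.empty_cache()                      # return cached blocks to the driver
print(torch.cuda.memory_reserved() / 2**30)   # now close to 0
```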