Memory usage in the EFP batch_compute function
Hi, I'm trying to calculate the n=4, d=4 EFPs for 100k jets with 30 particles each. Every time I use the batch_compute function my program's memory usage jumps up by ~8GB even though the total output should only be of ~100MB. I was wondering if this is expected and if there is any way to lower the memory usage?
The batch_compute method uses Python's multiprocessing module; some increase in memory usage is expected since multiprocessing creates new processes to carry out the computations, which can involve copying arrays, etc. That said, I wouldn't necessarily expect as large of a memory jump as you're reporting. Can you tell me which Python version, EnergyFlow version, and OS you're using?
How exactly are you creating the EFPSet that you're calling batch_compute on? If one just does EFPSet('n==4', 'd==4'), which is what it sounds like you might be doing, then one gets warnings that not all connected EFPs needed for the n=4, d=4 disconnected EFPs are going to be computed. Hence you should either specify that you want only connected ones by adding 'p==1', or you should use EFPSet('n<=4', 'd<=4'). This isn't necessarily related to the memory usage, but it might be relevant for your use case.
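For concreteness, a minimal sketch of the two constructions described above (variable names are my own; the spec strings follow the selections quoted in this thread):

```python
import energyflow as ef

# Only the connected (prime) EFPs with exactly n=4 vertices and degree d=4
efpset_connected = ef.EFPSet('n==4', 'd==4', 'p==1')

# All EFPs, connected and disconnected, up to n=4 and d=4
efpset_all = ef.EFPSet('n<=4', 'd<=4')
```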
Python 3.8.1, EnergyFlow 1.3.0, Ubuntu 18.04
I should have mentioned - I do use p=1. This is the exact line:
efpset = ef.EFPSet(('n==', 4), ('d==', 4), ('p==', 1), measure='hadr', beta=1, normed=None, coords='ptyphim')
And this is a graph showing the typical memory usage of my program: each spike onset corresponds to running batch_compute, and the drops correspond to deleting the output of batch_compute.
Using Python 3.8.5 on Ubuntu 20.04, if I run commands as shown below, I never experience the increased memory usage you're describing.
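The exact commands aren't preserved in this thread; a rough reconstruction of that kind of test might look like the sketch below. The random-jet generation and array shapes are my own assumptions (100k jets of 30 particles in (pt, y, phi, m) coordinates), not the original code:

```python
import numpy as np
import energyflow as ef

# Same EFPSet as in the report above
efpset = ef.EFPSet(('n==', 4), ('d==', 4), ('p==', 1),
                   measure='hadr', beta=1, normed=None, coords='ptyphim')

# 100k fake jets with 30 particles each, in (pt, y, phi, m) coordinates
rng = np.random.default_rng(0)
jets = np.concatenate([
    rng.uniform(0.1, 10.0, size=(100000, 30, 1)),     # pt
    rng.uniform(-2.5, 2.5, size=(100000, 30, 1)),     # y
    rng.uniform(0.0, 2 * np.pi, size=(100000, 30, 1)),# phi
    np.zeros((100000, 30, 1)),                        # m
], axis=-1)

results = efpset.batch_compute(jets)
print(results.shape)  # expected: (100000, 5)
```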
You said that the memory usage decreases after you delete the output of the computation, and not just when the computation finishes? That would suggest something that is not the multiprocessing module, since the processes it uses will be terminated when the computation is finished. Can you confirm that the output is an array with size (100000, 5)?
Just for more information: directly before/after the computation, can you run multiprocessing.get_start_method()? EnergyFlow is designed to work with 'fork'.
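For reference, checking the start method is a one-liner from the standard library:

```python
import multiprocessing

# EnergyFlow's batch_compute is designed around 'fork' (the default on Linux)
print(multiprocessing.get_start_method())
```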
I have looked at the memory usage more carefully and I believe that the memory spike only occurs during the batch_compute call. (I think that because of the low granularity of the plot it seemed like the spike fell after the output was deleted, but I have now checked with the guppy3 module and I see that the memory usage is as expected before and after the batch_compute call, i.e. an increase of ~100MB.)
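A check along those lines with guppy3 might look like the following sketch; the variable names and how the result is printed are my own, only the before/after heap comparison is what's described above:

```python
from guppy import hpy  # provided by the guppy3 package

h = hpy()
before = h.heap().size                # parent-process heap size in bytes before the call
results = efpset.batch_compute(jets)  # efpset and jets as defined earlier
after = h.heap().size                 # heap size in bytes after the call
print('heap grew by ~{:.0f} MB'.format((after - before) / 1e6))
```

Note that guppy3 only inspects the parent process's Python heap, so memory held by the worker processes spawned during the call would not show up in this measurement.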
The shape of the output is indeed (100000, 5), and the output of multiprocessing.get_start_method() before and after the computation is 'fork'.
I think I finally found the issue here. This was only occurring when using a Kubernetes pod where I would be assigned a specific number of cores and amount of memory on a node. I'm guessing that by default batch_compute would try to use as many processes as there were total cores on the node, instead of how many I was assigned. As long as I specify efp_jobs to be the number of cores assigned, the memory usage is normal.
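For anyone hitting the same thing, the fix amounts to capping the number of worker processes at the number of cores actually available to the pod rather than the number on the node. A sketch of that idea is below; whether the job-count argument is spelled efp_jobs (as above) or n_jobs depends on the interface you're calling, so check the signature of your batch_compute — the kwarg name here is my assumption:

```python
import os

# On Linux this reports the CPUs this process is actually allowed to run on,
# which is often more honest inside a container than os.cpu_count()
n_available = len(os.sched_getaffinity(0))

results = efpset.batch_compute(jets, n_jobs=n_available)
```

Depending on how the pod is configured, the affinity mask can still differ from a cgroup CPU quota, so hard-coding the number of requested cores (as described above) is the safest option.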