OOM after ~ 50 epochs

Open sboehringer opened this issue 1 year ago • 10 comments

When running BayesFlow to analyze regression models, I get OOM errors after about 50-60 epochs. The simulations need a data set of covariates from which a one-dimensional outcome is simulated. Here are my batched Prior/Simulation classes:

import numpy as np

class MyPrior:
	def __init__(self, prior):
		self.mu = prior['mu']
		self.sigma = prior['sigma']
		self.rng = np.random.default_rng().normal
	def single(self):
		return self.rng(self.mu, scale = self.sigma)
	def batch(self, N):
		return [self.single() for i in range(0, N)]
	def __call__(self, batch_size):
		pars = self.batch(batch_size)
		return np.array(pars)

class RegressionSimulator:
	def __init__(self, dCov, model):
		self.dCov = dCov
		self.model = model
	def single(self, par):
		d = simulateOutcome(par, self.dCov, self.model)
		return d.reshape([d.shape[0], 1])
	def batch(self, par):
		return [self.single(p) for p in par]
	def __call__(self, par, *args):
		return np.array(self.batch(par))

The simulation is set up as:

import bayesflow as bf

def doRegress(o, a,
	# pass-through args
	model, N, pars, Par,
	Nepochs, Nbatch, NsumD, Nit, Nval, Nho, Npost, weight,
	post_height, post_alpha, post_color):

	Log(2, Sprintf('Running regression model: %{regress}s, sample size N=%{N}d', o));

	# <p> simulate real data set
	d = simulateFromSpec(model, N = int(N))

	rSim = RegressionSimulator(d['dCov'], model['outcome']) 
	simulator = bf.simulation.Simulator(rSim)

	prior = bf.simulation.Prior(MyPrior(model['prior']));
	generative_model = bf.simulation.GenerativeModel(prior, simulator, simulator_is_batched = True)

	summary_net = bf.networks.SetTransformer(input_dim = NsumD)
	inference_net = bf.networks.InvertibleNetwork(num_params = len(Par))
	amortized_posterior = bf.amortizers.AmortizedPosterior(inference_net, summary_net)

	trainer = bf.trainers.Trainer(amortizer=amortized_posterior, generative_model=generative_model)
	losses = trainer.train_online(epochs=Nepochs, iterations_per_epoch=Nit, batch_size=Nbatch, validation_sims=Nval)

This is on an NVIDIA GeForce RTX 3070 with 8 GB of GPU RAM. The sample size is ~300 with up to 4 covariates, so this is a small data set.

Thank you.

sboehringer avatar Apr 15 '24 13:04 sboehringer

Can you please paste the error stack here to investigate? Do you run out of GPU memory or RAM?

stefanradev93 avatar Apr 15 '24 13:04 stefanradev93

It is GPU RAM. Please find the log below (wrapped as captured from a tmux session). I can also provide a self-contained example for reproduction, if helpful.

debug-log-20240415.txt

sboehringer avatar Apr 15 '24 13:04 sboehringer

I should add that the suggested fix from the output (TF_GPU_ALLOCATOR=cuda_malloc_async) did not change anything.

sboehringer avatar Apr 15 '24 13:04 sboehringer

Given that the error occurs only after 50 or 60 epochs, it looks as if memory accumulates somehow, though I don't see why or where this might happen. Does the code run fast enough to test it without a GPU? If it does, could you run it on CPU only (using $ CUDA_VISIBLE_DEVICES='' ./bayesFlow.py --regress linearMVmi) and monitor RAM usage to see whether it increases with epochs? Does it run out of memory in that case as well? This would help to pinpoint whether the error is GPU-related or more general.

If this is not possible, a self-contained example for reproduction would be a great help.
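As a rough illustration of the per-epoch monitoring suggested above, the sketch below logs the resident memory of the running process using only the Python standard library. The helper names `rss_gib` and `log_memory` are made up for this example and are not part of BayesFlow; the code assumes a Unix-like system (it reads `/proc/self/statm` on Linux and otherwise falls back to the peak value from `resource`), and the commented loop at the end is only a hypothetical way to call it around training.

```python
import os
import resource
import sys


def rss_gib():
    """Return this process's resident memory in GiB (best effort)."""
    try:
        # Linux: the second field of /proc/self/statm is the number of resident pages.
        with open("/proc/self/statm") as f:
            pages = int(f.read().split()[1])
        return pages * os.sysconf("SC_PAGE_SIZE") / 2**30
    except (OSError, ValueError, IndexError):
        # Fallback: peak RSS, reported in KiB on Linux and in bytes on macOS.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        scale = 1 if sys.platform == "darwin" else 1024
        return peak * scale / 2**30


def log_memory(label):
    """Print the current memory reading with a label and return it."""
    usage = rss_gib()
    print(f"{label}: {usage:.3f}g")
    return usage


# Hypothetical usage (assumption: training is driven one epoch at a time,
# which may not match how the original script calls the trainer):
# for epoch in range(1, Nepochs + 1):
#     log_memory(f"Beginning Epoch {epoch}")
#     ...  # run one epoch of training here
#     log_memory(f"End Epoch {epoch}")
```

If the end-of-epoch readings keep climbing by a similar amount on CPU as well, that would point to a general leak rather than something specific to the GPU allocator.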

vpratz avatar Apr 15 '24 17:04 vpratz

Are you running the training from a script or from a Jupyter notebook?

stefanradev93 avatar Apr 15 '24 17:04 stefanradev93

> Are you running the training from a script or from a Jupyter notebook?

The command is at the top of the log; it is a training script.

vpratz avatar Apr 15 '24 17:04 vpratz

I see. I would need to run it on my GPU workstation to reproduce the problem.

stefanradev93 avatar Apr 15 '24 17:04 stefanradev93

@vpratz: here is memory usage using the CPU

End Epoch 4: 2.564g
Beginning Epoch 5: 2.682g
~ 100/1000 Iterations: 2.690g
End Epoch 5: 2.693g

Memory from the previous epoch does not seem to be freed or reused. After that, there is a further initial increase of about 8 MB, and then memory usage becomes stable. This pattern seems to repeat in the following epochs.

These numbers seem compatible with what I see on the GPU: 7.5G/130M would equate to ~57 epochs until OOM.
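The extrapolation can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the two readings and the 7.5G figure are copied from this post, and it assumes that memory grows by the same amount every epoch and that GPU memory grows at the same rate as on the CPU.

```python
# End-of-epoch readings from the CPU run reported above (in GB).
end_epoch_4 = 2.564
end_epoch_5 = 2.693

# Memory that is not released between two consecutive epochs.
growth_per_epoch = end_epoch_5 - end_epoch_4

# Figure used in the post for the GPU memory available before OOM (in GB).
usable_gpu_memory = 7.5

# Assumption: linear growth at the same per-epoch rate on the GPU.
epochs_until_oom = usable_gpu_memory / growth_per_epoch

print(f"growth per epoch: {growth_per_epoch * 1000:.0f}M")
print(f"estimated epochs until OOM: {epochs_until_oom:.0f}")
```

With the exact difference of about 129M per epoch, the estimate comes out at roughly 58 epochs, in line with the ~57 obtained from the rounded 130M and with the observed 50-60 epochs.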

I will put together a self-contained example for reproduction. Thank you.

sboehringer avatar Apr 17 '24 08:04 sboehringer

Here is an example that has been tested for OOM. bayesFlow-debug.txt

It can be run directly without arguments. It shouldn't read from or write to disk.

sboehringer avatar Apr 17 '24 10:04 sboehringer

Thanks, I will investigate if that's a problem that specifically affects the SetTransformer!

stefanradev93 avatar Apr 17 '24 13:04 stefanradev93