Memory leak with tfb.Glow?
Info: tf-nightly-gpu and tfp-nightly; code tested on a single Tesla V100, Ubuntu, CUDA 11.0, cuDNN 8.
RAM grows linearly while training a tfb.Glow-based tfd.TransformedDistribution. It takes 16 GB when training starts and grows by about 0.1 GB every 5 seconds. Here's my code:
```python
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
import tensorflow_probability as tfp

tfb = tfp.bijectors
tfd = tfp.distributions


def augment(img):
    # Scale pixel values to [0, 1].
    img = tf.cast(img, tf.float32) / 255.
    return img


(train_data, _), (test_data, _) = tf.keras.datasets.cifar10.load_data()
train_data = tf.data.Dataset.from_tensor_slices(train_data).map(augment).batch(128)

output_shape = (32, 32, 3)
transformed_distribution = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(
        loc=tf.zeros((tf.math.reduce_prod(list(output_shape)),)),
        scale_diag=tf.ones((tf.math.reduce_prod(list(output_shape)),))),
    bijector=tfb.Glow(output_shape=output_shape,
                      coupling_bijector_fn=tfb.GlowDefaultNetwork,
                      exit_bijector_fn=tfb.GlowDefaultExitNetwork),
    name='Glow_distribution')

optimizer = tf.keras.optimizers.Adam(1e-4)


@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        log_prob_loss = -transformed_distribution.log_prob(x)
    variables = tape.watched_variables()
    grads = tape.gradient(log_prob_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return tf.reduce_mean(log_prob_loss)


for epoch in range(100):
    for td in train_data:
        loss = train_step(td)
        print(loss, '*', end='\r')
```
Some logs:

```
WARNING:tensorflow:From /home/abc/anaconda3/envs/tfn/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5047: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
2020-11-24 20:40:11.190610: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
```
It also takes ~3 minutes to trace/compile the tf.function graph before training starts.
I tried the same code on a different machine (almost the same configuration and environment, except for the Ubuntu version I guess), and memory no longer grows that fast; I can barely notice it growing.
Then I modified glow.py a little bit to let kwargs be passed into the RealNVP block. I modified make_bijector_fn in GlowBlock and put a name_scope on the bijectors in GlowBlock so that I can identify which bijector should receive which specific kwargs. And then, somehow magically, it no longer occupies 16 GB of RAM; instead it occupies ~6000 MB during training. So I'll close this issue.
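For context, a rough, hypothetical sketch of the kwargs-routing idea described above (names and kwargs here are purely illustrative; the actual change was made inside glow.py's GlowBlock/make_bijector_fn, which this does not reproduce):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfb = tfp.bijectors


def scoped_coupling_fn(scope_name, per_scope_kwargs):
    """Builds the coupling network under a name_scope, forwarding whichever
    kwargs were registered for that scope name. `per_scope_kwargs` and its
    contents are hypothetical knobs, not part of the actual glow.py patch."""
    kwargs = per_scope_kwargs.get(scope_name, {})

    def build(input_shape):
        with tf.name_scope(scope_name):
            # Built the same way the code above uses tfb.GlowDefaultNetwork,
            # just wrapped so extra kwargs can be injected per scope.
            return tfb.GlowDefaultNetwork(input_shape, **kwargs)

    return build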
Wow!
As the model gets bigger (image size (64, 64, 3) and num_glow_blocks=4), the memory leak still exists, although it grows more slowly. And graph compilation is SUPER slow: it takes about 10 minutes to start training. The original setup (image size (32, 32, 3), num_glow_blocks=3, and steps=32) is already kind of slow, but image size (64, 64, 3) with num_glow_blocks=4 is even slower. Please help me!

```
  PID USER  PRI  NI  VIRT  RES   SHR   S  CPU%  MEM%  TIME+    Command
34953 root   20   0  126G  102G  1521M S  111.  40.7  1h43:31  python
```
I encounter the same issue using other bijectors. It takes 103 GB of RAM, which is a bit insane... Any tips to fix this?
Which other bijectors did you try? I tried to reproduce the original issue, and at first glance, the issue was that Glow just has a gigantic graph with its default parameters. I reduced num_steps_per_block to 16 and ran it on Google Colab's GPU runtime, and didn't observe any leaks, although the graph tracing(?) and memory usage overall seemed excessive.
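For reference, a minimal sketch of that reduced configuration (reusing the setup from the code above; num_glow_blocks and num_steps_per_block otherwise default to 3 and 32):

```python
# Smaller Glow: fewer steps per block means a much smaller traced graph.
glow = tfb.Glow(output_shape=(32, 32, 3),
                num_glow_blocks=3,
                num_steps_per_block=16,  # reduced from the default of 32
                coupling_bijector_fn=tfb.GlowDefaultNetwork,
                exit_bijector_fn=tfb.GlowDefaultExitNetwork)
```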
I agree we should make this better, somehow, but we'd first need to figure out what is actually taking up the memory.
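One way to narrow that down (my suggestion, not something from the original report) is to log the process's resident memory alongside the training loop, reusing train_step and train_data from the code above:

```python
# Log host RSS every N steps to see whether it keeps growing with step count
# (a leak) or flattens out after tracing finishes (one-time graph cost).
import os

import psutil

proc = psutil.Process(os.getpid())

for step, td in enumerate(train_data):
    loss = train_step(td)
    if step % 50 == 0:
        rss_gb = proc.memory_info().rss / 1e9
        print(f'step {step}: loss={float(loss):.3f}, host RSS={rss_gb:.2f} GB')
```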
As I mentioned above, the memory-leak problem was solved by switching to another machine, and I no longer encounter it. The remaining problem is that bijectors take too much RAM and are slow at building the graph (it takes very long to start training when the model is big).
I tried RealNVP with an affine bijector (the one implemented in Glow), RealNVP with RationalQuadraticSpline, and FFJORD. They all take a huge amount of host RAM while using less than about 9000 MB of GPU memory.
I have one more question about Glow. Would you mind answering it for me, please? @SiegeLordEx I noticed the multiscale architecture (ExitBijector) is a bit different from the one originally proposed in Glow.
I implemented a new one following the original repo. Would you mind checking whether it is correct? https://gist.github.com/gitlabspy/d47fe54a931145b725c0fb5b92e24690
Could you comment on how your version differs from TFP's (on a computational level; it might well arrive at the same result in a different way)? In particular, we don't add randomness at each scale explicitly; rather, we add it implicitly by passing a part of the latent vector through unchanged (see the diagram here: https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/bijectors/glow.py#L139). We certainly referred to the original version when implementing TFP's multiscale architecture, but perhaps we made a mistake at some point.
As for the memory usage, I don't have any progress to report just yet.
Yes, the leaving part is unchanged, but it has to obey a prior parameterized by the staying part (normally a Gaussian whose mean and std are produced by passing the staying part through a shallow layer). See here: https://github.com/openai/glow/blob/91b2c577a5c110b2b38761fc56d81f7d87f077c1/model.py#L546-584.
In the forward pass (image to latent), the leaving part leaves and can be stored unchanged. Additionally, there's a split prior over the leaving part (call it z2) at this level, normally z2 ~ N(mean, std) where mean and std come from the staying part z1: z1 -> conv2d -> (mean, std). prior.log_prob(z2) is then counted as this multiscale layer's log-det Jacobian contribution.
So for the forward pass, the difference between the TFP one and the original version is that TFP doesn't calculate this log-det Jacobian term; instead it simply passes z2 through a tfb.Identity.
The inverse pass requires an upsampling op, since z2 was factored out in the forward pass. So the original version samples from the prior to regain z2. However, to be "really" invertible, people normally store all the z2 to achieve exact invertibility (see eps in split2d in the link above). For the inverse pass, the only difference is in sampling. TFP assumes all latents (by "all" I mean both the latent that keeps being transformed and those left behind by the exit bijector) obey one and the same distribution; or let's say they are treated as one. But in the original version they are not: each z2 obeys its own distribution, and only the part that stays until the end obeys the prior we set. That is to say, sampling consists of L+1 sampling steps, each drawn from a different distribution.
I think the version I posted in the gist follows the original version. The multiscale architecture is actually a surjector, not a bijector. Using Blockwise with the FactorOut I wrote results in no exact invertibility (the eps part in OpenAI's official implementation). It might need another parameter to indicate whether the z passed into the bijector contains "eps" or not; I'll refine it tomorrow.
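To make the difference concrete, here is a rough, standalone sketch (my own paraphrase, not TFP's or OpenAI's actual code) of the split prior described above. `shallow_conv` stands for the z1 -> conv2d -> (mean, std) layer and is assumed to output twice as many channels as z2:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions


def split_forward(z, shallow_conv):
    """Image-to-latent direction: factor out half the channels as z2."""
    z1, z2 = tf.split(z, 2, axis=-1)
    mean, log_std = tf.split(shallow_conv(z1), 2, axis=-1)
    prior = tfd.Normal(mean, tf.exp(log_std))
    # This log_prob term is what the original Glow adds at each scale; the
    # stored z2 plays the role of "eps" in openai/glow's split2d.
    extra_log_det = tf.reduce_sum(prior.log_prob(z2), axis=[1, 2, 3])
    return z1, z2, extra_log_det


def split_inverse(z1, shallow_conv, z2=None):
    """Latent-to-image direction: restore z2 (stored eps) or sample it."""
    mean, log_std = tf.split(shallow_conv(z1), 2, axis=-1)
    prior = tfd.Normal(mean, tf.exp(log_std))
    if z2 is None:
        # Each scale samples from its own conditional prior, so sampling the
        # full model involves L+1 draws from different distributions.
        z2 = prior.sample()
    return tf.concat([z1, z2], axis=-1)
```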
I am also seeing the memory leak: memory grows as the model trains.
I solved it by updating TensorFlow to the latest version.