
Memory leak with tfb.Glow ?

Open gitlabspy opened this issue 5 years ago • 9 comments

Info: tf-nightly-gpu and tfp-nightly; code tested on a single Tesla V100, Ubuntu, CUDA 11.0, cuDNN 8. RAM grows linearly while training a tfb.Glow-based tfd.TransformedDistribution: it already uses 16 GB when training starts and grows by about 0.1 GB every 5 seconds. Here's my code:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import tensorflow as tf
import tensorflow_probability as tfp

tfb = tfp.bijectors
tfd = tfp.distributions

def augment(img):
    # Scale images to [0, 1].
    return tf.cast(img, tf.float32) / 255.

(train_data, _), (test_data, _) = tf.keras.datasets.cifar10.load_data()
train_data = tf.data.Dataset.from_tensor_slices(train_data).map(augment).batch(128)

output_shape = (32, 32, 3)
event_size = tf.math.reduce_prod(output_shape)
transformed_distribution = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(
        loc=tf.zeros([event_size]),
        scale_diag=tf.ones([event_size])),
    bijector=tfb.Glow(output_shape=output_shape,
                      coupling_bijector_fn=tfb.GlowDefaultNetwork,
                      exit_bijector_fn=tfb.GlowDefaultExitNetwork),
    name='Glow_distribution')

optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        # Negative log-likelihood, averaged over the batch.
        loss = -tf.reduce_mean(transformed_distribution.log_prob(x))
    variables = tape.watched_variables()
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

for epoch in range(100):
    for td in train_data:
        loss = train_step(td)
        print(loss, '*', end='\r')
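
For reference, a minimal sketch for tracking the reported growth while the loop runs (psutil is not part of the script above; it is only added here for measurement):

import psutil

process = psutil.Process(os.getpid())
for epoch in range(100):
    for td in train_data:
        loss = train_step(td)
        print(loss, '*', end='\r')
    # Print the resident set size once per epoch to see whether it keeps growing.
    print(f"epoch {epoch}: RSS = {process.memory_info().rss / 2**30:.2f} GiB")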

Some log output:

WARNING:tensorflow:From /home/abc/anaconda3/envs/tfn/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5047: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
2020-11-24 20:40:11.190610: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)

It also takes about 3 minutes to compile the AutoGraph function before training starts.

gitlabspy avatar Nov 24 '20 08:11 gitlabspy

I tried the same code on a different machine (almost the same configuration and environment, except for the Ubuntu version, I think), and memory no longer grows that fast; I can barely notice it growing at all.

Then I modified glow.py a little to let kwargs pass through to the RealNVP blocks: I changed make_bijector_fn in GlowBlock and added a name_scope for each bijector in GlowBlock so that I can identify which bijector should receive which kwargs. And then, somehow magically, it no longer occupies 16 GB of RAM; it uses about 6000 MB during training. So I'll close this issue.
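
For reference, here is a rough sketch of passing extra keyword arguments into the default coupling network without editing glow.py, by wrapping the network constructor (this assumes GlowDefaultNetwork accepts a num_hidden keyword; adjust to whatever your TFP version actually exposes):

import functools

# Hypothetical: shrink the hidden width of the default coupling network.
small_coupling_fn = functools.partial(tfb.GlowDefaultNetwork, num_hidden=256)

glow = tfb.Glow(output_shape=(32, 32, 3),
                coupling_bijector_fn=small_coupling_fn,
                exit_bijector_fn=tfb.GlowDefaultExitNetwork)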

Wow!

gitlabspy avatar Nov 30 '20 09:11 gitlabspy

As the model gets bigger (image size (64, 64, 3) and num_glow_blocks=4), the memory leak still exists, although it grows more slowly. Graph compilation is also extremely slow: it takes about 10 minutes before training starts. The original setup (image size (32, 32, 3), num_glow_blocks=3, num_steps_per_block=32) is already slow, but the (64, 64, 3) / num_glow_blocks=4 configuration is even slower. Please help me!

gitlabspy avatar Dec 07 '20 04:12 gitlabspy

PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                                               
34953 root        20   0  126G  102G 1521M S 111. 40.7  1h43:31 python

I encounter the same issue with other bijectors. It takes 103 GB of RAM, which is a bit insane... Any tips to fix this?

gitlabspy avatar Dec 27 '20 12:12 gitlabspy

Which other bijectors did you try? I tried to reproduce the original issue, and at first glance the issue was that Glow just has a gigantic graph with its default parameters. I reduced num_steps_per_block to 16 and ran it on Google Colab's GPU runtime, and didn't observe any leaks, although the graph tracing(?) and overall memory usage still seemed excessive.
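
For concreteness, the reduced configuration would look roughly like this (keyword names as in the then-current tfp-nightly; they may differ across versions):

glow_small = tfb.Glow(
    output_shape=(32, 32, 3),
    num_glow_blocks=3,           # default
    num_steps_per_block=16,      # reduced from the default of 32
    coupling_bijector_fn=tfb.GlowDefaultNetwork,
    exit_bijector_fn=tfb.GlowDefaultExitNetwork)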

I agree we should make this better, somehow, but we'd first need to figure out what is actually taking up the memory.

SiegeLordEx avatar Dec 28 '20 20:12 SiegeLordEx

As I mentioned above, the memory leak was solved by switching to another machine, and I haven't encountered that issue again. The remaining problem is that the bijectors take too much RAM and are slow at building the graph (it takes very long to start training when the model is big). I tried RealNVP with the affine bijector (the one implemented in Glow), RealNVP with RationalQuadraticSpline, and FFJORD. They all take a huge amount of RAM even though they use less than about 9000 MB of GPU memory.

gitlabspy avatar Dec 29 '20 04:12 gitlabspy

I have one more question about Glow. Would you mind answering it for me, please? @SiegeLordEx I noticed the multiscale architecture (ExitBijector) is a bit different from the one originally proposed in Glow.

I made a new one following the original repo. Would you mind checking whether it is correct? https://gist.github.com/gitlabspy/d47fe54a931145b725c0fb5b92e24690

gitlabspy avatar Jan 05 '21 07:01 gitlabspy

Could you comment on how your version differs from TFP's (at a computational level; it might well arrive at the same result in a different way)? In particular, we don't add randomness at each scale explicitly; rather, we add it implicitly by passing part of the latent vector through unchanged (see the diagram here: https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/bijectors/glow.py#L139). We certainly referred to the original version when implementing TFP's multiscale architecture, but perhaps we made a mistake somewhere.

As for the memory usage, I don't have any progress to report just yet.

SiegeLordEx avatar Jan 08 '21 22:01 SiegeLordEx

Yes, the leaving part is unchanged, but it has to obey a prior parameterized by the staying part (normally a Gaussian whose mean and std are produced by passing the staying part through a shallow layer). See here: https://github.com/openai/glow/blob/91b2c577a5c110b2b38761fc56d81f7d87f077c1/model.py#L546-584.

In the forward pass (image to latent), the leaving part leaves and can be stored unchanged. Additionally, there is a split prior over the leaving part (call it z2 now) at this level, normally z2 ~ N(mean, std) where mean and std come from the staying part z1: z1 -> conv2d -> (mean, std). prior.log_prob(z2) is then counted as this multiscale layer's log-det Jacobian contribution. So for the forward pass, the difference between the TFP version and the original is that TFP doesn't compute this log-det term; it simply passes z2 through a tfb.Identity.
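
A minimal sketch of that split prior (the names here are illustrative, not TFP or OpenAI API; the shallow conv layer is assumed to output twice the channels of z2):

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def split_prior_log_prob(z, shallow_conv):
    # Split channels into the staying part z1 and the leaving part z2.
    z1, z2 = tf.split(z, 2, axis=-1)
    # The staying part parameterizes a Gaussian over the leaving part.
    mean, log_scale = tf.split(shallow_conv(z1), 2, axis=-1)
    prior = tfd.Normal(loc=mean, scale=tf.exp(log_scale))
    # This term is what the original Glow adds to the objective at each scale.
    return z1, z2, tf.reduce_sum(prior.log_prob(z2), axis=[1, 2, 3])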

The inverse pass requires an upsampling op, since z2 was factored out in the forward pass, so the original version samples from the prior to regain z2. However, to be truly invertible, people normally store all the z2 values (see eps in split2d in the link above). For the inverse pass, the only difference is in the sampling step. TFP assumes all latents (by all I mean both the latents that keep being transformed and those left behind by the exit bijector) obey one and the same distribution. In the original version they do not: each z2 obeys its own distribution, and only the part that stays until the end obeys the prior we set. In other words, sampling consists of L+1 sampling steps, each from a different distribution.
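
And a companion sketch for the sampling direction, re-using the names above (again illustrative): when the stored eps is unavailable, z2 is re-drawn from the same conditional prior, which is why sampling involves L+1 draws from different distributions:

def split_prior_inverse(z1, shallow_conv, eps=None):
    mean, log_scale = tf.split(shallow_conv(z1), 2, axis=-1)
    if eps is None:
        # Sample a fresh z2 (what the original repo does when generating).
        eps = tf.random.normal(tf.shape(mean))
    # Re-using a stored eps here instead gives exact invertibility.
    z2 = mean + tf.exp(log_scale) * eps
    return tf.concat([z1, z2], axis=-1)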

I think the version I posted in the gist follows the original. The multiscale architecture is actually a surjector, not a bijector. Using Blockwise with the FactorOut I wrote gives no exact invertibility (the eps part in OpenAI's official implementation). It might need another parameter telling whether the z passed into the bijector contains the "eps" or not; I'll refine it tomorrow.

gitlabspy avatar Jan 09 '21 18:01 gitlabspy

I am also reproducing the memory leak: memory grows as the model trains.

ivallesp avatar Jan 21 '22 17:01 ivallesp

I solved it by updating TensorFlow to the latest version.

ivallesp avatar Sep 01 '22 00:09 ivallesp