torch-gqn
How long does training usually take?
Whether I use 2x NVIDIA GeForce RTX 2080 Ti, 4x Tesla V100-SXM2-32GB, or 8x Tesla V100-SXM2-32GB, it takes about 20 days to train shepard_metzler_5_parts. I want to know how long training on shepard_metzler_5_parts usually takes.
[Edit: Training the model on the Shepard Metzler dataset for 300k iterations should be enough. These are the speeds I remember getting:

- 12 layers and `shared_core=false`: +/- 3 it/s with an RTX 3090, +/- 5 it/s with a Tesla V100 (Google Colab).
- 12 layers and `shared_core=true`: about 24 hours to train for 300k steps with an RTX 2080; I observed similar speeds with an RTX 3070.]
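
For a rough sense of how those speeds translate into wall-clock time, here is a back-of-envelope calculation (a sketch using only the numbers quoted above, not a benchmark):

```python
# Rough wall-clock estimate from iterations/second.
total_iters = 300_000

for gpu, it_per_s in [("RTX 3090, shared_core=false", 3.0),
                      ("Tesla V100, shared_core=false", 5.0)]:
    hours = total_iters / it_per_s / 3600
    print(f"{gpu}: ~{hours:.0f} h for {total_iters:,} iterations")

# RTX 3090: ~28 h; Tesla V100: ~17 h -- the same ballpark as the
# ~24 h reported for shared_core=true on an RTX 2080.
```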
There is a small mistake in almost all implementations of the GQN model which causes the model to have an enormous number of parameters. You'll see a `shared_core` parameter (in the GQN constructor, iirc), and I would suggest setting it to true. Setting it to false makes a separate VAE for each time-step, which increases the number of parameters dramatically, and consequently the training time.
This is not how the DRAW (and ConvDRAW) model was designed (the GQN generator is a ConvDRAW model). DRAW is a recurrent model that generates an image in a fixed number of steps and requires the hidden state of the previous time-step as input. Setting `shared_core` to true creates one ConvDRAW generator and uses it recurrently, as intended; setting it to false just makes a chain of VAE models, which is not a ConvDRAW model. The sketch below illustrates the difference.
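
To make that concrete, here is a minimal PyTorch sketch (not the repository's actual code; `GeneratorCore`, `build_generator_cores`, and the channel sizes are invented for illustration):

```python
import torch
import torch.nn as nn

class GeneratorCore(nn.Module):
    """A ConvLSTM-style generator core, heavily simplified."""
    def __init__(self, channels=64):
        super().__init__()
        # One convolution produces all four LSTM gates from the
        # concatenation of [previous hidden, latent z, context].
        self.gates = nn.Conv2d(channels * 3, channels * 4,
                               kernel_size=5, padding=2)

    def forward(self, h, c, z, context):
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([h, z, context], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def build_generator_cores(n_steps, shared_core):
    if shared_core:
        # The SAME module is applied at every step: true DRAW recurrence,
        # so the parameter count is independent of n_steps.
        core = GeneratorCore()
        return [core] * n_steps
    # A distinct module per step: parameters grow linearly with n_steps.
    return nn.ModuleList(GeneratorCore() for _ in range(n_steps))

# Parameter comparison for 12 generation steps:
count = lambda cores: sum(p.numel() for m in set(cores) for p in m.parameters())
print(count(build_generator_cores(12, shared_core=True)))   # 1x the core
print(count(build_generator_cores(12, shared_core=False)))  # ~12x the core
```

With sharing, the number of generation steps only affects compute, not model size, which is what brings the parameter count (and the training time) back down.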
I am not sure if this was intended or not by the repository authors.
Note: @lihao11 My thesis was about this model, and I made a modification that makes it possible to create a multi-layer GQN generator, where each layer acts as a proper RNN. It is also possible to set a separate number of time-steps per layer and a resolution scaling (i.e. layer 1 generates low res, layer 2 doubles the resolution, ...). If you need it, I can ask my uni if I'm allowed to make the code public.