meshgpt-pytorch
hyper-parameter suggestion
Thank you so much for the code. I am trying to train on a custom dataset of 116 meshes, based on @MarcusLoppe's example notebook. Although the code runs, I do not get good reconstructions. Please note that at this point I am only trying to train the VAE, not the autoregressive model.
Perhaps something is missing in the code, or I am making some fundamental mistake. It would be great if people who have managed to get this working could take a look at the code below.
```python
from meshgpt_pytorch import MeshAutoencoder

autoencoder = MeshAutoencoder(
    decoder_dims_through_depth=(128,) * 6 + (192,) * 12 + (256,) * 24 + (384,) * 6,
    # codebook_size = 2048 for a 250-face dataset; a higher face count probably requires 16k.
    # The default value is 16384.
    dim_codebook=192,
    dim_area_embed=16,
    dim_coor_embed=16,
    dim_normal_embed=16,
    dim_angle_embed=8,
    attn_decoder_depth=4,
    attn_encoder_depth=2,
).to("cuda")
```
The main training loop looks like this:
```python
from meshgpt_pytorch import MeshAutoencoderTrainer

def train(checkpoint):
    autoencoder = create_model(checkpoint)
    dataset = create_dataset()
    increase_dataset_size(dataset)

    batch_size = 16  # The batch size should be at most 64.
    grad_accum_every = 4
    # Set the largest batch size (up to 64) that your VRAM can handle, then use
    # grad_accum_every to reach an effective batch size of 64, e.g. 16 * 4 = 64.
    learning_rate = 1e-3  # Start with 1e-3; at stagnation around 0.35, lower it to 1e-4.

    # Set depending on the dataset size: on smaller datasets 0.1 is fine,
    # otherwise try values from 0.25 to 0.4.
    autoencoder.commit_loss_weight = 0.1

    autoencoder_trainer = MeshAutoencoderTrainer(
        model=autoencoder,
        warmup_steps=10,
        dataset=dataset,
        num_train_steps=10000,
        batch_size=batch_size,
        grad_accum_every=grad_accum_every,
        learning_rate=learning_rate,
        checkpoint_every_epoch=100,
        use_wandb_tracking=False,
        checkpoint_folder=f'{PROJECT_ROOT_DIR}/mesh_on_vessels/checkpoints',
    )
    autoencoder_trainer()
```
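For completeness, this is roughly how I inspect reconstructions after training. It is a minimal sketch, not exact code: I am assuming the `tokenize` and `decode_from_codes_to_faces` methods of the meshgpt-pytorch autoencoder, and that each dataset entry is a dict with `'vertices'` and `'faces'` tensors.

```python
import torch

@torch.no_grad()
def reconstruct(autoencoder, item):
    # Round-trip one mesh through the autoencoder: encode to codes,
    # then decode back to face coordinates.
    vertices = item['vertices'].unsqueeze(0).to("cuda")  # (1, V, 3)
    faces = item['faces'].unsqueeze(0).to("cuda")        # (1, F, 3)

    codes = autoencoder.tokenize(vertices=vertices, faces=faces)
    face_coords, face_mask = autoencoder.decode_from_codes_to_faces(codes)

    # face_coords: (1, F, 3, 3); keep only the valid faces.
    return face_coords[face_mask]
```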
I observe quite a few issues that I am not able to resolve:
- With `lr=1e-3`, the commit loss becomes negative and even exceeds -1. I read in other issues that this problem is automatically solved once the dataset size is increased; however, I do not have more samples, although the function `increase_dataset_size(dataset)` already increases the number of samples by a factor of 50 using the augmentation code from @MarcusLoppe (a sketch of it follows this list).
- What is the appropriate number of steps for which the model should be trained?
- In this configuration and with `lr = 1e-4`, I trained the model for 10,000 steps and the loss was stuck at 1.7; perhaps that explains why the reconstructions are so bad. Does that mean I should just let the model train for longer?
- Are the codebook size and the decoder dims appropriate?
- I use Open3D and reduce the mesh size with `simplify_quadric_decimation(800)` (snippet after this list). Could this step be a problem in this application?
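The augmentation works along these lines. This is a minimal sketch, not @MarcusLoppe's actual code: I assume the dataset exposes its samples as `dataset.data`, a list of dicts with `'vertices'` and `'faces'` tensors, and the scale/jitter ranges are illustrative.

```python
import torch

def increase_dataset_size(dataset, copies=50):
    # Duplicate each mesh `copies` times with a random uniform scale and a
    # small translation jitter, keeping vertices inside the unit cube.
    augmented = []
    for item in dataset.data:
        for _ in range(copies):
            vertices = item['vertices'].clone()
            vertices *= torch.empty(1).uniform_(0.75, 1.0)       # random scale
            vertices += torch.empty(1, 3).uniform_(-0.05, 0.05)  # random shift
            augmented.append({'vertices': vertices.clamp(-0.95, 0.95),
                              'faces': item['faces']})
    dataset.data.extend(augmented)
```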
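And the Open3D preprocessing step, roughly (the input path is a placeholder; the cleanup calls afterwards are standard `TriangleMesh` methods I run to remove decimation artifacts):

```python
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("vessel_001.stl")  # placeholder path
# Reduce the mesh to at most 800 triangles before training.
mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=800)
# Clean up artifacts that decimation can introduce.
mesh.remove_degenerate_triangles()
mesh.remove_duplicated_triangles()
mesh.remove_duplicated_vertices()
mesh.remove_non_manifold_edges()
```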
Again, thank you so much for your time and effort. I hope people can give me some pointers to solve this issue.
Thanks, Chinmay