hyper-parameter suggestion

Open chinmay5 opened this issue 7 months ago • 20 comments

Thank you so much for the code. I am trying to train on a custom dataset of 116 meshes, using a setup based on @MarcusLoppe's example notebook. The code runs, but the reconstructions are poor. Please note that at this point I am only training the VAE (the MeshAutoencoder), not the autoregressive model.

image

Perhaps something is missing in the code, or I am making some fundamental mistake. It would be great if people who have managed to get this running could take a look at the code.

from meshgpt_pytorch import MeshAutoencoder

autoencoder = MeshAutoencoder(
    decoder_dims_through_depth=(128,) * 6 + (192,) * 12 + (256,) * 24 + (384,) * 6,
    # codebook_size = 2048 is for the 250-face dataset; higher face counts probably require 16k.
    # It is not set here, so the default value of 16384 applies.
    dim_codebook=192,
    dim_area_embed=16,
    dim_coor_embed=16,
    dim_normal_embed=16,
    dim_angle_embed=8,
    attn_decoder_depth=4,
    attn_encoder_depth=2,
).to("cuda")

The main training loop is:

from meshgpt_pytorch import MeshAutoencoderTrainer

def train(checkpoint):
    autoencoder = create_model(checkpoint)
    dataset = create_dataset()
    increase_dataset_size(dataset)
    # Set the largest batch size (max 64) that your VRAM can handle, then use
    # grad_accum_every to reach an effective batch size of 64, e.g. 16 * 4 = 64.
    batch_size = 16
    grad_accum_every = 4
    learning_rate = 1e-3  # Start with 1e-3; once the loss stagnates around 0.35, you can lower it to 1e-4.

    # Set depending on the dataset size: 0.1 is fine for smaller datasets, otherwise try 0.25 to 0.4.
    autoencoder.commit_loss_weight = 0.1
    autoencoder_trainer = MeshAutoencoderTrainer(
        model=autoencoder,
        warmup_steps=10,
        dataset=dataset,
        num_train_steps=10000,
        batch_size=batch_size,
        grad_accum_every=grad_accum_every,
        learning_rate=learning_rate,
        checkpoint_every_epoch=100,
        use_wandb_tracking=False,
        checkpoint_folder=f'{PROJECT_ROOT_DIR}/mesh_on_vessels/checkpoints',
    )
    autoencoder_trainer()
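
For context on the numbers above, here is a rough sketch of the step arithmetic. It assumes that one training step is one optimizer update over batch_size * grad_accum_every samples, and that the augmentation yields about 116 * 50 samples. Neither assumption has been checked against the trainer, and `training_budget` is a made-up helper name, not part of meshgpt-pytorch.

```python
import math


def training_budget(num_samples, batch_size, grad_accum_every, num_train_steps):
    """Hypothetical helper: estimate how many passes over the data a run makes.

    Assumes one training step consumes batch_size * grad_accum_every samples.
    """
    effective_batch = batch_size * grad_accum_every
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    epochs = num_train_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs


# 116 meshes, augmented by an assumed factor of 50
budget = training_budget(
    num_samples=116 * 50,
    batch_size=16,
    grad_accum_every=4,
    num_train_steps=10_000,
)
effective_batch, steps_per_epoch, epochs = budget
print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"approx. epochs:       {epochs:.1f}")
```

Under these assumptions, 10000 steps correspond to a little over a hundred passes over the augmented dataset, which may help when judging whether "train for longer" is a plausible fix.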

I see several issues that I have not been able to resolve.

  1. With lr=1e-3 the commit loss becomes negative and even drops below -1. I read in other issues that this problem goes away once the dataset size is increased, but I do not have more samples. Note that the function increase_dataset_size(dataset) already increases the number of samples by a factor of about 50, using the augmentation code from @MarcusLoppe.
  2. What is an appropriate number of steps to train the model for?
  3. In this configuration, with lr = 1e-4, I trained the model for 10000 steps and the loss was stuck at 1.7. Perhaps that explains why the reconstructions are so bad. Does that mean I should just let the model train for longer?
  4. Are the codebook size and decoder dims appropriate?
  5. I use open3d to reduce the mesh size with simplify_quadric_decimation(800). Could this step be a problem in this application?
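
Regarding the decimation question, one thing that can be checked independently of the model is whether the simplified meshes stay within the intended face budget and contain no degenerate triangles. The sketch below is plain Python and makes no claim about what meshgpt-pytorch requires; `check_mesh`, its `max_faces` default of 800, and the tiny example mesh are all assumptions for illustration.

```python
import math


def triangle_area(a, b, c):
    """Area of the triangle (a, b, c): half the norm of the edge cross product."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def check_mesh(vertices, faces, max_faces=800, eps=1e-12):
    """Hypothetical sanity check for a decimated triangle mesh.

    Flags faces that repeat a vertex index or have (near-)zero area.
    """
    degenerate = []
    for idx, (i, j, k) in enumerate(faces):
        repeated = len({i, j, k}) < 3
        if repeated or triangle_area(vertices[i], vertices[j], vertices[k]) <= eps:
            degenerate.append(idx)
    return {
        "num_faces": len(faces),
        "within_budget": len(faces) <= max_faces,
        "degenerate_faces": degenerate,
    }


# Toy example: one valid face, one collinear face, one face with a repeated index.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
faces = [(0, 1, 2), (0, 1, 3), (0, 0, 1)]
report = check_mesh(vertices, faces)
print(report)
```

With open3d, the inputs could be taken from a mesh via `mesh.vertices` and `mesh.triangles` (converted to lists or arrays), and the check run on every decimated mesh before training.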

Again, thank you so much for your time and effort. I hope someone can give me some pointers to solve this issue.

Thanks, Chinmay

chinmay5, Jul 11 '24 14:07