
VAE - Training

Open palmex opened this issue 1 year ago • 10 comments

I'm having a hard time recreating your results. I'm trying to retrain the VAE from scratch: LR 1e-4, Adam, 512 embedding size. The validation error seems to be leveling off, and I don't think brute-forcing to 150K epochs would solve this issue.

Would it be possible to share your loss function curves?

[Screenshot: validation loss curve, 2024-05-27]

palmex avatar May 18 '24 16:05 palmex

Need to turn debug mode off

palmex avatar Jun 26 '24 23:06 palmex

I meet the same problem. Could you please share the correct loss curve?

aixiaodewugege avatar Dec 10 '24 05:12 aixiaodewugege

My problem ended up being that there is a debug flag which the authors used for iterating quickly. In debug mode, only ~100 examples are loaded. The batch size also needs to be increased to 256 for the VAE, and make sure you are using the stage1 config.

In the dataloader, there is also a caching function which checks for the existence of a ./tmp folder and loads a .pkl file instead of loading in the entire dataset again.
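The gotcha with this kind of caching is that a `.pkl` written during a debug run (with only ~100 examples) will silently be reused on a full run. A minimal sketch of the pattern, with illustrative function and file names rather than the repo's actual identifiers:

```python
import os
import pickle

def load_dataset(build_fn, cache_path="./tmp/dataset_cache.pkl"):
    """Load a cached dataset pickle if present, otherwise build and cache it.

    NOTE: if the cache was written during a debug run, the truncated
    dataset is returned as-is -- delete ./tmp to force a rebuild.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # stale cache is returned without any check
    data = build_fn()
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

So after turning debug mode off, remember to clear the `./tmp` folder, otherwise the small cached dataset will be loaded again.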

[Screenshot: loss curves after the fix, 2024-12-10]

palmex avatar Dec 10 '24 19:12 palmex

Thanks, that really helps! But my commit loss looks strange. Does yours look the same? image

aixiaodewugege avatar Dec 11 '24 03:12 aixiaodewugege

Yes

palmex avatar Dec 11 '24 17:12 palmex

@palmex @aixiaodewugege I am so sorry about your troubles caused by "this debug flag". If you have any other questions, feel free to ask. I'm more than happy to help.

ChenFengYe avatar Dec 12 '24 00:12 ChenFengYe


Hello, I'm new to VQ-VAEs and am currently training a face motion VQ-VAE using your code. However, the validation loss curve suggests overfitting. Could you please offer some advice on addressing this? image

image

aixiaodewugege avatar Dec 12 '24 02:12 aixiaodewugege

@aixiaodewugege Based on the loss-feature/train and loss-feature/val curves, there is clear overfitting. The commit loss reflects the efficiency of the codebook, and yours looks roughly correct.

One question that could help: how does this feature differ from the FLAME parameters (I assume "flame" refers to the FLAME parameters)? A general solution to this problem is to apply data augmentation and to use dropout with masking on both the input features and the loss during training.
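The input-feature masking suggested above can be sketched as randomly zeroing feature entries during training. This is a generic illustration with numpy, not MotionGPT's actual code; the function name and drop probability are assumptions:

```python
import numpy as np

def mask_features(features, drop_prob=0.1, rng=None):
    """Randomly zero out entries of a (T, D) feature array during training.

    T frames of D-dim motion features; each entry is dropped independently
    with probability `drop_prob`. Apply only at train time, not validation.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(features.shape) >= drop_prob  # boolean keep-mask
    return features * keep
```

The same idea extends to the loss: compute the reconstruction loss only over a random subset of frames or dimensions, which acts as a regularizer against overfitting.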

ChenFengYe avatar Dec 13 '24 19:12 ChenFengYe

@ChenFengYe Thank you for your reply! It was really helpful. The FLAME model represents the vertex loss. After adding dropout, the results look better, but I still notice a gap between the training and validation loss. How can I minimize this gap?

Additionally, could you clarify what you mean by data augmentation?

aixiaodewugege avatar Dec 17 '24 05:12 aixiaodewugege

@aixiaodewugege Data augmentation could help to minimize this gap. For example, apply RTS (rotation, translation, scale) transforms to the training data (such as vertices or XYZ coordinates). Choose augmentations that match your target tasks.
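The RTS augmentation above can be sketched as a random yaw rotation, small translation, and mild scaling applied to XYZ points. This is an illustrative sketch with assumed ranges, not the repo's implementation:

```python
import numpy as np

def rts_augment(points, rng=None):
    """Apply a random rotation (about the up/Y axis), translation, and
    scale to an (N, 3) array of XYZ points.

    The ranges below are illustrative; tune them to your data.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * np.pi)        # random yaw angle
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])                 # rotation about Y
    t = rng.uniform(-0.1, 0.1, size=3)           # small random translation
    scale = rng.uniform(0.9, 1.1)                # mild random scaling
    return scale * (points @ R.T) + t
```

Since rotation, translation, and uniform scaling preserve the relative geometry of the points, the augmented samples remain valid shapes while expanding the effective training distribution.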

ChenFengYe avatar Dec 18 '24 02:12 ChenFengYe