motion-diffusion-model
MDM training runs are randomly killed.
A month ago, I was able to train MDM to completion. Now, although I haven't made any modifications, my training task gets killed at random points: sometimes at 10,000 steps, sometimes at 200,000 steps. I checked and confirmed that there was no memory overflow. I even re-downloaded the MDM code, but the task still gets killed randomly. Could this be a bug, or is there an issue with my machine? Could you please give me some advice?

My training command is:

python -m train.train_mdm --save_dir save/my_humanml_trans_dec_bert_512 --dataset humanml --diffusion_steps 50 --arch trans_dec --text_encoder_type bert --mask_frames --use_ema
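To narrow down whether the process is being killed from outside or is exiting on its own, one option is to launch the training through a small wrapper that reports how the child process ended. The following is only a minimal sketch under assumptions: the helper `describe_exit` and the file name `watch_exit.py` are hypothetical and not part of the MDM code, and the signal decoding relies on POSIX behavior of Python's `subprocess` module.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    Hypothetical helper for illustration; not part of the MDM repository.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports a negative return code -N
        # when the child was terminated by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Everything after the script name is treated as the command to run, e.g.
    #   python watch_exit.py python -m train.train_mdm --save_dir ... --use_ema
    cmd = sys.argv[1:]
    if cmd:
        result = subprocess.run(cmd)
        print(describe_exit(result.returncode), file=sys.stderr)
```

If the wrapper reports SIGKILL, the kill most likely came from outside the training script (for example the kernel's out-of-memory killer or a job scheduler), whereas a positive exit status would point at an error raised inside the code.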