CASA-Dialogue-Act-Classifier
training seems to be too slow?
Hi, I have a system with a Ryzen Threadripper 24-core processor and a Titan RTX card, but when I run the training script it takes more than 6 seconds per iteration. Everything is set up with your default parameters. If I trained it for 100 epochs it would potentially take 600 hours :)
Epoch 0: 0%| | 13/3337 [01:29<6:20:41, 6.87s/it, loss=3.736, v_num=48]
Is this normal, or is there something we can tune to improve performance? Thanks
Hi @PolKul, yes, this is normal: each utterance needs the dialogue history, so the training can't be parallelized. That said, there is a Kaggle Kernel to train it on Kaggle compute, which takes around 1 hr/epoch.
Another thing you can do is, instead of running the evaluation after every epoch, evaluate on a smaller subset and only after every k epochs (where k > 1); this can be configured in the `pl.Trainer` call at line 43 of main.py, as sketched below. You needn't worry about 100 epochs: it will converge much sooner than that, and there is `EarlyStopping` in place.
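For reference, a minimal sketch of how less frequent and cheaper validation plus early stopping can be configured through the PyTorch Lightning `Trainer`; the exact arguments in this repo's main.py may differ, and the monitored metric name and values below are assumptions:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# Sketch only: argument names follow the PyTorch Lightning 1.x Trainer API.
trainer = pl.Trainer(
    gpus=1,
    max_epochs=100,
    check_val_every_n_epoch=5,   # run validation only every k = 5 epochs
    limit_val_batches=0.25,      # evaluate on 25% of the validation set
    callbacks=[EarlyStopping(monitor="val_accuracy", mode="max", patience=3)],
)
```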
Also, it takes that much time because the dataset is quite large. @glicerico has trained checkpoints, so if he can share them that will be very helpful. You can run the evaluation directly, and/or you can tweak the training configuration manually (or with a hyperparameter search) and re-train from the checkpoint; it will take only a few epochs to converge again.
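As a rough sketch of that last suggestion: resuming from a shared checkpoint and fine-tuning for a few epochs might look like the snippet below. The import path and class name `DialogueActClassifier` are placeholders, and the dataloader wiring should follow whatever main.py actually does:

```python
import pytorch_lightning as pl
# Placeholder import: use the actual LightningModule class from this repo.
from models import DialogueActClassifier

# Restore the trained weights; if the module's __init__ args were not saved with
# self.save_hyperparameters(), pass them here as keyword arguments.
model = DialogueActClassifier.load_from_checkpoint("path/to/checkpoint.ckpt")

# Fine-tune for a handful of epochs instead of training from scratch.
trainer = pl.Trainer(gpus=1, max_epochs=5)
trainer.fit(model)  # supply the same dataloaders / datamodule that main.py uses
```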
Hope this helps.
Hi @macabdul9,
Thank you for the comments. Yes, it would really help if you could share the trained checkpoint. Could you please share it with me? Thanks.
I am uploading it to Dropbox to share, but it's about half a GB in size and GitHub has a 100 MB file size limit. Do you have somewhere to host the checkpoint, @macabdul9?
@PolKul here's the checkpoint for my trained model: https://www.dropbox.com/s/y42bw6qmoa9b8k2/epoch%3D29-val_accuracy%3D0.748834.ckpt?dl=0 However, I just realized that a new commit fixes the class order suggested in issue https://github.com/macabdul9/CASA-Dialogue-Act-Classifier/issues/5, so I am not sure whether that makes this checkpoint unusable. Let me know how it works for you.
Yes, @glicerico, it will not be usable as-is, but if you still have the label dictionary from your training run, then it can be used.
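For context, the "label dictionary" is just the mapping from dialogue-act tags to the integer class indices used during training; the order matters because the classifier head's outputs are interpreted through it. A minimal sketch of persisting such a mapping alongside a checkpoint, assuming the tags are collected during preprocessing (the tag list and file name below are placeholders):

```python
import json

# Hypothetical example: rebuild the tag -> index mapping the same way the
# training run did, then save it next to the checkpoint for reuse.
tags = ["sd", "b", "sv", "aa", "%"]  # placeholder: the real SwDA tag set is larger
label_dict = {tag: idx for idx, tag in enumerate(sorted(tags))}

with open("label_dict.json", "w") as f:
    json.dump(label_dict, f, indent=2)

# At evaluation time, invert it to map predicted indices back to tags.
with open("label_dict.json") as f:
    idx_to_tag = {v: k for k, v in json.load(f).items()}
```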
@glicerico, thank you for the checkpoint. But as I understand from @macabdul9's comment, I cannot use it without the "label dictionary", right? If you have that dictionary, maybe you can send it as well?
Unfortunately, I don't have a label dictionary. I will soon restart re-training after the fix for issue #5.
@PolKul, @macabdul9, here's a checkpoint after the fix to issue #5:
https://www.dropbox.com/s/hn3d3c273aiyymo/epoch%3D29-val_accuracy%3D0.751411.ckpt?dl=0
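If it helps, a quick way to sanity-check a downloaded .ckpt before training or evaluating against it is to inspect it directly; a sketch, assuming a standard PyTorch Lightning checkpoint (the available keys depend on the Lightning version and on whether `save_hyperparameters()` was called):

```python
import torch

# Inspect the checkpoint contents without instantiating the model.
ckpt = torch.load("epoch=29-val_accuracy=0.751411.ckpt", map_location="cpu")

print(ckpt.keys())                           # typically: epoch, state_dict, optimizer_states, ...
print(ckpt.get("hyper_parameters"))          # present only if save_hyperparameters() was used
print(list(ckpt["state_dict"].keys())[:5])   # first few weight tensor names
```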
Still an issue for me, even with Kaggle GPUs; I'm getting through about three quarters of an epoch per hour. Was wondering if there is a checkpoint available, or a way to reach a usable checkpoint faster on Kaggle GPUs.