CASA-Dialogue-Act-Classifier

training seems to be too slow?

Open PolKul opened this issue 4 years ago • 9 comments

Hi, I have a system with a Ryzen Threadripper 24-core processor and a Titan RTX card. But when I run the training script, it takes more than 6 seconds per iteration. Everything is set up with your default parameters. If I train it for 100 epochs, it could take around 600 hours :)

Epoch 0: 0%| | 13/3337 [01:29<6:20:41, 6.87s/it, loss=3.736, v_num=48]

Is this normal, or is there something we can tune to improve performance? Thanks
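The 600-hour figure can be checked from the progress bar itself. The sketch below only multiplies the numbers shown there (3337 iterations per epoch at 6.87 s/it) and ignores validation time, so it is a rough lower bound rather than a measurement. The function name is illustrative.

```python
# Rough training-time estimate from the progress bar quoted above.
ITERATIONS_PER_EPOCH = 3337    # from "13/3337"
SECONDS_PER_ITERATION = 6.87   # from "6.87s/it"
EPOCHS = 100


def estimate_hours(iterations: int, sec_per_it: float, epochs: int = 1) -> float:
    """Return the estimated wall-clock training time in hours."""
    return iterations * sec_per_it * epochs / 3600


hours_per_epoch = estimate_hours(ITERATIONS_PER_EPOCH, SECONDS_PER_ITERATION)
total_hours = estimate_hours(ITERATIONS_PER_EPOCH, SECONDS_PER_ITERATION, EPOCHS)

print(f"{hours_per_epoch:.1f} h per epoch, {total_hours:.0f} h for {EPOCHS} epochs")
# prints "6.4 h per epoch, 637 h for 100 epochs"
```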

PolKul avatar Jan 27 '21 18:01 PolKul

Hi @PolKul, yes, this is normal: each utterance needs its dialogue history, so we can't parallelize the training. There is, however, a Kaggle Kernel for training it on Kaggle compute, which takes around 1 hr/epoch. Another thing you can do is, instead of running evaluation after every epoch, evaluate on a smaller portion of the data and only after every k epochs (where k > 1). This can be configured in `pl.Trainer` (line 43 of `main.py`). You need not worry about 100 epochs: it will converge well before that, and there's `EarlyStopping` in place.
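A minimal sketch of that suggestion, assuming the standard PyTorch Lightning `Trainer` arguments `check_val_every_n_epoch` and `limit_val_batches`. The helper names, the default values, and the monitored metric name `val_accuracy` (taken from the checkpoint file names in this thread) are assumptions, not the repository's actual configuration.

```python
def build_trainer_kwargs(validate_every_k_epochs: int = 5,
                         val_fraction: float = 0.25,
                         max_epochs: int = 100) -> dict:
    """Return pl.Trainer keyword arguments that validate less often and on less data."""
    if validate_every_k_epochs < 1:
        raise ValueError("validate_every_k_epochs must be >= 1")
    if not 0.0 < val_fraction <= 1.0:
        raise ValueError("val_fraction must be in (0, 1]")
    return {
        "max_epochs": max_epochs,
        # Run the validation loop only every k-th epoch.
        "check_val_every_n_epoch": validate_every_k_epochs,
        # A float is interpreted as a fraction of the validation batches.
        "limit_val_batches": float(val_fraction),
    }


def make_trainer(**overrides):
    """Create the Trainer (Lightning is imported lazily, only when this is called)."""
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping

    # Assumed metric name; patience counts validation checks, not epochs,
    # so with k = 5 a patience of 3 means 15 epochs without improvement.
    early_stopping = EarlyStopping(monitor="val_accuracy", mode="max", patience=3)
    return pl.Trainer(callbacks=[early_stopping], **build_trainer_kwargs(**overrides))


kwargs = build_trainer_kwargs(validate_every_k_epochs=3, val_fraction=0.5)
```

Calling `make_trainer(validate_every_k_epochs=3, val_fraction=0.5)` would then build the trainer. Validating on a subset makes the reported accuracy noisier, so a final full evaluation is still worthwhile.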

Also, it takes that long because the dataset is quite large. @glicerico has trained checkpoints, so it would be very helpful if he could share them. You can run the evaluation directly, and/or you can tweak the training configuration manually (or with a hyperparameter search) and re-train from the checkpoint, which will take only a few epochs to converge again.

Hope this helps.

macabdul9 avatar Jan 27 '21 19:01 macabdul9

Hi @macabdul9,

Thank you for the comments. Yes, it would really help if you could share the trained checkpoint. May I ask you to share it with me, please? Thanks.

PolKul avatar Jan 28 '21 00:01 PolKul

I am uploading it to Dropbox to share, but it's about half a GB in size, and GitHub has a 100 MB file size limit. Do you have some place to host the checkpoint, @macabdul9?

glicerico avatar Jan 28 '21 01:01 glicerico

@PolKul here's the checkpoint for my trained model: https://www.dropbox.com/s/y42bw6qmoa9b8k2/epoch%3D29-val_accuracy%3D0.748834.ckpt?dl=0 However, I just realized that there is a new commit implementing the change suggested in issue https://github.com/macabdul9/CASA-Dialogue-Act-Classifier/issues/5 to fix the class order, so I am not sure if that makes this checkpoint unusable. Let me know how it works for you.

glicerico avatar Jan 28 '21 06:01 glicerico

Yes, @glicerico, it will not be usable as-is, but if you have the label dictionary from your training run, then it will be.
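To illustrate why the label dictionary matters: a checkpoint only outputs class indices, so the label-to-index mapping used during training has to be kept alongside it. The sketch below shows one way to build and persist such a mapping deterministically. The function names and the example tags are hypothetical, and this is not necessarily how the repository builds its labels.

```python
import json
from pathlib import Path


def build_label_dict(labels):
    """Map each distinct label to an integer index, in sorted order so it is reproducible."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def save_label_dict(label_dict, path):
    """Write the mapping as JSON so it can be stored next to the checkpoint."""
    Path(path).write_text(json.dumps(label_dict, indent=2, sort_keys=True), encoding="utf-8")


def load_label_dict(path):
    """Read a mapping previously written by save_label_dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Illustrative dialogue-act tags only; not the dataset's real label set.
train_labels = ["sd", "b", "sv", "sd", "aa", "b"]
label_dict = build_label_dict(train_labels)

# Inverse mapping, used to turn predicted indices back into tags.
index_to_label = {idx: label for label, idx in label_dict.items()}
```

If the class order changes between training and inference, the same predicted index decodes to a different tag, which is why a checkpoint trained before the fix cannot be used without its original mapping.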

macabdul9 avatar Jan 28 '21 08:01 macabdul9

@glicerico, thank you for the checkpoint. But as I understand from @macabdul9's comment, I cannot use it without the "label dictionary", right? If you have that dictionary, could you send it as well?

PolKul avatar Jan 28 '21 23:01 PolKul

Unfortunately, I don't have a label dictionary. I will restart training soon, now that the fix for issue #5 is in.

glicerico avatar Jan 28 '21 23:01 glicerico

@PolKul , @macabdul9 Here's a checkpoint after the fix to issue #5

https://www.dropbox.com/s/hn3d3c273aiyymo/epoch%3D29-val_accuracy%3D0.751411.ckpt?dl=0

glicerico avatar Jan 31 '21 21:01 glicerico

This is still an issue for me, even with Kaggle GPUs: I'm getting about three quarters of an epoch per hour. I was wondering if there is a checkpoint available, or a way to grab the fastest checkpoint for Kaggle GPUs.

macksjeremy avatar Mar 11 '21 02:03 macksjeremy