
Training step is too slow

Open guoyang9 opened this issue 4 years ago • 6 comments

Hi, thank you for your code. As I dug deeper into it, I found that the training step is particularly slow. The problem here (I guess) is the dataset construction process, where too many functions (e.g., padding sequences, getting history) are implemented in __getitem__. I wonder, have you tried wrapping these functions in the __init__ function? This might lead to more memory consumption, but it should definitely accelerate the training process. Thanks.

guoyang9 avatar Oct 14 '19 14:10 guoyang9

Hey @guoyang9, thanks for trying out the code! No, I don't think we've tried moving all the padding + repackaging to __init__, but that should definitely lead to a significant speed-up (cc @kdexd). Let us know (and please send a PR) if you happen to give that a shot.

abhshkdz avatar Oct 14 '19 17:10 abhshkdz

Hi @guoyang9, thanks for trying it out!

I agree that it should accelerate training; it was a design choice: I prioritized memory consumption while developing the code.

The __getitem__ execution is parallelized across multiple workers (one example per call) for reasonably small batch sizes, and the next batch is collected during the forward pass on the current batch (when num_workers > 0 in the DataLoader).
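
As a minimal sketch of that behaviour (a toy class, not the actual reader in this repo), padding inside __getitem__ looks roughly like this, with worker processes preparing the next batch while the current one is being consumed:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPadInGetItem(Dataset):
    """Toy stand-in for the VisDial reader: variable-length sequences are
    padded on the fly, inside __getitem__, one example per call."""

    def __init__(self, num_examples=1000, max_len=20):
        self.max_len = max_len
        # Random variable-length "token" sequences, for illustration only.
        self.sequences = [
            torch.randint(1, 100, (int(torch.randint(1, max_len, (1,))),))
            for _ in range(num_examples)
        ]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        # Padding happens here, in the worker process, per example.
        padded = torch.zeros(self.max_len, dtype=torch.long)
        seq = self.sequences[index]
        padded[: len(seq)] = seq
        return padded

loader = DataLoader(ToyPadInGetItem(), batch_size=32, num_workers=4)

for batch in loader:
    # While the forward pass runs on this batch, the worker processes are
    # already padding the next batch in the background.
    pass
```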

On the other hand, padding all the sequences with zeros up front would grow the memory requirements roughly linearly with the number of examples. This was also one of the motivations for doing the pre-processing on the fly instead of reading tokens from H5 files (others being flexible switching of the vocabulary and such).

However, this is purely my intuition and I haven't tried moving things to __init__. I may get to this a bit later. In the meantime, if you try it out yourself and it improves the speed with a decent trade-off in memory, I would encourage you to make a PR; happy to accept your contribution! :)
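
For comparison, here is a rough sketch of what pre-padding in __init__ could look like (again a toy class, not this repo's reader); __getitem__ reduces to an index lookup, at the cost of holding one large padded tensor in memory for the whole run:

```python
import torch
from torch.utils.data import Dataset

class ToyPadInInit(Dataset):
    """Variant of the toy reader above: every sequence is padded once in
    __init__, so __getitem__ becomes a plain index lookup."""

    def __init__(self, sequences, max_len=20):
        # A single (num_examples, max_len) tensor held for the whole run;
        # memory grows roughly linearly with the number of examples.
        self.padded = torch.zeros(len(sequences), max_len, dtype=torch.long)
        for i, seq in enumerate(sequences):
            length = min(len(seq), max_len)
            self.padded[i, :length] = seq[:length]

    def __len__(self):
        return self.padded.size(0)

    def __getitem__(self, index):
        # No tokenization or padding here; just return the precomputed row.
        return self.padded[index]
```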

kdexd avatar Oct 14 '19 19:10 kdexd

Hi @abhshkdz @kdexd, thanks for your reply. I will try moving the padding functions from __getitem__ to __init__ and check the memory consumption. I will get back to you later. Thanks! :)

guoyang9 avatar Oct 15 '19 00:10 guoyang9

Interestingly, after I moved the to_indices() and _pad_sequence() functions into the __init__ of the reader (which could hurt readability), I saw only a marginal speed improvement in my tests. This is really a hard issue for me. Have you figured out where the speed bottleneck is? @abhshkdz @kdexd
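
One way to see where the time goes is to measure data loading and compute separately per epoch; a rough sketch, where model, optimizer, and device are placeholders:

```python
import time
import torch

def profile_epoch(loader, model, optimizer, device):
    """Rough split of wall-clock time between data loading and compute."""
    data_time, compute_time = 0.0, 0.0
    end = time.time()
    for batch in loader:
        data_time += time.time() - end      # time spent waiting for the batch

        start = time.time()
        batch = batch.to(device)
        loss = model(batch).mean()          # placeholder loss, for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize()        # make GPU timings meaningful
        compute_time += time.time() - start

        end = time.time()
    print(f"data: {data_time:.1f}s  compute: {compute_time:.1f}s")
```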

guoyang9 avatar Oct 18 '19 01:10 guoyang9

Two suggestions to speed up the code and to avoid a memory leak on the other GPU:

  1. Call torch.cuda.set_device(device) before torch.cuda.empty_cache().
  2. Call torch.cuda.empty_cache() after every epoch instead of after every batch. This gave a major speedup (see the sketch below).
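
A rough skeleton of what I mean (an illustrative loop, not the actual train.py of this repo, and assuming training runs on a CUDA machine):

```python
import torch

# Illustrative skeleton, assuming training happens on a non-default GPU.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cuda:0")
num_epochs = 10
train_loader = []  # stand-in for the real DataLoader

for epoch in range(num_epochs):
    for batch in train_loader:
        pass  # forward / backward / optimizer step; no empty_cache() here

    # Selecting the training GPU first keeps empty_cache() from creating a
    # CUDA context (and holding memory) on the default GPU; calling it once
    # per epoch instead of on every batch avoids the per-batch overhead,
    # which is where the speedup came from.
    torch.cuda.set_device(device)
    torch.cuda.empty_cache()
```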

Please let me know if you want me to raise a PR. Thanks.

shubhamagarwal92 avatar May 28 '20 00:05 shubhamagarwal92

@shubhamagarwal92 Thanks for the suggestions! Both make sense to me. If you could send in a pull request, that'd be great, thanks!

abhshkdz avatar May 28 '20 22:05 abhshkdz