visdial-challenge-starter-pytorch
Training step is too slow
Hi,
Thank you for your code.
As I dug deeper into the code, I found the training step to be particularly slow. The problem (I guess) is the dataset construction, where too many operations (e.g., padding sequences, getting history) are implemented in `__getitem__`.
I wonder, have you tried wrapping these operations in `__init__` instead? This might consume more memory but should significantly accelerate training.
Thanks.
Hey @guoyang9, thanks for trying out the code! No, don't think we've tried moving all the padding + repackaging to init, but that should definitely lead to significant speed-up (cc @kdexd). Let us know (and please send a PR) if you happen to give that a shot.
Hi @guoyang9, thanks for trying out!
I agree that it should accelerate training; it is mainly a design choice: I prioritized memory consumption while developing the code.
The `__getitem__` execution is parallelized across multiple workers (per example) for reasonably small batch sizes, and the next batch is collected during the forward pass on the current batch (for `num_workers > 0` in the dataloader).
On the other hand, padding all the sequences with zeros would grow the memory requirements roughly in the order of the number of examples. This was also one of the motivations to do pre-processing on the fly, instead of reading tokens from H5 files (others being flexible switching of vocabulary and such).
However, this is purely my intuition and I haven't tried moving things to `__init__`. I may get to it a bit later. In the meantime, if you try it yourself and it improves speed with a decent trade-off in memory, I would encourage you to make a PR; happy to accept your contribution! :)
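To make the trade-off concrete, here is a minimal sketch of the two designs being discussed. The dataset classes, `pad_sequence` helper, and `PAD_INDEX` below are hypothetical illustrations, not the starter code's actual implementation:

```python
import torch
from torch.utils.data import Dataset

PAD_INDEX = 0  # hypothetical padding token index

def pad_sequence(tokens, max_len, pad=PAD_INDEX):
    """Truncate, then right-pad a token-id list to a fixed length."""
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

class OnTheFlyDataset(Dataset):
    """Pads per example inside __getitem__: low memory, and the work
    is parallelized across DataLoader workers."""
    def __init__(self, sequences, max_len):
        self.sequences = sequences
        self.max_len = max_len

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(pad_sequence(self.sequences[idx], self.max_len))

class PrePaddedDataset(Dataset):
    """Pads everything once in __init__: memory grows with the number
    of examples, but __getitem__ becomes a cheap tensor index."""
    def __init__(self, sequences, max_len):
        self.padded = torch.tensor(
            [pad_sequence(s, max_len) for s in sequences])

    def __len__(self):
        return self.padded.size(0)

    def __getitem__(self, idx):
        return self.padded[idx]
```

Both return identical batches; the difference is only where the padding cost (and the memory) lives.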
Hi @abhshkdz @kdexd, thanks for your reply.
I will try moving the padding from `__getitem__` to `__init__` and check the memory consumption.
I will get back to you later.
Thanks! :)
Interestingly, after I moved the `to_indices()` and `_pad_sequence()` functions into the `__init__` of the reader (at some cost in readability), I measured only marginal speed improvements.
This is really a hard issue for me. Have you figured out where the speed bottleneck is? @abhshkdz @kdexd
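One way to locate the bottleneck is to separate time spent waiting on the dataloader from time spent in the model. This is a generic profiling sketch with a stand-in dataset and model (all names here are illustrative, not from the starter code):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SlowDataset(Dataset):
    """Stand-in dataset; __getitem__ simulates per-example preprocessing."""
    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        time.sleep(0.001)  # simulated preprocessing cost
        return torch.randn(8)

def profile_loader(loader, model):
    """Return (seconds blocked on the loader, seconds in the model)."""
    data_time = compute_time = 0.0
    end = time.perf_counter()
    for batch in loader:
        data_time += time.perf_counter() - end  # waiting for the batch
        start = time.perf_counter()
        model(batch)
        compute_time += time.perf_counter() - start
        end = time.perf_counter()
    return data_time, compute_time

model = torch.nn.Linear(8, 4)
loader = DataLoader(SlowDataset(), batch_size=8, num_workers=0)
data_t, compute_t = profile_loader(loader, model)
print(f"data: {data_t:.3f}s  compute: {compute_t:.3f}s")
```

If `data_t` dominates, the fix is in the dataset/dataloader (more workers, cheaper `__getitem__`); if `compute_t` dominates, moving padding around won't help much.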
Two suggestions to speed up the code and to avoid memory leaks on other GPUs:
- Call `torch.cuda.set_device(device)` before `torch.cuda.empty_cache()`.
- Call `torch.cuda.empty_cache()` after every epoch instead of every batch. This gave a major speedup.
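A minimal sketch of a training loop applying both suggestions (the `train` function and its arguments are hypothetical, not the repo's actual trainer; the CUDA calls are guarded so the loop also runs on CPU):

```python
import torch

def train(model, loader, optimizer, num_epochs, device):
    """Illustrative loop: pin cache-clearing to the right GPU,
    and clear the cache once per epoch rather than per batch."""
    use_cuda = device.type == "cuda" and torch.cuda.is_available()
    if use_cuda:
        # Ensure empty_cache() acts on our GPU; without this it can
        # touch (and allocate context on) the default device instead.
        torch.cuda.set_device(device)
    model.to(device)
    for epoch in range(num_epochs):
        for batch, target in loader:
            batch, target = batch.to(device), target.to(device)
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(batch), target)
            loss.backward()
            optimizer.step()
        if use_cuda:
            torch.cuda.empty_cache()  # once per epoch, not per batch
    return model
```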
Please let me know if you want me to raise a PR. Thanks.
@shubhamagarwal92 Thanks for the suggestions! Both make sense to me. If you could send in a pull request, that'd be great, thanks!