CorefQA
Possible to train on multiple GPUs?
Hi, thanks for your contribution. This is great and novel work. I have also implemented your model in PyTorch, but I find it impossible to train even a base model on multiple GPUs. The main reason, I believe, is that even with a batch size of 1, the number of generated questions varies, so those questions and their corresponding passages cannot be distributed across different GPUs during computation. Have you trained this model on multiple GPUs before, or is it only feasible to train on TPUs?
Hope you could clarify my confusion and correct me if I am wrong.
Thanks
Hello! Thanks for the comment. I think the bottleneck is that we have to treat a single document as the basic unit (a mention at the beginning of the doc can be coreferent with another mention at the end of the doc). BERT is super length-sensitive. That is why we need 128G of memory (we know it is very computationally inefficient).
To get the model to run on 1-4 16G GPUs, I think chopping docs is essential (but of course you sacrifice performance). Or perhaps just ignore super long documents.
We plan to release the PyTorch version in the near future. Since it is super clumsy to get PyTorch running on TPUs, it will be on GPUs. But not (Span)BERT-large for sure (you need 2 V100s to run BERT-large even on pretty short sequences).
Thanks for your reply. I agree that randomly truncating the whole document into several consecutive sentences is essential for training, just like what Joshi did for his BERT baseline models.
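For what it's worth, the truncation I have in mind is roughly the following: just a minimal sketch, assuming the document is already a list of tokenized sentences (the helper name and the token budget are mine, not from your repo).

```python
import random

def truncate_document(sentences, max_tokens=384, seed=None):
    """Pick a random contiguous block of sentences whose total length
    stays under max_tokens (hypothetical helper, not from CorefQA)."""
    rng = random.Random(seed)
    start = rng.randrange(len(sentences))
    segment, n_tokens = [], 0
    for sent in sentences[start:]:
        if n_tokens + len(sent) > max_tokens:
            break
        segment.append(sent)
        n_tokens += len(sent)
    # the start index lets you map span offsets back to the full document
    return segment, start

# toy usage: each "sentence" is a list of subword tokens
doc = [["Alice", "went", "home", "."], ["She", "was", "tired", "."], ["Bob", "called", "her", "."]]
segment, offset = truncate_document(doc, max_tokens=8, seed=0)
```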
It may also be feasible to chop the generated questions into several chunks: for example, if we get 50 questions, we can use a for loop to do prediction, feeding 8 or fewer questions into the model at a time. We can also distribute these chunks to different GPUs to speed up training (sketch below).
But I believe the above two methods will inevitably hurt final performance.
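Concretely, the chunked prediction I am imagining looks something like this (just a sketch; `qa_model` and the tensor shapes are assumptions on my side, not your actual interface):

```python
import torch

@torch.no_grad()
def score_questions_in_chunks(qa_model, question_ids, question_mask, chunk_size=8):
    """Run a variable number of generated questions through the QA model
    in fixed-size chunks so peak memory stays bounded.
    Model, names, and shapes are illustrative assumptions."""
    scores = []
    for start in range(0, question_ids.size(0), chunk_size):
        chunk_ids = question_ids[start:start + chunk_size]
        chunk_mask = question_mask[start:start + chunk_size]
        scores.append(qa_model(chunk_ids, attention_mask=chunk_mask))
    return torch.cat(scores, dim=0)

# toy usage with a stand-in model
dummy = lambda ids, attention_mask=None: ids.float().mean(dim=-1, keepdim=True)
q_ids = torch.randint(0, 100, (50, 32))   # 50 questions, 32 tokens each
q_mask = torch.ones_like(q_ids)
all_scores = score_questions_in_chunks(dummy, q_ids, q_mask, chunk_size=8)  # shape (50, 1)
```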
The backward QA is very memory-intensive. Let's say you first propose mention x and get QA scores for span y, denoted by s(y|x). Then you need to compute s(x|y), using y as the query. If we preserve C ys, then the memory is C times as large.
I guess the backward QA is the most memory-intensive part, more so than all the other components. The backward QA gives about a 1-1.4 F1 boost, but consumes several times more memory. Removing the backward QA is also an option.
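Back-of-the-envelope, with made-up numbers (not measurements from the paper), the multiplier looks like this:

```python
# Rough memory arithmetic for the backward QA step. If the forward pass asks
# one question per proposed mention x and keeps the top-C candidate spans y
# per question, the backward pass must re-encode one (question, passage) pair
# for every retained y. The counts below are illustrative assumptions.
n_mentions = 40        # proposed mentions x in a document (assumed)
top_c = 50             # candidate spans y kept per mention (assumed)

forward_passes = n_mentions            # one question per x
backward_passes = n_mentions * top_c   # one question per retained (x, y) pair
print(backward_passes / forward_passes)  # = top_c, i.e. C times the activations
```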
re: if we get 50 questions, we can use a for loop to do prediction, feeding 8 or fewer questions into the model at a time. We can also distribute these chunks to different GPUs to speed up training.
This is a great suggestion. The problem is that as long as your 4-8 V100s cannot fit all the questions (say 150 questions) proposed for the same doc, it will be problematic: since you need to update your gradients after the current round and release the memory, the mentions proposed by the updated model will differ from the 150 mentions proposed in the first place.
But yes, we cannot avoid some tradeoffs.
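One such tradeoff that at least keeps the proposing model fixed within a document is to accumulate gradients over the question chunks and only step the optimizer once per document. A rough PyTorch sketch with assumed names (`model`, `question_chunks`, `loss_fn`); it does not, of course, solve the memory needed to propose the mentions in the first place:

```python
import torch

def train_one_document(model, optimizer, question_chunks, loss_fn):
    """Accumulate gradients over question chunks and update the parameters
    once per document, so every chunk is scored by the same weights.
    Names, batch layout, and loss are illustrative assumptions."""
    optimizer.zero_grad()
    total_loss = 0.0
    for chunk in question_chunks:            # e.g. 8 questions at a time
        logits = model(chunk["input_ids"], attention_mask=chunk["attention_mask"])
        loss = loss_fn(logits, chunk["labels"]) / len(question_chunks)
        loss.backward()                       # frees this chunk's activations
        total_loss += loss.item()
    optimizer.step()                          # single update per document
    return total_loss
```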
I don't like the inefficiency of the model in the current paper either. We will try to fix it in future work. Nice to chat :-)
Yes, the backward coreference score consumes far more memory than the forward one. I also notice that you use only the sentence that mention x resides in as the context during backward QA. Is this just a tradeoff to save memory, or is the information provided by this sentence enough to predict the backward score for x and y? After all, mentions x and y have already been identified, unlike in the forward pass, where we have to find potential antecedents ys and the whole document is necessary.
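Just to make sure I read the backward context correctly, I understand it as something like the following (the sentence offsets and the helper are my own guess, not your code):

```python
def sentence_context_for_mention(sentence_spans, mention_start, mention_end):
    """Return the (start, end) token span of the sentence that contains the
    mention, to be used as the backward-QA context (illustrative helper).
    `sentence_spans` is a list of (sent_start, sent_end) token offsets."""
    for sent_start, sent_end in sentence_spans:
        if sent_start <= mention_start and mention_end <= sent_end:
            return sent_start, sent_end
    return mention_start, mention_end  # fallback: just the mention itself

# toy usage: three sentences covering tokens [0,9], [10,17], [18,30]
spans = [(0, 9), (10, 17), (18, 30)]
print(sentence_context_for_mention(spans, 12, 14))  # -> (10, 17)
```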
Another thing is that I didn't find that you have implemented the BIO tagging scheme for finding antecedents. I have noticed another paper that uses this strategy to answer multi-span questions: https://arxiv.org/pdf/1909.13375.pdf. I guess it may help to gain further improvements.
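For reference, the multi-span decoding I have in mind is just turning per-token BIO tags into spans, roughly like this (the tag encoding is an assumption):

```python
def decode_bio_spans(tags):
    """Turn a per-token BIO tag sequence into a list of (start, end) spans,
    end-exclusive. Tags are the strings "B", "I", "O" (assumed encoding)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close the previous span
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an "I" without a "B"
                start = i
        else:                          # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

# toy usage: two answer spans for one question
print(decode_bio_spans(["O", "B", "I", "O", "B", "O"]))  # -> [(1, 3), (4, 5)]
```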