
Performance issues with DialogRPT + DialoGPT

Open pablogranolabar opened this issue 3 years ago • 5 comments

Hi again @golsun,

I've been working with DialogRPT on top of DialoGPT-large for dialog generation and have hit performance issues that aren't present when using DialoGPT-large alone. Round-trip responses with CPU inference take just a few seconds with gpt2-large, but whenever DialogRPT is used with the DialoGPT-large checkpoint, performance grinds to a halt. With GPU inference I can run gpt2-large on a 6 GB GPU, but with DialogRPT I get OOM. I understand that multiple models are running in the DialogRPT + DialoGPT combination, which is the obvious culprit; is there any way to serialize execution of the two models to prevent these resource-consumption issues?

pablogranolabar avatar Mar 01 '21 22:03 pablogranolabar

hi @pablogranolabar,

I can think of several potential reasons for the OOM:

  • torch.no_grad, which keeps gradients from consuming memory -- it was already applied in the scorer, but not in generation.py. I've updated it here, please take a look.
  • the number of candidates to be scored -- if it's too large, you can split the candidates into several batches and send them to DialogRPT, similar to this (see the sketch after this list).
  • if that still doesn't work, I guess you can use two machines, one just for DialoGPT-large and one for DialogRPT, and use an API to communicate between them.
  • how many DialogRPT models are you using? I recommend at least updown and human_vs_rand, because updown doesn't capture context-response relevance.
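
for example, a rough sketch of the batching idea (not code from the repo; it assumes the get_model / predict helpers in score.py, with predict returning one score per candidate):

import numpy as np
import torch
from score import get_model, predict

def score_in_batches(model, cxt, hyps, batch_size=8):
    # score a handful of candidates at a time so peak GPU memory stays bounded
    scores = []
    with torch.no_grad():   # no graph is built, so activations are freed immediately
        for i in range(0, len(hyps), batch_size):
            scores.extend(predict(model, cxt, hyps[i:i + batch_size]))
    return np.array(scores)

ranker = get_model('restore/updown.pth')
batch_scores = score_in_batches(ranker, cxt, hyps, batch_size=8)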

golsun avatar Mar 01 '21 23:03 golsun

Hi @golsun, thanks for the quick response!

The two-machine idea makes sense; I think I can do that with relative ease if it comes to that.

For the DialogRPT models I am just using updown. So I should ensemble at least updown + human_vs_rand? This application is for a conversational agent that can rerank dialog based on human scoring of the chatbot responses.

pablogranolabar avatar Mar 01 '21 23:03 pablogranolabar

Yes, human_vs_rand (together with updown) should help in that case. If memory is a concern, a low-memory alternative that avoids human_vs_rand is to decode responses with a small top_k or top_p; this also helps keep the response relevant to the context. But I guess the performance depends on the scenario.
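
As a rough illustration (the model name and decoding parameters below are just examples, not the repo's defaults), something like this with Hugging Face transformers keeps decoding conservative:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-large')
model = AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-large')

cxt_ids = tokenizer.encode('Does money buy happiness?' + tokenizer.eos_token, return_tensors='pt')
with torch.no_grad():
    out = model.generate(
        cxt_ids,
        do_sample=True,
        top_k=10,       # small top_k ...
        top_p=0.5,      # ... or small top_p keeps sampling close to the context
        max_length=cxt_ids.shape[-1] + 40,
        pad_token_id=tokenizer.eos_token_id,
    )
reply = tokenizer.decode(out[0, cxt_ids.shape[-1]:], skip_special_tokens=True)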

golsun avatar Mar 01 '21 23:03 golsun

Hi again @golsun. I'm working on ensembling human_vs_rand with updown per your advice, but I'm unsure how to proceed with ensemble.yml. Should human_vs_rand and updown both be part of prior with equal weights? Or should human_vs_rand be the prior, with updown as the conditional? For the performance reasons above, I'm trying to do this with just a two-model ensemble, as you suggested.

pablogranolabar avatar Mar 03 '21 01:03 pablogranolabar

hi, in this case I guess a simple way, without dealing with ensemble.yml, is:

# `get_model` and `predict` are functions from score.py
import numpy as np
from score import get_model, predict

hvm = get_model('restore/human_vs_machine.pth')
updown = get_model('restore/updown.pth')
score_hvm = predict(hvm, cxt, hyps)        # cxt / hyps: the context string and the candidate responses
score_updown = predict(updown, cxt, hyps)
score_overall = np.sqrt(score_updown * score_hvm)   # use this as the final score

I used the geometric mean for score_overall, but you could also try a weighted arithmetic mean.
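
For example (the 0.7 weight is purely illustrative; tune it on your own data):

w = 0.7   # hypothetical weight on the updown score
score_overall = w * score_updown + (1 - w) * score_hvm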

golsun avatar Mar 03 '21 02:03 golsun