parallelformers
GPT models hang on large token generation. Lower performance?
I am using a 3060 and a 3090 to split GPT models two ways, including GPT-J and GPT-Neo 2.7B. When generating many tokens (say 500), the model hangs: it either takes an abnormally long time to finish or does not finish at all (I kill it). Generating 50 tokens does not have this issue.
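For reference, this is roughly how I set things up, following the README's parallelize example. The prompt and generation arguments below are just a minimal sketch of what I'm running, not my exact script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# Load GPT-Neo 2.7B on CPU first; parallelize() will shard it across the GPUs
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# Split the model two ways across the 3060 and 3090, in fp16
parallelize(model, num_gpus=2, fp16=True, verbose="detail")

inputs = tokenizer("Once upon a time,", return_tensors="pt")

# max_length around 50 finishes fine; around 500 it hangs as described
outputs = model.generate(**inputs, do_sample=True, max_length=500)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```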
While this is happening, the 3090's memory is pinned at 100% while the 3060's stays low.
Subjectively (especially for GPT-J), the results are not complete gibberish, but they do seem to be of lower quality.
Might this be a race condition between the two GPUs?