Performance improvement: GPT-J and BERT Offline scenario
The current implementations of GPT-J and BERT run predictions sequentially, one query at a time. Could their performance in the Offline scenario be improved by dispatching predictions to multiple threads instead of processing them sequentially?
GPT-J ref: https://github.com/mlcommons/inference/blob/fa4fe53e53379dee27a216695a2b710d122154c7/language/gpt-j/backend.py#L72
BERT ref: https://github.com/mlcommons/inference/blob/fa4fe53e53379dee27a216695a2b710d122154c7/language/bert/pytorch_SUT.py#L68
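
For illustration, here is a minimal sketch of what thread-based dispatch might look like, using Python's `concurrent.futures`. The names `model_predict` and `query_samples` are hypothetical stand-ins for the backend's actual predict call and the LoadGen query batch, not the repository's API:

```python
# Hypothetical sketch: replacing a sequential prediction loop with a
# thread pool. `model_predict` and `query_samples` are illustrative
# stand-ins, not the names used in backend.py or pytorch_SUT.py.
from concurrent.futures import ThreadPoolExecutor

def issue_queries_parallel(query_samples, model_predict, num_workers=4):
    """Run predictions concurrently instead of one after another.

    Caveat: CPython's GIL serializes pure-Python code, so threads only
    help when the backend (e.g. PyTorch or ONNX Runtime kernels)
    releases the GIL during inference.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, so responses stay aligned
        # with the queries they answer.
        results = list(pool.map(model_predict, query_samples))
    return results
```

Whether this actually helps depends on how much of the inference time is spent inside GIL-releasing backend kernels; batching multiple queries into a single forward pass may be an alternative (or complementary) way to exploit the Offline scenario's lack of latency constraints.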