baize-chatbot
very high CPU during inference. GPU seems to be idle.
I have tried the 8-bit option as well, but there is no change.
It generates tokens slowly, and CPU usage goes high (>80%). GPU usage jumps too, but always stays below 20%. So it seems to be CPU-bound rather than GPU-bound.
So, by default, does it run inference on the GPU?
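One quick way to check whether the model is actually landing on the GPU is to ask PyTorch directly; a minimal sketch (assuming the standard PyTorch stack the project uses). If CUDA isn't visible to PyTorch, generation silently falls back to the CPU, which would match the symptoms above:

```python
import torch

# Pick the device the way most loaders do: prefer CUDA when PyTorch
# can see it, otherwise fall back to CPU. A "cpu" result here would
# explain high CPU usage and a near-idle GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# After loading, you can also confirm where the weights ended up:
#   print(next(model.parameters()).device)   # expect cuda:0 on GPU
```

If this prints `cpu`, the fix is usually an environment issue (CUDA-enabled PyTorch build, visible GPU) rather than anything in the chatbot code.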
This seems to be a problem with int8. In our tests, it is indeed slower than fp16. We'll investigate this.
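For reference, the two loading paths being compared might look like this. This is only a sketch of typical `transformers.AutoModelForCausalLM.from_pretrained` arguments, not the repo's actual code, and the checkpoint name is a placeholder; the int8 path additionally requires the `bitsandbytes` package:

```python
import torch

MODEL = "your/baize-checkpoint"  # placeholder, not a real model id

# fp16 path: half-precision weights placed on the GPU -- the usual fast path.
fp16_kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}

# int8 path: 8-bit quantization via bitsandbytes. It roughly halves GPU
# memory again, but the on-the-fly dequantization in the matmuls can make
# generation slower than plain fp16, consistent with the slowdown above.
int8_kwargs = {"load_in_8bit": True, "device_map": "auto"}

# Usage (downloads weights, so not run here):
# model = AutoModelForCausalLM.from_pretrained(MODEL, **fp16_kwargs)
print(fp16_kwargs, int8_kwargs)
```

In short, int8 trades memory for speed, so seeing it slower than fp16 is plausible even when everything is correctly on the GPU.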