AMD CPU generation is very slow
Generation in FP32 is very slow; tokens/second feels far worse than it should be, but I'm not sure of the best way to debug this.
```
$ python3 torchchat.py generate --prompt "hello model" -v llama2
Using device=cpu AMD Ryzen 7 3700X 8-Core Processor
Loading model...
Time to load model: 2.35 seconds
tensor([    1, 22172,  1904], dtype=torch.int32)
hello model
[snip output]
Time for inference 1: 1043.69 sec total, 0.19 tokens/sec
Bandwidth achieved: 2.58 GB/s
Max Sequence Length Reached. Ending Conversation.
Average tokens/sec: 0
```
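To see where the time is actually going, one option is `torch.profiler`. A minimal standalone sketch (this does not run torchchat itself; the matmul shape is a placeholder I picked, not llama2's real dimensions):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload: a matmul loop loosely shaped like a decode step.
# The dimensions are placeholders, not the model's actual shapes.
x = torch.randn(1, 4096)
w = torch.randn(4096, 4096)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(50):
        x @ w

# Shows which CPU ops dominate and whether optimized (MKL/oneDNN) kernels
# are being dispatched, or something is falling back to a slow path.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```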
I will try a couple of other dtypes as well, but this feels outside the range of expectations. @malfet, any thoughts?
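For the dtype comparison, a rough standalone throughput check (independent of torchchat; the matrix size and iteration count are arbitrary choices of mine):

```python
import time
import torch

print(f"torch {torch.__version__}, threads: {torch.get_num_threads()}")

n = 2048
iters = 10
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    try:
        a = torch.randn(n, n, dtype=dtype)
        b = torch.randn(n, n, dtype=dtype)
        a @ b  # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        dt = (time.perf_counter() - t0) / iters
        # An n-by-n matmul is ~2*n^3 floating-point ops.
        print(f"{dtype}: {dt * 1e3:.1f} ms/matmul, ~{2 * n**3 / dt / 1e9:.0f} GFLOP/s")
    except RuntimeError as err:
        print(f"{dtype}: not supported on this build ({err})")
```

Worth noting: the 3700X (Zen 2) has no native bf16 instructions, so bf16 may not be any faster here; confirming that would still help narrow things down.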