AMD CPU generation is very slow
Generation in FP32 is very slow; tokens/second feels far worse than it should be, but I'm not sure of the best way to debug this.
```
$ python3 torchchat.py generate --prompt "hello model" -v llama2
Using device=cpu AMD Ryzen 7 3700X 8-Core Processor
Loading model...
Time to load model: 2.35 seconds
tensor([    1, 22172,  1904], dtype=torch.int32)
hello model
[snip output]
Time for inference 1: 1043.69 sec total, 0.19 tokens/sec
Bandwidth achieved: 2.58 GB/s
Max Sequence Length Reached. Ending Conversation.
Average tokens/sec: 0
```
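To see where the time is actually going, one option is `torch.profiler`. A minimal standalone sketch (this does not run torchchat itself; the matmul shape is a placeholder I picked, not llama2's real dimensions):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload: a matmul loop loosely shaped like a decode step.
# The dimensions are placeholders, not the model's actual shapes.
x = torch.randn(1, 4096)
w = torch.randn(4096, 4096)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(50):
        x @ w

# Shows which CPU ops dominate and whether optimized (MKL/oneDNN) kernels
# are being dispatched, or something is falling back to a slow path.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```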
I will try a couple of other dtypes as well, but this feels outside the range of expectations. @malfet, any thoughts?
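For the dtype comparison, a rough standalone throughput check (independent of torchchat; the matrix size and iteration count are arbitrary choices of mine):

```python
import time
import torch

print(f"torch {torch.__version__}, threads: {torch.get_num_threads()}")

n = 2048
iters = 10
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    try:
        a = torch.randn(n, n, dtype=dtype)
        b = torch.randn(n, n, dtype=dtype)
        a @ b  # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        dt = (time.perf_counter() - t0) / iters
        # An n-by-n matmul is ~2*n^3 floating-point ops.
        print(f"{dtype}: {dt * 1e3:.1f} ms/matmul, ~{2 * n**3 / dt / 1e9:.0f} GFLOP/s")
    except RuntimeError as err:
        print(f"{dtype}: not supported on this build ({err})")
```

Worth noting: the 3700X (Zen 2) has no native bf16 instructions, so bf16 may not be any faster here; confirming that would still help narrow things down.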