torchchat
Potentially slow when running quantized versions on Desktop CPU
I tried the following four quantization configurations on my MacBook Pro M1.
(1) Really slow — int4 embedding (groupsize 32) plus int8 linear (groupsize 64):

```shell
python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:int8": {"groupsize": 64}}'
```

Average tokens/sec: 0.56

(2) Acceptable? — int8 embedding, per-tensor (groupsize 0):

```shell
python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 8, "groupsize": 0}}'
```

Average tokens/sec: 4.26

(3) Slow — int4 embedding (groupsize 32):

```shell
python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 4, "groupsize": 32}}'
```

Average tokens/sec: 2.31

(4) Slow — int4 linear (groupsize 256):

```shell
python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"linear:int4": {"groupsize": 256}}'
```

Average tokens/sec: 2.94
Setup:
- git commit: 695a5817224f6fe36f06d63ebadf0dff4aee3e96
- Python version: 3.10.0
- Hardware: MacBook Pro M1
Internal Task: T187752023