PowerInfer
Fix a bug in calculating `neuron_cap` before invoking the solver
For example, with ReluLLaMA-7B on an NVIDIA GeForce RTX 2080 Ti (11264 MiB), `ffn_up`, `ffn_gate`, and `ffn_down_t` all have shape [4096, 11008]. A neuron should therefore be [4096, 1], not [1, 11008].
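To make the shape argument concrete, here is a minimal Python sketch of how the neuron size feeds into `slice_size`. It assumes, based only on the logged numbers (22016 = 2 × 11008 and 8192 = 2 × 4096), that `slice_size` counts one neuron's elements in two tensors. The factor of 2 and the helper names are hypothetical, not taken from the PowerInfer source.

```python
HIDDEN = 4096  # first dimension of ffn_up / ffn_gate / ffn_down_t in ReluLLaMA-7B
FFN = 11008    # second dimension: the number of FFN neurons

# Hypothetical: number of tensors whose per-neuron slices count toward slice_size.
TENSORS_PER_SLICE = 2


def neuron_len(shape, buggy=False):
    """Elements belonging to one neuron in a [HIDDEN, FFN] tensor.

    Correct: a neuron is a [HIDDEN, 1] column, i.e. shape[0] elements.
    Buggy:   treating a neuron as a [1, FFN] row gives shape[1] elements.
    """
    return shape[1] if buggy else shape[0]


def slice_size(shape, buggy=False):
    """Hypothetical slice size: per-neuron elements times the tensor count."""
    return TENSORS_PER_SLICE * neuron_len(shape, buggy)


shape = (HIDDEN, FFN)
print(slice_size(shape, buggy=True))  # 22016, matches the "before" log
print(slice_size(shape))              # 8192, matches the "after" log
```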
Running `CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"` gives:
- Before the fix: `slice_size=22016`, `vram_bytes_per_slice=99072`, `vram_allocatable_bytes=4212178944`, `neuron_cap=170064`
- After the fix: `slice_size=8192`, `vram_bytes_per_slice=24576`, `vram_allocatable_bytes=4212178944`, `neuron_cap=171394`
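The "after" values are internally consistent: 4212178944 / 24576 is exactly 171394, so they agree with `neuron_cap = vram_allocatable_bytes // vram_bytes_per_slice`. The sketch below checks that arithmetic. The formula is inferred from the logged numbers rather than copied from the PowerInfer code, and the function name is hypothetical.

```python
def neuron_cap(vram_allocatable_bytes, vram_bytes_per_slice):
    """Hypothetical: how many whole neuron slices fit in the VRAM budget."""
    if vram_bytes_per_slice <= 0:
        raise ValueError("vram_bytes_per_slice must be positive")
    # Integer division: a partial slice cannot be offloaded.
    return vram_allocatable_bytes // vram_bytes_per_slice


# Values from the "after" log line.
print(neuron_cap(4212178944, 24576))  # 171394
print(4212178944 % 24576)             # 0: the budget divides evenly
```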