PowerInfer
Where can I download the predictor of Relu-Falcon-40B (float16)?
Are the predictors at https://huggingface.co/PowerInfer/ReluFalcon-40B-Predictor for Relu-Falcon-40B (float16) or Relu-Falcon-40B (int4)? If they are int4, where can I download the predictors for Relu-Falcon-40B (float16)?
Yes. All predictors we published are in FP16. To use them with an FP16 model, you can convert the model and predictor into PowerInfer GGUF as mentioned in our README.
If you want to run an INT4-quantized model + predictor, you can quantize the generated FP16 model, and the predictor will be quantized at the same time.
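If you want to double-check the dtypes on your side, here is a rough sketch (the .pt file layout in the predictor repo is an assumption for illustration; adjust the glob to whatever the repo actually contains):
import glob
import torch

# Inspect a few downloaded predictor files and print their tensor dtypes
# (expected: torch.float16).
for path in sorted(glob.glob("./ReluFalcon-40B-Predictor/*.pt"))[:3]:
    obj = torch.load(path, map_location="cpu")
    tensors = obj.values() if isinstance(obj, dict) else [obj]
    for t in tensors:
        if isinstance(t, torch.Tensor):
            print(path, t.dtype)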
I converted the model and predictor of Falcon-40B into PowerInfer GGUF as described in your README and kept the directory layout as shown there. However, it fails with the following error and nothing is offloaded to the GPU:
ggml_cuda_set_main_device: using device 0 (NVIDIA A100 80GB PCIe) as main device
llm_load_sparse_model_tensors: mem required = 62456.22 MB
llm_load_sparse_model_tensors: VRAM used: 23903.56 MB
...................................................................................................
invoking powerinfer Python module to generate gpu split for 55896.81 MiB of VRAM
solver args: Namespace(activation='./ReluFalcon-40B-PowerInfer-GGUF/activation', neuron=32768, capacity=1788696, layer=60, vram_capacity=58612056064, batch=256, threshold=0, output='./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf.generated.gpuidx')
Traceback (most recent call last):
File "
- (Tensor input, *, bool stable, int dim, bool descending, tuple of Tensors out)
- (Tensor input, int dim, bool descending, *, tuple of Tensors out)
- (Tensor input, *, bool stable, name dim, bool descending, tuple of Tensors out)
- (Tensor input, name dim, bool descending, *, tuple of Tensors out)
llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
............................................................ done (3.64 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
Can you confirm that your PyTorch version aligns with our requirements.txt? It seems like an incompatibility in the PyTorch API.
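For example, a quick check from the same Python environment that runs the conversion and solver scripts:
import numpy
import torch
import transformers

# Print the versions actually visible to the scripts.
print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("numpy", numpy.__version__)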
Here is the content of your requirements.txt:
numpy>=1.24.4
sentencepiece>=0.1.98
transformers>=4.33.2
-e ./gguf-py
-e ./powerinfer-py
Here are my package versions:
"numpy 1.26.2 sentencepiece 0.1.99 transformers 4.36.2 "
I tested the code around the error shown below, and I believe it's some kind of PyTorch incompatibility.
# Load and sort activation data for each layer
freq = torch.load(f"{activation_path}/activation_{i}.pt")
freq, _ = torch.sort(freq, descending=True)
We assumed freq is a tensor, and it is in our environment with PyTorch 2.1.2. But if PyTorch loads freq as an OrderedDict, it can break things. So, can you try with the same PyTorch version and see if the bug still exists?
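If it helps, a minimal check like the following (the path is illustrative, taken from your solver log) shows what torch.load actually returns for one activation file:
import torch

# Illustrative path from the solver log above; adjust as needed.
path = "./ReluFalcon-40B-PowerInfer-GGUF/activation/activation_0.pt"
obj = torch.load(path, map_location="cpu")
print(type(obj))  # expected: torch.Tensor

if isinstance(obj, torch.Tensor):
    freq, _ = torch.sort(obj, descending=True)  # same call the solver makes
    print(freq.shape, freq.dtype)
else:
    # An OrderedDict here means the file does not hold the raw activation tensor
    # the solver expects, which would reproduce the torch.sort error above.
    print("unexpected payload:", type(obj).__name__)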
My PyTorch version is also 2.1.2, as shown in the following picture. And when I run with the LLaMA-13B model, this problem never appears.
Hmm. Were the activation files corrupted or manually renamed at some point? They should have the same format for all model architectures. I would suggest purging and redownloading all these files to make sure everything is clean and as expected.
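If it helps, after redownloading you could sanity-check every activation file before re-running the solver (paths assumed to match the layout from your log):
import glob
import torch

# Confirm each activation_<i>.pt loads as a plain tensor, not a dict.
for path in sorted(glob.glob("./ReluFalcon-40B-PowerInfer-GGUF/activation/activation_*.pt")):
    obj = torch.load(path, map_location="cpu")
    ok = isinstance(obj, torch.Tensor)
    print(path, "OK" if ok else f"unexpected type: {type(obj).__name__}")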