Where can I download the predictor of Relu-Falcon-40B (float16)?

Open · chenglimin opened this issue 1 year ago · 7 comments

Are the predictors at https://huggingface.co/PowerInfer/ReluFalcon-40B-Predictor for Relu-Falcon-40B (float16) or for Relu-Falcon-40B (int4)? If they are int4, where can I download the predictors for Relu-Falcon-40B (float16)?

chenglimin · Jan 15 '24 09:01

Yes. All predictors we published are in FP16. To use them with an FP16 model, you can convert the model and predictor into PowerInfer GGUF as described in our README.

If you want to run an INT4-quantized model + predictor, you can quantize the generated FP16 model, and the predictor will be quantized at the same time.

hodlen · Jan 22 '24 19:01

I converted the model and predictor of Falcon-40B into PowerInfer GGUF as described in your README and kept the directory layout shown there. However, it fails with the following error and does not offload anything to the GPU:

ggml_cuda_set_main_device: using device 0 (NVIDIA A100 80GB PCIe) as main device
llm_load_sparse_model_tensors: mem required = 62456.22 MB
llm_load_sparse_model_tensors: VRAM used: 23903.56 MB
...................................................................................................
invoking powerinfer Python module to generate gpu split for 55896.81 MiB of VRAM
solver args: Namespace(activation='./ReluFalcon-40B-PowerInfer-GGUF/activation', neuron=32768, capacity=1788696, layer=60, vram_capacity=58612056064, batch=256, threshold=0, output='./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf.generated.gpuidx')
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/chenglimin/speedup/PowerInfer/powerinfer-py/powerinfer/__main__.py", line 25, in <module>
    solved = solve_gpu_split(
             ^^^^^^^^^^^^^^^^
  File "/home/chenglimin/speedup/PowerInfer/powerinfer-py/powerinfer/solver.py", line 23, in solve_gpu_split
    freq, _ = torch.sort(freq, descending=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: sort() received an invalid combination of arguments - got (collections.OrderedDict, descending=bool), but expected one of:
 * (Tensor input, *, bool stable, int dim, bool descending, tuple of Tensors out)
 * (Tensor input, int dim, bool descending, *, tuple of Tensors out)
 * (Tensor input, *, bool stable, name dim, bool descending, tuple of Tensors out)
 * (Tensor input, name dim, bool descending, *, tuple of Tensors out)

llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
............................................................ done (3.64 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU

chenglimin · Jan 24 '24 03:01

Can you confirm that your PyTorch version aligns with our requirements.txt? It looks like an incompatibility in the PyTorch API.
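
For example, something along these lines (just a generic version check, nothing PowerInfer-specific) run in the same environment that invokes powerinfer-py would print the versions the solver actually sees, including torch:

import numpy
import sentencepiece
import torch
import transformers

# Print the versions visible to this interpreter; torch is the package
# behind the failing torch.sort call, so it is the one to compare first.
print("torch        ", torch.__version__)
print("numpy        ", numpy.__version__)
print("sentencepiece", sentencepiece.__version__)
print("transformers ", transformers.__version__)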

hodlen · Jan 25 '24 13:01

Here is the content of your "requirements.txt":

"numpy>=1.24.4 sentencepiece>=0.1.98 transformers>=4.33.2 -e ./gguf-py -e ./powerinfer-py"

Here are my package versions:

"numpy 1.26.2 sentencepiece 0.1.99 transformers 4.36.2 "

chenglimin · Jan 27 '24 01:01

I tested the code around the error shown below, and I believe it's some kind of PyTorch incompatibility.

# Load and sort activation data for each layer
freq = torch.load(f"{activation_path}/activation_{i}.pt")
freq, _ = torch.sort(freq, descending=True)

We assume freq is a tensor, and it is in our environment with PyTorch 2.1.2. But if PyTorch loads freq as an OrderedDict, it breaks things. So, can you try with the same PyTorch version and see if the bug still exists?
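
If it helps, here is a minimal check, with the activation path taken from the solver args in your log (adjust it to your layout), that shows what torch.load actually returns for one activation file:

import torch

# Load one activation file the same way solver.py does and inspect it.
# A healthy file should load as a plain torch.Tensor; getting an
# OrderedDict here reproduces exactly the TypeError from torch.sort above.
path = "./ReluFalcon-40B-PowerInfer-GGUF/activation/activation_0.pt"
obj = torch.load(path)
print(type(obj))

if isinstance(obj, torch.Tensor):
    print(obj.shape, obj.dtype)
    freq, _ = torch.sort(obj, descending=True)  # the call that fails in solver.py
    print(freq[:10])
else:
    # e.g. a state_dict-like OrderedDict suggests the wrong file was downloaded
    print(list(obj)[:5])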

hodlen · Jan 27 '24 02:01

My PyTorch version is also 2.1.2, as shown in the following picture. And when I run with the LLaMA-13B model, this problem never appears.

(screenshot of the environment showing PyTorch 2.1.2 installed)

chenglimin · Jan 29 '24 02:01

Hmmmmm. Were the activation files corrupted, or manually renamed at some point? They should have the same format for all model architectures. I would suggest purging and re-downloading all of these files to make sure everything is clean and as expected.
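
After re-downloading, a rough sanity sweep like the one below should flag any activation file that still doesn't load as a plain tensor; the layer count (60) and neuron count (32768) are taken from the solver args in your log, and the path is just your local one:

import torch

# Check every layer's activation file; per the solver args above,
# ReluFalcon-40B has 60 layers and 32768 neurons per FFN layer.
activation_path = "./ReluFalcon-40B-PowerInfer-GGUF/activation"
for i in range(60):
    obj = torch.load(f"{activation_path}/activation_{i}.pt")
    if not isinstance(obj, torch.Tensor):
        print(f"layer {i}: unexpected type {type(obj).__name__}")
    elif obj.numel() != 32768:
        print(f"layer {i}: unexpected size {tuple(obj.shape)}")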

hodlen · Jan 29 '24 12:01