
What is the test condition of Figure 18 in your paper?

Open chenglimin opened this issue 1 year ago • 12 comments

May I ask what the output length was for the experimental results in Figure 18 of your paper? The paper mentions only that the input lengths were 1 and 64, respectively, and that the batch size was 1, but it does not mention the output length. Could you please provide your test conditions?

chenglimin avatar Jan 11 '24 07:01 chenglimin

Thanks for your interest in our work! We tested with an output length of 128.

hodlen avatar Jan 11 '24 09:01 hodlen


Can PowerInfer support API server execution mode?

chenglimin avatar Jan 11 '24 14:01 chenglimin

What are the dtype and activation function of Falcon-40B and OPT-30B when you evaluate vLLM on the A100? As far as I know, vLLM does not support ReLU, so where did you obtain the parameters of Falcon-40B and OPT-30B for running on vLLM? Can you paste a download link?

chenglimin avatar Jan 11 '24 15:01 chenglimin

Can PowerInfer support API server execution mode?

If you are referring to an API server, yes. You can use examples/server for that purpose. It's basically the same as in llama.cpp.
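For reference, a minimal sketch of querying such a server from Python, assuming a llama.cpp-style /completion endpoint and a server already started and listening on localhost:8080 (the prompt, port, and model setup are placeholders):

```python
# Minimal sketch: query a PowerInfer API server the same way as llama.cpp's
# examples/server. Assumes the server binary was started separately with a
# PowerInfer model and is listening on localhost:8080 (both are placeholders).
import json
import urllib.request

payload = {
    "prompt": "Explain sparse activation in one sentence.",
    "n_predict": 128,  # output length, matching the paper's test setting
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result.get("content", result))
```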

hodlen avatar Jan 12 '24 03:01 hodlen

What are the dtype and activation function of Falcon-40B and OPT-30B when you evaluate vLLM on the A100? As far as I know, vLLM does not support ReLU, so where did you obtain the parameters of Falcon-40B and OPT-30B for running on vLLM? Can you paste a download link?

We tested ReluFalcon-40B and OPT-30B in FP16 format with the ReLU activation function. vLLM supports the Falcon and OPT architectures, and we only needed to modify Falcon's model config to use ReLU. We used OPT-30B as is, and you can download ReluFalcon-40B from Hugging Face.
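As an illustration of that config change (not our exact script; the key name "activation" is an assumption and may differ between Falcon config versions), patching the downloaded checkpoint's config.json before serving it with vLLM can look like this:

```python
# Rough sketch: patch a local Falcon checkpoint's config.json so its MLP uses
# ReLU before serving it with vLLM.
# NOTE: the key name "activation" is an assumption; check the config of the
# specific Falcon / ReluFalcon checkpoint you downloaded.
import json
from pathlib import Path

config_path = Path("falcon-40b/config.json")  # placeholder local model dir
config = json.loads(config_path.read_text())
config["activation"] = "relu"                 # hypothetical key name
config_path.write_text(json.dumps(config, indent=2))
```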

hodlen avatar Jan 12 '24 03:01 hodlen

Where can I download the predictor of the OPT-30B model?

chenglimin avatar Jan 12 '24 06:01 chenglimin

Where can I download the predictor of the OPT-30B model?

We have not released the predictors for OPT models yet. The sparse inference implementation (code + predictor) for OPT models in PowerInfer is currently internal: it is reproducible, but not yet ready for open-sourcing. We will release support for OPT models in the near future, so please stay tuned!

In the meantime, you can try to reproduce the predictor using the method from Deja Vu.
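To sketch what that entails (an illustrative outline, not our training code): a Deja Vu-style predictor is a small low-rank MLP trained, per FFN layer, to predict which neurons will be non-zero after ReLU given the layer's input hidden state. Assuming you have already dumped (hidden state, activation mask) pairs from the dense model, the training loop looks roughly like:

```python
# Illustrative sketch of a Deja Vu-style activation predictor for one FFN layer.
# Assumes training data already dumped from the dense model:
#   X: [N, hidden_dim] layer inputs, Y: [N, ffn_dim] binary masks (ReLU output > 0).
import torch
import torch.nn as nn

hidden_dim, ffn_dim, rank = 4096, 16384, 1024   # illustrative sizes

predictor = nn.Sequential(                      # small low-rank MLP
    nn.Linear(hidden_dim, rank, bias=False),
    nn.Linear(rank, ffn_dim),
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()                # per-neuron "active" probability

X = torch.randn(1024, hidden_dim)               # placeholder for dumped hidden states
Y = (torch.randn(1024, ffn_dim) > 0).float()    # placeholder activation masks

for step in range(100):
    loss = loss_fn(predictor(X), Y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference time, neurons whose predicted probability exceeds a threshold
# are treated as active and computed; the rest are skipped.
```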

P.S: sorry for overwriting your comment by mistake🙏

hodlen avatar Jan 12 '24 07:01 hodlen


For the Falcon-40B model, do you run with AWQ or GPTQ quantization?

chenglimin avatar Jan 18 '24 09:01 chenglimin

What is the input length of Figure 13 for PC-High in your paper?

chenglimin avatar Jan 27 '24 04:01 chenglimin

Please refer to the details described in the paper for the most accurate information.

For the Falcon-40B model, do you run with AWQ or GPTQ quantization?

For the Falcon-40B model (as well as all others), we run the INT4 model with GGML's INT4_0 quantization method, not AWQ or GPTQ.
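For readers unfamiliar with GGML-style quantization: it is a blockwise scheme in which each block of 32 weights shares one scale and stores 4-bit values. A simplified round-trip illustration in Python (this mirrors the idea only, not GGML's exact bit layout or rounding rule):

```python
# Simplified illustration of 4-bit block quantization in the spirit of GGML's
# 4-bit format: one scale per block of 32 weights, 4-bit values with an offset.
# This mirrors the idea only, not GGML's exact bit layout or rounding rule.
import numpy as np

BLOCK = 32

def quantize_blockwise(w):
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 8.0  # one scale per block
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale) + 8, 0, 15).astype(np.uint8)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.astype(np.float32) - 8) * scale

w = np.random.randn(4 * BLOCK).astype(np.float32)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s).reshape(-1)
print("max abs error:", np.abs(w - w_hat).max())
```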

What is the input length of Figure 13 for PC-High in your paper?

The input length mentioned in our paper refers to the number of tokens in the input prompt we used.

hodlen avatar Jan 27 '24 13:01 hodlen

For Falcon-40B, when you compare against vLLM in Figure 18, the paper mentions that vLLM is evaluated on a single A100 (80G) GPU. But when Falcon-40B (FP16) is run directly on vLLM, GPU memory is insufficient. So for the Falcon-40B results on vLLM in Figure 18, are you using INT4?

What is the input length of Figure 13 for PC-High in your paper?

The input length mentioned in our paper refers to the number of tokens in the input prompt we used.

Sorry, I may not have made it clear. What I want to ask is: what values of -t, -p and -n were set when PC-High was tested in Figure 13 of the paper? How many tokens were in the input prompt you used?

chenglimin avatar Jan 28 '24 13:01 chenglimin

For Falcon-40B, when you compare against vLLM in Figure 18, the paper mentions that vLLM is evaluated on a single A100 (80G) GPU. But when Falcon-40B (FP16) is run directly on vLLM, GPU memory is insufficient. So for the Falcon-40B results on vLLM in Figure 18, are you using INT4?

Nice catch! That is a testing detail we could not elaborate on due to the page limit. An A100 (80G) certainly cannot hold the entire model, so in that test setting we skipped the last transformer layer for all participating systems. This way the A100's VRAM was just fully utilized, and the model stayed in FP16 format.
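To make that concrete (an illustrative sketch, not our exact script; the attribute path transformer.h follows the Hugging Face Falcon implementation and the output directory is a placeholder), dropping the final block of an FP16 Falcon checkpoint before serving could look like:

```python
# Rough sketch: drop the final transformer block from a Falcon checkpoint so the
# FP16 model fits on a single A100 (80G), then save the truncated copy.
# The attribute path transformer.h follows the Hugging Face Falcon implementation
# and may differ elsewhere; the output directory is a placeholder.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b", torch_dtype="auto", trust_remote_code=True
)
model.transformer.h = model.transformer.h[:-1]         # remove the last block
model.config.num_hidden_layers = len(model.transformer.h)
model.save_pretrained("falcon-40b-minus-last-layer")   # placeholder output dir
```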

What I want to ask is: what values of -t, -p and -n were set when PC-High was tested in Figure 13 of the paper? How many tokens were in the input prompt you used?

It aligns with all the other tests in our paper: 8 threads (-t), any prompt with 8 tokens (-p), and the output length (-n) indicated on the X axis for each case.
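Putting those flags together, one data point of such a run is just an ordinary invocation of the inference binary. A sketch (wrapped in Python for consistency with the snippets above; the binary name follows llama.cpp's main, and the model path and prompt are placeholders):

```python
# Illustrative reproduction of one Figure 13-style data point: 8 threads (-t),
# an 8-token prompt (-p), and one output length (-n) from the figure's X axis.
# The binary and model path are placeholders; flags follow llama.cpp's main.
import subprocess

cmd = [
    "./main",
    "-m", "model-q4_0.gguf",                          # placeholder model file
    "-t", "8",                                        # 8 CPU threads
    "-p", "one two three four five six seven eight",  # any prompt with 8 tokens
    "-n", "128",                                      # output length for this point
]
subprocess.run(cmd, check=True)
```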

hodlen avatar Jan 29 '24 12:01 hodlen